
AWS DATA ENGINEERING VIRTUAL INTERNSHIP

Internship report submitted in partial fulfilment of requirements for the


award of degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

by
SAMMATHAMU SESHANK (20131A05L9)
Under the esteemed guidance of
Course Coordinator: Dr. Ch. Sita Kumari, Associate Professor
Internship Mentor: Mr. R. Siva Kumar, Assistant Professor

Department of Computer Science and Engineering


GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
(AUTONOMOUS)

(Affiliated to JNTU-K, Kakinada)


VISAKHAPATNAM
2023– 2024

Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam

CERTIFICATE
This report on
“AWS DATA ENGINEERING VIRTUAL INTERNSHIP”
is a bonafide record of the Internship work submitted
by
SAMMATHAMU SESHANK (20131A05L9)
In their VIII semester in partial fulfilment of the requirements for the Award
of Degree of
Bachelor of Technology in Computer Science and Engineering

During the academic year 2023-2024

Course Coordinator: Dr. Ch. Sita Kumari, Associate Professor
Head of the Department: Dr. D. Uma Devi, Associate Professor & I/C HOD

Internship Mentor: Mr. R. Siva Kumar, Assistant Professor

ACKNOWLEDGEMENT

We would like to express our deep sense of gratitude to our esteemed institute
Gayatri Vidya Parishad College of Engineering (Autonomous), which has provided us
an opportunity to fulfil our cherished desire.

We thank our Course Coordinator, Dr. Ch. Sita Kumari, Associate Professor,


Department of Computer Science and Engineering, for the kind suggestions and
guidance for the successful completion of our internship.

We thank our Internship Mentor, Mr. R. Siva Kumar, Assistant Professor,


Department of Computer Science and Engineering, for the kind suggestions and
guidance for the successful completion of our internship.

We are highly indebted to Dr. D. UMA DEVI, Associate Professor, Incharge


Head of the Department of Computer Science and Engineering and the Department of
Computer Science and Engineering (AI-ML), Gayatri Vidya Parishad College of
Engineering (Autonomous), for giving us an opportunity to do the internship in the
college.

We express our sincere thanks to our Principal Dr. A. B. KOTESWARA RAO,


Gayatri Vidya Parishad College of Engineering (Autonomous), for his encouragement
during this internship and for giving us a chance to explore and learn new
technologies.

Finally, we are indebted to the teaching and non-teaching staff of the Computer
Science and Engineering Department for all their support in the completion of our
internship.

SAMMATHAMU SESHANK (20131A05L9)

ABSTRACT

Data engineering is the practice of designing and building systems that collect, store,
and analyse raw data so that conclusions can be drawn from it. It relies on a variety of
software tools, ranging from spreadsheets, data visualization and reporting tools, and
data mining programs to open-source programming languages for more advanced data
manipulation. Its main activities are data collection, data storage, data preprocessing,
and data visualisation.

In Course 1 we learnt about cloud computing. Cloud computing is the on-demand
delivery of compute power, database, storage, applications, and other IT resources via
the internet with pay-as-you-go pricing. These resources run on server computers that
are located in large data centers in different locations around the world. When you use
a cloud service provider such as AWS, that service provider owns the computers that
you are using. This course deals with the main concepts of compute services, storage
services, management services, database services, security and compliance services,
and AWS cost management services.

In Course 2 we learnt data engineering, which deals with turning raw data into
solutions. The course introduces big data, the foremost input for data analysis, together
with the problem of storing it: we learnt about the different tools used for data storage
and how to analyse and preprocess big data at scale. The main concepts include storage
and analytics on AWS, covering Amazon S3, Amazon Athena, Amazon Redshift,
AWS Glue, Amazon SageMaker, and AWS IoT Analytics.

Index

Sl. No   Topic name

COURSE-1: CLOUD FOUNDATIONS

1  Introduction to cloud computing
   1.1 What is cloud computing
   1.2 Traditional computing vs cloud computing
   1.3 Introduction to AWS
   1.4 AWS CAF
2  Cloud economics and billing
   2.1 What is AWS
   2.2 Paying for resources in AWS
   2.3 AWS TCO
   2.4 AWS Organizations
3  AWS global infrastructure
   3.1 AWS infrastructure
   3.2 AWS foundational services
4  AWS cloud security
   4.1 AWS cloud security
   4.2 AWS shared responsibility model
   4.3 IAM
   4.4 Security services
5  Networking and content delivery
   5.1 Networking basics
   5.2 Amazon VPC
   5.3 Amazon Route 53
   5.4 Amazon CloudFront
6  Compute
   6.1 Compute services
   6.2 Amazon EC2
   6.3 Container services
7  Storage
   7.1 Amazon EBS
   7.2 Amazon S3
   7.3 Amazon EFS
   7.4 Amazon S3 Glacier
8  Databases
   8.1 Relational database services
   8.2 Cloud architecture
   8.3 AWS Trusted Advisor
   8.4 Automatic scaling and monitoring
9  Cloud architecture
   9.1 Cloud architects
   9.2 Reliability and availability
10 Auto scaling and monitoring
   10.1 Elastic Load Balancing
   10.2 Amazon CloudWatch
   10.3 Amazon EC2 Auto Scaling

COURSE-2: DATA ENGINEERING

11 Introduction to data engineering
12 Data-driven organizations
13 The elements of data
14 Design principles and patterns for data pipelines
15 Securing and scaling the data pipeline
16 Ingesting and preparing data
17 Ingesting by batch or by stream
18 Storing and organizing data
19 Processing big data
20 Processing data for ML
21 Analyzing and visualizing data
22 Automating the pipeline
23 Labs
24 Case study
25 Conclusion
26 Reference links
CLOUD FOUNDATIONS

1. INTRODUCTION TO CLOUD COMPUTING

1.1 What is cloud computing?

Cloud computing is the on-demand delivery of compute power, database, storage,


applications, and other IT resources via the internet with pay-as-you-go pricing. These
resources run on server computers that are located in large data centers in different
locations around the world. When you use a cloud service provider like AWS, that
service provider owns the computers that you are using. These resources can be used
together like building blocks to build solutions that help meet business goals and satisfy
technology requirements.

The service models provided by cloud computing are:

• Infrastructure as a Service (IaaS)

• Platform as a Service (PaaS)

• Software as a Service (SaaS)

1.2 Differences between traditional computing and cloud computing.

Traditional Computing model

1. Infrastructure as hardware

2. Hardware Solutions:

i. Require space, staff, physical security, planning, and capital expenditure

ii. Have a long hardware procurement cycle

iii. Require you to provision capacity by guessing theoretical maximum peaks

Cloud computing model

•Infrastructure as software

•Software solutions:
1. Are flexible

2. Can change more quickly, easily, and cost-effectively than hardware solutions

3. Eliminate the undifferentiated heavy-lifting tasks

Fig-1.2.1 Cloud Service models

Advantages of cloud computing:

• Trade capital expense for variable expense
• Benefit from massive economies of scale
• Stop guessing capacity
• Increase speed and agility
• Go global in minutes
• Stop spending money on running and maintaining data centers

1.3 Introduction to AWS (Amazon Web Services)

Amazon Web Services (AWS) is a secure cloud platform that offers a broad set of global,
cloud-based products. Because these products are delivered over the internet, you have
on-demand access to the compute, storage, network, database, and other IT resources
that you might need for your projects, and the tools to manage them. AWS offers
flexibility. Your AWS environment can be reconfigured and updated on demand, scaled
up or down automatically to meet usage patterns and optimize spending, or shut down
temporarily or permanently.

Fig-1.3.1 Services covered in the course

1.4 AWS cloud adoption framework (AWS CAF)

AWS CAF provides guidance and best practices to help organizations build a
comprehensive approach to cloud computing across the organization and throughout the
IT lifecycle to accelerate successful cloud adoption.

1. AWS CAF is organized into six perspectives.


2. Perspectives consist of sets of capabilities.

2. CLOUD ECONOMICS AND BILLING

2.1 What is AWS?


AWS is designed to allow application providers, ISVs, and vendors to quickly and
securely host your applications – whether an existing application or a new SaaS-based
application. You can use the AWS Management Console or well-documented web
services APIs to access AWS's application hosting platform.

2.2 How do we pay for the resources used in AWS?

AWS realizes that every customer has different needs. If none of the AWS pricing
models work for your project, custom pricing is available for high-volume projects with
unique requirements. The following rules apply to paying for AWS resources:
•Pay for what you use.
•Start and stop anytime.
•No long-term contracts are required.
•There is no charge (with some exceptions) for:
  - Inbound data transfer.
  - Data transfer between services within the same AWS Region.
•Some services are free, but the other AWS services that they provision might not be
free.
2.3 What is TCO?
Total Cost of Ownership (TCO) is a financial estimate that helps identify the direct and
indirect costs of a system.

Why use TCO?


• To compare the costs of running an entire infrastructure environment or specific
workload on- premises versus on AWS
• To budget and build the business case for moving to the cloud.

Use the AWS Pricing Calculator to:

•Estimate monthly costs

•Identify opportunities to reduce monthly costs

•Model your solutions before building them


•Explore price points and calculations behind your estimate
•Find the available instance types and contract terms that meet your needs

•Name your estimate and create and name groups of services

2.4 AWS organizations:


AWS Organizations is a free account management service that enables you to consolidate
multiple AWS accounts into an organization that you create and centrally manage. AWS
Organizations includes consolidated billing and account management capabilities that
help you to better meet the budgetary, security, and compliance needs of your business.

3. AWS GLOBAL INFRASTRUCTURE

The AWS Global Infrastructure is designed and built to deliver a flexible, reliable,
scalable, and secure cloud computing environment with high-quality global network
performance. The AWS Cloud infrastructure is built around regions.

3.1 AWS infrastructure features:


Elasticity and scalability
•Elastic infrastructure; dynamic adaptation of capacity

•Scalable infrastructure; adapts to accommodate growth


Fault-tolerance

•Continues operating properly in the presence of a failure


•Built-in redundancy of components

High availability

•High level of operational performance


• Minimized downtime

•No human intervention

Fig-3.1.1 Foundational services

4. AWS CLOUD SECURITY

4.1 AWS cloud security:


Cloud security is a collection of security measures designed to protect cloud-based
infrastructure, applications, and data. AWS provides security services that help you
protect your data, accounts, and workloads from unauthorized access.

4.2 AWS shared responsibility model:


This shared model can help relieve the customer's operational burden, as AWS operates,
manages, and controls the components from the host operating system and
virtualization layer down to the physical security of the facilities in which the services
run. AWS's responsibility under this model is protecting the infrastructure that runs all
the services offered in the AWS Cloud. This infrastructure is composed of the hardware,
software, networking, and facilities that run AWS Cloud services.

4.3 IAM:
With AWS Identity and Access Management (IAM), you can specify who or what can
access services and resources in AWS. IAM is a web service that helps you securely
control access to AWS resources. We use IAM to control who is authenticated and
authorized to use resources.
There are two types of IAM policies:
1. AWS managed policies
2. Customer managed policies
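As an illustration of a customer managed policy, here is a minimal boto3 sketch, assuming AWS credentials are already configured; the policy name, user name, and bucket ARN are hypothetical placeholders rather than values from the course.

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative customer managed policy allowing read-only access to one
# (hypothetical) S3 bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        }
    ],
}

# Create the customer managed policy and attach it to an existing IAM user.
response = iam.create_policy(
    PolicyName="ExampleS3ReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_user_policy(
    UserName="example-user",
    PolicyArn=response["Policy"]["Arn"],
)
```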

Securing a new AWS account:


1. Safeguard your password and access keys
2. Activate multi-factor authentication on AWS account root user and any users with
interactive access to IAM
3. Limit AWS account root user access to your resources

4.4 Security accounts:


We need to secure accounts because if a hacker cracks your password, they could gain
access to social media accounts, bank accounts, e-mail, and all other accounts that
hold personal data. If someone obtains access to this information, you could become
the victim of identity theft. Make your account more secure by:

•Update account recovery options


•Remove risky access to your data

5. NETWORKING AND CONTENT DELIVERY

5.1 Networking basics


A computer network is two or more client machines that are connected together to share
resources. A network can be logically partitioned into subnets. Networking requires a
networking device (such as a router or switch) to connect all the clients together and
enable communication between them. Each client machine in a network has a unique
Internet Protocol (IP) address that identifies it. A 32-bit IP address is called an IPv4
address. A 128-bit IP address is called an IPv6 address.

Fig-5.1.1 OSI model

5.2 Amazon VPC:


Amazon Virtual Private Cloud (Amazon VPC) is a service that lets you provision
a logically isolated section of the AWS Cloud (called a virtual private cloud, or VPC)
where you can launch your AWS resources.

5.3 Amazon route 53:


Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS)
web service. It is designed to give developers and businesses a reliable and cost-
effective way to route users to internet applications by translating names (like
www.example.com) into the numeric IP addresses (like 192.0.2.1) that computers use
to connect to each other.

Amazon Route 53 supports several types of routing policies, which determine how
Amazon Route 53 responds to queries:

•Simple routing (round robin)

•Weighted round robin routing

•Latency routing (LBR)


• Geolocation routing
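As a small illustration of weighted round robin routing, here is a hedged boto3 sketch; the hosted zone ID, record name, and IP addresses are placeholders, not real resources.

```python
import boto3

route53 = boto3.client("route53")

# Create two weighted A records for the same name so that roughly 70% of
# DNS queries resolve to one endpoint and 30% to the other.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary",
                    "Weight": 70,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "192.0.2.10"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary",
                    "Weight": 30,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "192.0.2.20"}],
                },
            },
        ]
    },
)
```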

5.4 Amazon CloudFront


Amazon CloudFront is a fast CDN service that securely delivers data, videos,
applications, and application programming interfaces (APIs) to customers globally with
low latency and high transfer speeds. Amazon CloudFront is a self-service offering with
pay-as-you-go pricing.

6. COMPUTE

6.1 Compute services overview


Amazon Web Services (AWS) offers many compute services. Here is a brief
summary of what each compute service offers:
•Amazon Elastic Compute Cloud (Amazon EC2) provides resizable virtual machines.
•Amazon Elastic Container Registry (Amazon ECR) is used to store and retrieve
Docker images.
•VMware Cloud on AWS enables you to provision a hybrid cloud without custom
hardware.
•AWS Elastic Beanstalk provides a simple way to run and manage web applications.
•AWS Lambda is a serverless compute solution. You pay only for the compute time
that you use.
•Amazon Elastic Kubernetes Service (Amazon EKS) enables you to run managed
Kubernetes on AWS.
•Amazon Lightsail provides a simple-to-use service for building an application or
website.
•AWS Batch provides a tool for running batch jobs at any scale.
•AWS Outposts provides a way to run select AWS services in your on-premises data
center.

Fig-6.1.1 Compute services overview

6.2 Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2):

• Provides virtual machines—referred to as EC2 instances—in the cloud.

• Gives you full control over the guest operating system (Windows or Linux)
on each instance.
• You can launch instances of any size into an Availability Zone anywhere in the
world.

• Launch instances from Amazon Machine Images (AMIs).

• Launch instances with a few clicks or a line of code, and they are ready in
minutes.

• You can control traffic to and from instances.
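To make this concrete, here is a minimal boto3 sketch of launching and then stopping an instance; the AMI ID, key pair, and security group are placeholders, not values from the course or labs.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single small instance from an AMI.
# The AMI ID, key pair name, and security group ID are placeholders.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="example-keypair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "demo-instance"}],
        }
    ],
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched", instance_id)

# Stop the instance when it is no longer needed so that compute charges stop.
ec2.stop_instances(InstanceIds=[instance_id])
```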

6.3 Container services:


Containers are a method of operating system virtualization. Benefits are:

• Repeatable.

• Self-contained environments.

• Software runs the same in different environments.

• Developer's laptop, test, production.

• Faster to launch and stop or terminate than virtual machines.

7. STORAGE
7.1 Amazon EBS:
Amazon EBS provides persistent block storage volumes for use with Amazon
EC2 instances. Persistent storage is any data storage device that retains data after
power to that device is shut off. It is also sometimes called non-volatile storage.
Each Amazon EBS volume is automatically replicated within its Availability
Zone to protect you from component failure. It is designed for high availability
and durability. Amazon EBS volumes provide the consistent and low- latency
performance that is needed to run your workloads.

7.2 Amazon S3:


Amazon S3 is object storage that is built to store and retrieve any amount of
data from anywhere: websites and mobile apps, corporate applications, and
data from Internet of Things (IoT) sensors or devices. Amazon S3 is object-level
storage, which means that if you want to change a part of a file, you must
make the change and then re-upload the entire modified file. Amazon S3 stores
data as objects within resources that are called buckets. The data that you store
in Amazon S3 is not associated with any particular server, and you do not need
to manage any infrastructure yourself. You can put as many objects into Amazon
S3 as you want. Amazon S3 holds trillions of objects and regularly peaks at
millions of requests per second.
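A brief boto3 sketch of the basic object operations described above; the bucket name and keys are illustrative assumptions (bucket names must be globally unique).

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-bucket"  # placeholder bucket name

# Upload a small object: the key is the object's unique identifier in the bucket.
s3.put_object(
    Bucket=bucket,
    Key="raw/2024/01/records.csv",
    Body=b"id,value\n1,42\n2,17\n",
)

# Download the same object and read its contents.
obj = s3.get_object(Bucket=bucket, Key="raw/2024/01/records.csv")
print(obj["Body"].read().decode("utf-8"))

# List objects under a prefix, similar to listing files in a folder.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="raw/2024/")
for item in listing.get("Contents", []):
    print(item["Key"], item["Size"])
```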

7.3 Amazon EFS:


Amazon EFS implements storage for EC2 instances that multiple virtual
machines can access at the same time. It is implemented as a shared file system
that uses the Network File System (NFS) protocol. Amazon Elastic File
System (Amazon EFS) provides simple, scalable, elastic file storage for use
with AWS services and on-premises resources. It offers a simple interface that
enables you to create and configure file systems quickly and easily. Amazon
EFS is built to dynamically scale on demand without disrupting applications—
it will grow and shrink automatically as you add and remove files. It is
designed so that your applications have the storage they need, when they need
it.

8. Databases
8.1 Relational database service:

Amazon relational database service (Amazon RDS) is a collection of


managed services that make it simple to set up, operate, and scale databases
in the cloud. To address the challenges of running an unmanaged, standalone
relational database, AWS provides a service that sets up, operates, and scales
the relational database without any ongoing administration.
Amazon RDS provides cost-efficient and resizable capacity, while
automating time-consuming administrative tasks. Amazon RDS enables you
to focus on your application, so you can give applications the performance,
high availability, security, and compatibility that they need. With Amazon
RDS, your primary focus is your data and optimizing your application.
8.2 Cloud architecture:
•Engage with decision makers to identify the business goal and the capabilities
that need improvement
•Ensure alignment between technology deliverables of a solution and the
business goals.
• Work with delivery teams that are implementing the solution to ensure that
the technology features are appropriate
•A guide for designing infrastructures that are:
✓ Secure
✓ High performing
✓ Resilient
✓ Efficient
•A consistent approach to evaluating and implementing cloud architectures
•A way to provide best practices that were developed through lessons learned
by reviewing customer architectures

8.3 AWS Trusted Advisor:

AWS Trusted Advisor is an online tool that provides real-time guidance to help
you provision your resources following AWS best practices.
1. Cost Optimization: AWS Trusted Advisor looks at your resource use and
makes recommendations to help you optimize cost by eliminating unused and
idle resources, or by making commitments to reserved capacity.
2. Performance: Improve the performance of your service by checking your
service limits, ensuring you take advantage of provisioned throughput, and
monitoring for overutilized instances.
3. Security: Improve the security of your application by closing gaps, enabling
various AWS security features, and examining your permissions.
4. Fault Tolerance: Increase the availability and redundancy of your AWS
application

8.4 Automatic scaling and monitoring:

Elastic Load Balancing:

Distributes incoming application or network traffic across multiple targets in


a single Availability Zone or across multiple Availability Zones.
Scales your load balancer as traffic to your application changes over time.

Types of load balancers:


1. Application load balancer

2. Classic load balancer


3. Network load balancer

9. CLOUD ARCHITECTURE
9.1 Cloud architects:
•Engage with decision makers to identify the business goal and the capabilities
that need improvement.
•Ensure alignment between technology deliverables of a solution and the
business goals.
• Work with delivery teams that are implementing the solution to ensure that the
technology features are appropriate
•A guide for designing infrastructures that are:
✓ Secure

✓ High performing

✓ Resilient

9.2 Reliability and Availability:

Fig-9.2.1 Reliability and Availability

10. AUTOMATIC SCALING AND MONITORING

10.1 Elastic load balancing:


Distributes incoming application or network traffic across multiple targets in
a single Availability Zone or across multiple Availability Zones.
Types of load balancers:
1. Application load balancer

2. Classic load balancer

3. Network load balancer

Elastic Load Balancing use cases:


1. Highly available and fault-tolerant applications
2. Containerized applications
3. Elasticity and scalability

Fig-10.1.1 Elastic load balance

Fig-10.1.2 Load balancers

10.2 Amazon CloudWatch:


Amazon CloudWatch helps you monitor your AWS resources and the
applications that you run on AWS in real time.

10.3 Amazon EC2 Auto Scaling:

Scaling is the ability to increase or decrease the compute capacity of your
application. Amazon EC2 Auto Scaling helps you maintain application availability by
automatically adding or removing EC2 instances according to conditions that you define.
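To illustrate, here is a hedged boto3 sketch of a target tracking scaling policy plus a CloudWatch alarm; the Auto Scaling group name and thresholds are placeholders rather than values from the course.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Attach a target tracking scaling policy to an existing Auto Scaling group
# so that instances are added or removed to keep average CPU near 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-web-asg",  # placeholder group name
    PolicyName="keep-cpu-at-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)

# A standalone CloudWatch alarm can also watch the same group for high CPU.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="example-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "example-web-asg"}],
)
```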

DATA ENGINEERING
11. Introduction to Data Engineering

Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats. These
systems empower people to find practical applications of the data, which
businesses can use to thrive.
Data engineering is a skill that is in increasing demand. Data engineers are the
people who design the system that unifies data and can help you navigate it.
Data engineers perform many different tasks including:

• Acquisition: Finding all the different data sets around the business
• Cleansing: Finding and cleaning any errors in the data
• Conversion: Giving all the data a common format
• Disambiguation: Interpreting data that could be interpreted in multiple
ways
• Deduplication: Removing duplicate copies of data

Once this is done, data may be stored in a central repository such as a data
lake or data lakehouse. Data engineers may also copy and move subsets of data
into a data warehouse.

12. Data Driven Organizations
The Genesis of Data-Driven Organizations:
The genesis of data-driven organizations can be traced back to the dawn of
the information age, where the advent of technology ushered in an era of
unprecedented data proliferation. Initially viewed as mere byproducts of digital
transactions, data soon evolved into a strategic asset, offering insights into
consumer behavior, market trends, and operational efficiency. Organizations began
to recognize the intrinsic value of data, laying the foundation for a paradigm shift
in decision-making – from intuition-driven to data-driven.

The Pillars of Data-Driven Culture:


At the heart of every data-driven organization lies a culture that embraces
data as a strategic imperative. This culture is built upon three foundational pillars:
data literacy, data transparency, and data-driven decision-making. Data literacy
entails equipping employees with the skills and knowledge needed to understand,
interpret, and analyze data effectively. Meanwhile, data transparency fosters a
culture of openness and accountability, where data is accessible, reliable, and
trustworthy. Finally, data-driven decision-making empowers stakeholders to
leverage data insights in every aspect of their decision-making process, driving
innovation and agility across the organization.

The Role of Technology in Data-Driven Transformation:


Technology serves as the backbone of data-driven transformation, providing
the infrastructure, tools, and platforms needed to harness the power of data
effectively. From advanced analytics and artificial intelligence to cloud computing
and big data technologies, organizations leverage a myriad of technological
innovations to unlock the full potential of their data assets. Moreover, the rise of
data management systems, such as data lakes and data warehouses, enables
organizations to consolidate, integrate, and analyze vast volumes of data in real-
time, facilitating informed decision-making at scale.

Challenges and Opportunities in the Data-Driven Journey:


Despite the promises of data-driven transformation, organizations face a
myriad of challenges on their journey towards becoming truly data-driven. These
challenges range from data silos and legacy systems to data privacy and security
concerns. Moreover, cultural resistance and organizational inertia often pose
significant barriers to adoption, hindering the realization of the full potential of
data-driven initiatives. However, amidst these challenges lie boundless
opportunities for innovation, differentiation, and competitive advantage.
Organizations that embrace the data-driven mindset, foster a culture of
experimentation, and invest in the right technologies stand poised to thrive in the
digital age.

The Future of Data-Driven Organizations:


As we gaze into the horizon of the future, the trajectory of data-driven
organizations appears both promising and uncertain. Rapid advancements in
technology, coupled with evolving regulatory landscapes, continue to shape the
contours of the data-driven journey. Moreover, the democratization of data and the
rise of citizen data scientists herald a new era of empowerment, where insights are
no longer confined to the realm of data experts but are accessible to all. Yet, amidst
this uncertainty, one thing remains clear – data will continue to be the driving force
behind organizational innovation, disruption, and transformation.

13. The Elements of Data

The Essence of Data:


At its essence, data embodies information – raw and unrefined, waiting to
be unlocked, interpreted, and harnessed. It comprises a myriad of elements, each
contributing to its richness, complexity, and utility. From structured data,
characterized by its organized format and predefined schema, to unstructured data,
defined by its fluidity and lack of formal structure, the diversity of data elements
mirrors the diversity of human experiences, thoughts, and interactions.

Structured Data:
Structured data represents the foundation of traditional databases,
characterized by its organized format, predictable schema, and tabular structure.
This form of data lends itself well to relational databases, where information is
stored in rows and columns, enabling efficient storage, retrieval, and analysis.
Examples of structured data include transaction records, customer profiles, and
inventory lists, each meticulously organized to facilitate easy access and
manipulation.

Unstructured Data:
In contrast to structured data, unstructured data defies conventional
categorization, encompassing a wide array of formats, including text, images,
videos, and audio recordings. This form of data lacks a predefined schema, making
it inherently more challenging to analyze and interpret. However, within the realm
of unstructured data lies a treasure trove of insights, waiting to be unearthed
through advanced analytics, natural language processing, and machine learning
algorithms.

Semi-Structured Data:

Semi-structured data occupies a unique space between structured and


unstructured data, combining elements of both. While it may possess some
semblance of organization, such as tags or metadata, it lacks the rigid structure of
traditional databases. Examples of semi-structured data include XML files, JSON
documents, and log files, each offering a flexible framework for storing and
exchanging information across disparate systems.
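A tiny illustration of the difference, using invented customer data: the structured record fits a fixed set of columns, while the semi-structured JSON document carries nested and optional fields.

```python
import json

# A structured record: fixed columns, one value per column, ready for a
# relational table.
structured_row = ("C-1001", "Asha", "asha@example.com", "2024-01-15")

# A semi-structured document: the same customer, but with nested and
# optional fields that do not fit a rigid schema.
semi_structured = {
    "customer_id": "C-1001",
    "name": "Asha",
    "contact": {"email": "asha@example.com", "phone": None},
    "tags": ["priority", "newsletter"],       # variable-length list
    "last_order": {"id": "O-77", "items": 3}  # may be absent for new customers
}

print(structured_row)
print(json.dumps(semi_structured, indent=2))
```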

The Lifecycle of Data:


Beyond its structural composition, data traverses a lifecycle – from its
inception to its eventual obsolescence. This lifecycle encompasses five distinct
stages: capture, storage, processing, analysis, and dissemination. At each stage,
data undergoes transformation, refinement, and enrichment, evolving from mere
bytes to actionable insights that drive informed decision-making and strategic
planning.

14. Design Principles and Patterns for Data Pipelines

This exploration delves into the intricacies of data pipeline design, unveiling
the principles and patterns that enable organizations to orchestrate data flows
efficiently, reliably, and at scale.

The Foundation of Data Pipeline Design:


At the heart of effective data pipeline design lies a foundation built upon
three fundamental principles: scalability, reliability, and maintainability.
Scalability ensures that data pipelines can handle increasing volumes of data
without compromising performance or efficiency. Reliability guarantees that data
is processed accurately and consistently, even in the face of failures or disruptions.
Maintainability encompasses the ease with which data pipelines can be modified,
extended, and debugged over time, ensuring their longevity and adaptability in a
rapidly evolving landscape.

Design Patterns for Data Pipelines:


To achieve these principles, data engineers rely on a myriad of design
patterns that encapsulate best practices, strategies, and techniques for building
robust and efficient data pipelines. Among these patterns are:
1. Extract, Transform, Load (ETL): This classic pattern involves extracting data
from various sources, transforming it into a structured format, and loading it into a
destination for analysis or storage. ETL pipelines are well-suited for batch
processing scenarios, where data is collected and processed in discrete chunks (a
minimal sketch follows this list).
2. Event-Driven Architecture: In contrast to batch processing, event-driven
architecture enables real-time data processing and analysis by reacting to events as
they occur. This pattern is ideal for scenarios where immediate insights or actions
are required, such as fraud detection, monitoring, or recommendation systems.
3. Lambda Architecture: Combining the strengths of both batch and stream
processing, the Lambda architecture provides a framework for building robust,
fault-tolerant data pipelines that can handle both historical and real-time data. By
leveraging batch and speed layers in parallel, organizations can achieve
comprehensive insights with low latency.

4. Microservices Architecture: In the realm of distributed systems, microservices
architecture offers a modular approach to building data pipelines, where individual
components or services are decoupled and independently deployable. This pattern
enables greater agility, scalability, and fault isolation, albeit at the cost of increased
complexity.
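To make the ETL pattern above concrete, here is a minimal batch ETL sketch using pandas and boto3; the input file, column names, and bucket are illustrative assumptions, and pandas' Parquet support requires the pyarrow package.

```python
import boto3
import pandas as pd

# Extract: read a raw CSV export (path and columns are illustrative).
raw = pd.read_csv("orders_raw.csv")

# Transform: fix types, drop bad rows, and derive a reporting column.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_id", "order_date"])
clean = clean[clean["amount"] > 0]
clean["order_month"] = clean["order_date"].dt.to_period("M").astype(str)

# Load: write the curated result as Parquet and upload it to S3
# (the bucket name is a placeholder).
clean.to_parquet("orders_clean.parquet", index=False)
boto3.client("s3").upload_file(
    "orders_clean.parquet",
    "example-curated-bucket",
    "curated/orders/orders_clean.parquet",
)
```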

Challenges and Considerations:


Despite the benefits of these design patterns, data pipeline design is not without its
challenges. Organizations must grapple with issues such as data quality, latency,
scalability bottlenecks, and integration complexity. Moreover, as data pipelines
grow in complexity and scale, managing dependencies, orchestrating workflows,
and ensuring end-to-end visibility become increasingly challenging tasks.

15. Securing and Scaling the data pipeline

The Imperative of Data Pipeline Security:


Data pipeline security encompasses a multifaceted approach to safeguarding
data assets throughout their lifecycle – from ingestion to analysis to storage. At its
core, data pipeline security revolves around three key pillars: confidentiality,
integrity, and availability. Confidentiality ensures that data is accessible only to
authorized users, protecting it from unauthorized access or disclosure. Integrity
guarantees that data remains accurate and trustworthy, preventing unauthorized
modifications or tampering. Availability ensures that data is accessible and usable
when needed, minimizing downtime and disruptions.

Securing the Data Pipeline:


To secure the data pipeline effectively, organizations must adopt a layered
approach that encompasses both preventive and detective controls. This includes
implementing encryption mechanisms to protect data in transit and at rest,
enforcing access controls to limit user privileges and permissions, and deploying
monitoring and auditing tools to detect and respond to suspicious activities in real-
time. Additionally, organizations must adhere to industry standards and regulatory
requirements, such as GDPR, HIPAA, and PCI DSS, to ensure compliance and
mitigate legal and reputational risks.

Scaling the data pipeline involves expanding its capacity and capabilities to
accommodate growing volumes of data, users, and workloads. This requires a
strategic approach that encompasses both vertical and horizontal scaling
techniques. Vertical scaling involves adding more resources, such as CPU,
memory, or storage, to existing infrastructure to handle increased demand.
Horizontal scaling, on the other hand, involves distributing workloads across
multiple nodes or instances, enabling parallel processing and improved
performance.

Scalability Considerations:
While scalability unlocks new opportunities for growth and innovation, it
also introduces a host of challenges and considerations. Organizations must
carefully evaluate factors such as data volume, velocity, variety, and veracity, as
well as the underlying infrastructure, architecture, and technology stack. Moreover,
as data pipelines scale in complexity and size, managing dependencies, optimizing
performance, and ensuring fault tolerance become increasingly critical tasks.

Emerging Technologies and Best Practices:


To address these challenges, organizations are turning to emerging
technologies and best practices that offer scalable, secure, and efficient solutions
for data pipeline management. This includes the adoption of cloud-native
architectures, containerization technologies such as Docker and Kubernetes,
serverless computing platforms like AWS Lambda and Google Cloud Functions,
and distributed processing frameworks such as Apache Spark and Apache Flink.
Additionally, organizations are leveraging DevOps practices, automation tools, and
infrastructure-as-code principles to streamline deployment, monitoring, and
management of data pipelines.

16. Ingesting and preparing data

The Foundation of Data Ingestion:


Data ingestion serves as the gateway through which raw data enters the
organizational ecosystem, encompassing the processes and technologies involved
in extracting, transporting, and loading data from various sources into a centralized
repository. At its core, effective data ingestion revolves around three key
objectives: speed, scalability, and reliability. Speed ensures that data is ingested in
a timely manner, enabling real-time or near-real-time analytics. Scalability
guarantees that data pipelines can handle increasing volumes of data without
sacrificing performance or efficiency. Reliability ensures that data is ingested
accurately and consistently, minimizing data loss or corruption.

Strategies for Data Ingestion:


To achieve these objectives, organizations employ a variety of strategies and
technologies for data ingestion, each tailored to the unique requirements and
characteristics of their data ecosystem. This includes:
1. Batch Processing: Batch processing involves ingesting data in discrete chunks
or batches at scheduled intervals. This approach is well-suited for scenarios where
data latency is acceptable, such as historical analysis or batch reporting.
2. Stream Processing: Stream processing enables the ingestion of data in real-
time as it is generated or produced. This approach is ideal for scenarios where
immediate insights or actions are required, such as monitoring, anomaly detection,
or fraud detection.
3. Change Data Capture (CDC): CDC techniques capture and replicate
incremental changes to data sources, enabling organizations to ingest only the
modified or updated data, rather than the entire dataset. This minimizes processing
overhead and reduces latency, making it well-suited for scenarios where data
freshness is critical.

Data Preparation:
Once data is ingested into the organizational ecosystem, it must undergo a process
of preparation to make it suitable for analysis, modeling, or visualization. Data
preparation encompasses a range of activities, including cleaning, transforming,
enriching, and aggregating data to ensure its quality, consistency, and relevance.
This process is often iterative and involves collaboration between data engineers,
data scientists, and domain experts to identify, understand, and address data quality
issues and inconsistencies.

Technologies for Data Preparation:


To facilitate data preparation, organizations leverage a variety of technologies and
tools that automate and streamline the process. This includes:
1. Data Integration Platforms: Data integration platforms provide a unified
environment for orchestrating data ingestion, transformation, and loading tasks
across disparate sources and destinations. These platforms offer features such as
data profiling, data cleansing, and data enrichment to ensure data quality and
consistency.
2. Data Wrangling Tools: Data wrangling tools empower users to visually
explore, clean, and transform data without writing code. These tools offer intuitive
interfaces and built-in algorithms for tasks such as missing value imputation,
outlier detection, and feature engineering, enabling users to prepare data more
efficiently and effectively.
3. Data Preparation Libraries: Data preparation libraries, such as Pandas in
Python or Apache Spark's DataFrame API, provide programmable interfaces for
manipulating and transforming data at scale. These libraries offer a rich set of
functions and transformations for tasks such as filtering, grouping, and joining
data, enabling users to perform complex data preparation tasks with ease.
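As a small illustration of the data preparation tasks just described, here is a hedged pandas sketch; the column names, imputation rule, and filter are assumptions made for the example.

```python
import pandas as pd

# Illustrative raw data with common quality problems: missing values,
# inconsistent casing, and an implausible outlier.
df = pd.DataFrame({
    "customer": ["Asha", "Ben", "ben", None, "Chen"],
    "age": [34, None, 29, 41, 230],
    "spend": [120.0, 80.5, 80.5, 55.0, 410.0],
})

# Cleaning: normalize text, drop rows with no customer, impute missing ages
# with the median, and filter out implausible ages.
df["customer"] = df["customer"].str.strip().str.title()
df = df.dropna(subset=["customer"])
df["age"] = df["age"].fillna(df["age"].median())
df = df[df["age"].between(0, 110)]

# Feature engineering: a simple spend-per-year-of-age ratio.
df["spend_per_year"] = df["spend"] / df["age"]

print(df)
```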

17. Ingesting by Batch or by Stream
Understanding Batch Data Ingestion:
Batch data ingestion involves collecting and processing data in discrete
chunks or batches at scheduled intervals. This approach is well-suited for scenarios
where data latency is acceptable, such as historical analysis, batch reporting, or
periodic updates. Batch data ingestion offers several advantages, including
simplicity, scalability, and fault tolerance. By processing data in predefined
batches, organizations can optimize resource utilization, minimize processing
overhead, and ensure consistent performance even in the face of failures or
disruptions.

Strategies for Batch Data Ingestion:


To implement batch data ingestion effectively, organizations employ a
variety of strategies and technologies tailored to their specific requirements and
use cases. This includes:
1. Extract, Transform, Load (ETL): ETL processes involve extracting data from
various sources, transforming it into a structured format, and loading it into a
destination for analysis or storage. This approach is well-suited for scenarios where
data needs to be cleansed, aggregated, or enriched before further processing.
2. Batch Processing Frameworks: Batch processing frameworks, such as Apache
Hadoop or Apache Spark, provide distributed computing capabilities for
processing large volumes of data in parallel. These frameworks offer features such
as fault tolerance, data locality optimization, and job scheduling, making them
well-suited for batch data ingestion tasks.

Exploring Stream Data Ingestion:


In contrast to batch data ingestion, stream data ingestion involves processing
data in real-time as it is generated or produced. This approach is ideal for scenarios
where immediate insights or actions are required, such as monitoring, anomaly
detection, or fraud detection. Stream data ingestion offers several advantages,
including low latency, continuous processing, and real-time responsiveness. By
ingesting and processing data in real-time, organizations can react to events as they
occur, enabling faster decision-making and proactive intervention.

Strategies for Stream Data Ingestion:
To implement stream data ingestion effectively, organizations leverage a
variety of strategies and technologies that enable real-time data processing and
analysis. This includes:
1. Event-Driven Architectures: Event-driven architectures enable organizations
to ingest and process data in real-time in response to events or triggers. This
approach is well-suited for scenarios where immediate action is required, such as
IoT applications, real-time monitoring, or financial transactions processing.
2. Stream Processing Frameworks: Stream processing frameworks, such as
Apache Kafka or Apache Flink, provide distributed computing capabilities for
processing continuous streams of data in real-time. These frameworks offer
features such as fault tolerance, event time processing, and windowing semantics,
making them well-suited for stream data ingestion tasks.
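As a minimal illustration of stream ingestion, here is a sketch using the kafka-python client; the broker address, topic name, and anomaly rule are assumptions, and error handling is omitted.

```python
import json
from kafka import KafkaConsumer

# Consume events continuously from a Kafka topic and apply a trivial
# real-time check to each record as it arrives.
consumer = KafkaConsumer(
    "events",                                  # assumed topic name
    bootstrap_servers="localhost:9092",        # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Example of an immediate, per-event action (a simple anomaly flag).
    if event.get("amount", 0) > 10_000:
        print("Possible anomaly:", event)
```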

Choosing the Optimal Approach:


The choice between batch and stream data ingestion depends on several
factors, including data latency requirements, processing complexity, resource
constraints, and use case objectives. While batch data ingestion offers simplicity,
scalability, and fault tolerance, stream data ingestion offers low latency, continuous
processing, and real-time responsiveness. Organizations must carefully evaluate
these factors and choose the optimal approach that aligns with their unique
requirements and objectives.

18. Storing and Organizing Data

The Foundation of Data Storage:


Data storage serves as the bedrock upon which the data ecosystem is built,
encompassing the processes and technologies involved in persisting and retrieving
data in a reliable and efficient manner. At its core, effective data storage revolves
around three key objectives: scalability, durability, and accessibility. Scalability
ensures that data storage solutions can accommodate growing volumes of data
without sacrificing performance or reliability. Durability guarantees that data is
protected against loss or corruption, even in the face of hardware failures or
disasters. Accessibility ensures that data is readily available and accessible to
authorized users, regardless of time, location, or device.
1. Relational Databases: Relational databases provide a structured and organized
approach to storing data, using tables, rows, and columns to represent data entities
and relationships. This approach is well-suited for scenarios where data integrity,
consistency, and relational querying capabilities are paramount.
2. NoSQL Databases: NoSQL databases offer a flexible and scalable approach to
storing and querying unstructured or semi-structured data, using document, key-
value, column-family, or graph-based data models. This approach is well-suited for
scenarios where data volumes are large, data schemas are dynamic, or horizontal
scalability is required.
Organizing Data for Efficiency:

In addition to storage considerations, organizing
data effectively is critical to ensuring its usability, discoverability, and
maintainability. Data organization encompasses the processes and methodologies
involved in structuring, categorizing, and indexing data to facilitate efficient
retrieval and analysis. This includes:
1. Data Modeling: Data modeling involves defining the structure, relationships,
and constraints of data entities and attributes, typically using entity-relationship
diagrams, schema definitions, or object-oriented models. This approach helps
ensure data consistency, integrity, and interoperability across the organization.
2. Data Partitioning: Data partitioning involves dividing large datasets into
smaller, more manageable partitions based on certain criteria, such as time,
geography, or key ranges. This approach helps distribute data processing and
storage resources more evenly, improving performance, scalability, and
availability.
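A hedged sketch of time-based partitioning using pandas with the pyarrow engine; the column names and output path are assumptions. Each year/month combination is written to its own directory, which query engines can then prune.

```python
import pandas as pd

# Illustrative event data with a timestamp column.
events = pd.DataFrame({
    "event_id": [1, 2, 3, 4],
    "ts": pd.to_datetime(
        ["2024-01-03", "2024-01-20", "2024-02-05", "2024-03-11"]
    ),
    "value": [10.0, 7.5, 3.2, 9.9],
})

# Derive partition keys from the timestamp.
events["year"] = events["ts"].dt.year
events["month"] = events["ts"].dt.month

# Write one Parquet folder per (year, month) partition, e.g.
# events_partitioned/year=2024/month=1/...
events.to_parquet(
    "events_partitioned",
    engine="pyarrow",
    partition_cols=["year", "month"],
    index=False,
)
```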

Technologies for Data Storage and Organization:
To facilitate data storage and organization effectively, organizations leverage a
variety of technologies and tools that offer scalability, reliability, and flexibility.
This includes:
1. Cloud Storage Services: Cloud storage services, such as Amazon S3, Google
Cloud Storage, or Microsoft Azure Blob Storage, provide scalable and durable
storage solutions for storing and managing data in the cloud. These services offer
features such as encryption, versioning, and lifecycle management, making them
well-suited for a wide range of use cases.
2. Data Lakes: Data lakes provide a centralized repository for storing and
managing large volumes of structured, semi-structured, and unstructured data in its
native format. This approach enables organizations to ingest, store, and analyze
diverse datasets without the need for predefined schemas or data transformations.

19. Processing Big Data

To overcome the challenges of processing data at this scale, organizations employ a
variety of strategies and technologies for big data processing, each tailored to the
unique requirements and
characteristics of their data ecosystem. This includes:
1. Distributed Computing: Distributed computing frameworks, such as Apache
Hadoop, Apache Spark, and Apache Flink, provide scalable and fault-tolerant
platforms for processing big data in parallel across distributed clusters of
commodity hardware. These frameworks offer features such as distributed storage,
data locality optimization, and fault tolerance, making them well-suited for batch
and stream processing of big data (a brief PySpark sketch follows this list).
2. In-Memory Processing: In-memory processing technologies, such as Apache
Ignite, Apache Arrow, and Redis, leverage the power of RAM to accelerate data
processing and analysis by keeping data in memory rather than accessing it from
disk. This approach enables faster query execution, iterative processing, and real-
time analytics, making it well-suited for interactive and exploratory data analysis.
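To ground the distributed-computing approach above, here is a minimal PySpark sketch, assuming PySpark is installed locally; the input path and column names are illustrative (they happen to match the partitioned dataset written in the earlier partitioning sketch). The same code scales from a laptop to a cluster because Spark distributes the work across executors.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session; on a cluster the master would differ.
spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# Read a (hypothetical) partitioned Parquet dataset and compute a simple
# aggregation in parallel.
events = spark.read.parquet("events_partitioned")
daily = (
    events
    .withColumn("day", F.to_date("ts"))
    .groupBy("day")
    .agg(F.count("*").alias("events"), F.sum("value").alias("total_value"))
    .orderBy("day")
)

daily.show(10)
spark.stop()
```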

Architectures for Big Data Processing:


In addition to processing strategies, organizations must design architectures that
enable efficient and scalable big data processing workflows. This includes:
1. Lambda Architecture: The Lambda architecture provides a framework for
building robust and fault-tolerant big data processing pipelines that can handle both
batch and stream processing of data. By combining batch and speed layers in
parallel, organizations can achieve comprehensive insights with low latency,
enabling real-time and near-real-time analytics.
2. Kappa Architecture: The Kappa architecture offers a simplified alternative to
the Lambda architecture by eliminating the batch layer and relying solely on stream
processing for data ingestion and analysis. This approach simplifies the
architecture, reduces complexity, and enables faster time-to-insight, making it
well-suited for real-time analytics and event-driven applications.

Best Practices for Big Data Processing:
To optimize big data processing workflows, organizations should adhere to several
best practices, including:
1. Data Partitioning and Sharding: Partitioning large datasets into smaller, more
manageable chunks enables parallel processing and improves scalability and
performance. By dividing data based on certain criteria, such as time, geography,
or key ranges, organizations can distribute processing and storage resources more
evenly, minimizing bottlenecks and contention.
2. Data Compression and Serialization: Compressing and serializing data before
processing reduces storage and bandwidth requirements, improves data transfer
speeds, and accelerates query execution. By using efficient compression
algorithms and serialization formats, such as Apache Avro or Protocol Buffers,
organizations can minimize data footprint and optimize resource utilization.

20. Processing Data for ML

The Role of Data Processing in Machine Learning:


Data processing for machine learning encompasses the techniques and
technologies involved in preparing, cleaning, and transforming raw data into a
format suitable for training machine learning models. At its core, data processing
serves several critical functions, including feature extraction, normalization, and
dimensionality reduction. These preprocessing steps are essential for improving
model performance, reducing overfitting, and ensuring generalization to unseen
data.

Strategies for Data Processing in Machine Learning:

To achieve these objectives, organizations employ a variety of strategies and
techniques for data processing in machine learning, each tailored to the unique
requirements and characteristics of their data ecosystem. This includes:
1. Feature Engineering: Feature engineering involves selecting, extracting, and
transforming relevant features from raw data to facilitate model learning and
prediction. This may include numerical features, categorical features, text features,
or image features, depending on the nature of the data and the specific machine
learning task.
2. Data Normalization: Data normalization techniques, such as min-max scaling
or standardization, ensure that input features are on a similar scale, preventing
certain features from dominating the learning process and improving model
convergence and stability.
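A small sketch of the normalization step using scikit-learn; the feature matrix is invented, and scikit-learn and NumPy are assumed to be installed. Fitting the scaler on training data only, then applying it to test data, keeps information from leaking out of the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative feature matrix: columns on very different scales
# (e.g., age in years and income in currency units).
X_train = np.array([[25, 30_000], [40, 85_000], [33, 52_000], [58, 120_000]])
X_test = np.array([[29, 61_000]])

# Standardization: zero mean, unit variance per feature.
std = StandardScaler().fit(X_train)
print(std.transform(X_test))

# Min-max scaling: squashes each feature into the [0, 1] range
# learned from the training data.
mm = MinMaxScaler().fit(X_train)
print(mm.transform(X_test))
```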

Architectures for Data Processing in Machine Learning:


In addition to processing strategies, organizations must design architectures that
enable efficient and scalable data processing workflows for machine learning. This
includes:
1. Data Pipelines: Data pipelines provide a structured framework for orchestrating
data processing tasks, from ingestion to preparation to training. By automating and
streamlining the data processing workflow, organizations can ensure consistency,
reproducibility, and scalability in machine learning model development.

2. Model Serving Infrastructure: Model serving infrastructure enables
organizations to deploy trained machine learning models into production
environments, where they can serve real-time predictions or batch inference
requests. By decoupling model inference from model training, organizations can
achieve greater flexibility, scalability, and reliability in deploying machine learning
solutions.

21. Analyzing and Visualizing data

The Role of Data Analysis and Visualization:


Data analysis and visualization encompass the techniques and technologies
involved in exploring, summarizing, and communicating insights from raw data in
a visual and intuitive manner. At its core, data analysis serves several critical
functions, including descriptive analysis, diagnostic analysis, predictive analysis,
and prescriptive analysis. These analyses enable organizations to uncover patterns,
trends, anomalies, and relationships within their data, providing a foundation for
informed decision-making and strategic planning.

Strategies for Data Analysis:


1. Descriptive Analysis: Descriptive analysis involves summarizing and
aggregating data to provide a high-level overview of key metrics, trends, and
distributions. This may include summary statistics, frequency distributions,
histograms, or heatmaps, depending on the nature of the data and the specific
analysis objectives.
2. Diagnostic Analysis: Diagnostic analysis focuses on understanding the root
causes of observed patterns or anomalies within the data. This may involve
hypothesis testing, correlation analysis, regression analysis, or causal inference
techniques to identify relationships and dependencies between variables.
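A brief pandas sketch of the descriptive and diagnostic steps above; the dataset and column names are invented. Summary statistics give the high-level overview, and a correlation matrix hints at relationships worth investigating further.

```python
import pandas as pd

# Invented sales data for illustration.
sales = pd.DataFrame({
    "ad_spend": [100, 150, 200, 250, 300, 350],
    "visits": [1200, 1550, 1980, 2300, 2810, 3150],
    "revenue": [5.1, 6.0, 7.9, 8.7, 10.2, 11.5],
})

# Descriptive analysis: summary statistics per column.
print(sales.describe())

# Diagnostic analysis: pairwise correlations suggest which variables move
# together (correlation is not causation).
print(sales.corr())
```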

Strategies for Data Visualization:


1. Charts and Graphs: Charts and graphs are powerful tools for visualizing
patterns, trends, and relationships within the data. This may include bar charts, line
charts, scatter plots, pie charts, or box plots, each offering unique advantages for
representing different types of data and analysis objectives.
2. Dashboards: Dashboards provide a centralized and interactive interface for
visualizing and exploring data in real-time. This may include interactive charts,
tables, maps, or widgets, enabling users to drill down into specific data subsets,
filter data based on criteria, and gain deeper insights into key metrics and KPIs.

Best Practices for Data Analysis and Visualization:
To optimize data analysis and visualization workflows, organizations should
adhere to several best practices, including:
1. Audience-Centric Design: Designing visualizations with the end-user in mind
ensures that insights are communicated effectively and resonate with the intended
audience. Organizations should consider factors such as audience demographics,
preferences, and prior knowledge when designing visualizations.

2. Iterative Exploration: Data analysis and visualization are iterative processes
that require continuous exploration and refinement. Organizations should
encourage a culture of experimentation and iteration, where insights are refined
based on feedback, new data, and evolving analysis objectives.

22. Automating the pipeline
Pipeline automation encompasses the techniques and technologies involved in
orchestrating, scheduling, and monitoring data processing tasks across the data
pipeline lifecycle. At its core, pipeline automation serves several critical functions,
including:
1. Workflow Orchestration: Automating the sequencing and dependencies of
data processing tasks ensures that they are executed in the correct order and at the
appropriate times, minimizing delays, errors, and resource contention.

2. Resource Management: Automating the allocation and deallocation of
computing resources, such as CPU, memory, and storage, ensures that data
processing tasks have access to the necessary resources to execute efficiently and
reliably.

3. Monitoring and Alerting: Automating the monitoring of data pipeline
performance and health enables organizations to detect anomalies, errors, and
failures in real-time and take corrective actions proactively, minimizing downtime
and disruptions.

Strategies for Pipeline Automation:


1. Workflow Management Systems: Workflow management systems, such as
Apache Airflow, Apache NiFi, or Luigi, provide a centralized platform for
defining, scheduling, and executing data processing workflows. These systems
offer features such as task dependencies, scheduling, retry mechanisms, and
monitoring capabilities, making them well-suited for orchestrating complex data
pipelines (a minimal Airflow DAG sketch follows this list).

2. Infrastructure as Code: Infrastructure as code (IaC) frameworks, such as
Terraform or AWS CloudFormation, enable organizations to automate the
provisioning and configuration of computing resources, such as virtual machines,
containers, or serverless functions, needed to execute data processing tasks. By
defining infrastructure as code, organizations can ensure consistency,
reproducibility, and scalability in their data pipeline deployments.
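To make the orchestration idea concrete, here is a minimal Apache Airflow DAG sketch, assuming Airflow 2.4 or later; the task functions are trivial placeholders for real extract/transform/load steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract raw data")           # placeholder for a real ingestion step


def transform():
    print("clean and enrich the data")  # placeholder transformation


def load():
    print("write curated data")         # placeholder load step


with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Task dependencies: extract -> transform -> load.
    t_extract >> t_transform >> t_load
```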

Best Practices for Pipeline Automation:
1. Modular Design: Breaking down data processing tasks into smaller, modular
components enables organizations to build reusable, composable workflows that
can be easily scaled, extended, and maintained over time. This promotes flexibility,
agility, and maintainability in data pipeline development.

2. Continuous Integration and Deployment: Adopting continuous integration
and deployment (CI/CD) practices enables organizations to automate the testing,
validation, and deployment of data pipeline changes in a rapid and reliable manner.
By automating the deployment pipeline, organizations can reduce manual errors,
accelerate time-to-market, and improve overall pipeline reliability and stability.

Challenges and Considerations:


Despite the benefits of pipeline automation, organizations must contend with
several challenges and considerations, including:
1. Complexity: Automating complex data pipelines with heterogeneous data
sources, dependencies, and processing requirements can be challenging and require
careful planning, design, and implementation.
2. Security and Compliance: Ensuring the security and compliance of automated
data pipelines is paramount, particularly when dealing with sensitive or regulated
data. Organizations must implement robust access controls, encryption
mechanisms, and auditing capabilities to protect data privacy and mitigate
regulatory risks.

LABS
LAB 1
AMAZON S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that
offers industry-leading scalability, data availability, security, and performance.
Customers of all sizes and industries can use Amazon S3 to store and protect any
amount of data for a range of use cases, such as data lakes, websites, mobile
applications, backup and restore, archive, enterprise applications, IoT devices,
and big data analytics. Amazon S3 provides management features so that you can
optimize, organize, and configure access to your data to meet your specific
business, organizational, and compliance requirements. [2]
Amazon S3 is an object storage service that stores data as objects within
buckets. An object is a file and any metadata that describes the file. A bucket is
a container for objects. To store your data in Amazon S3, you first create a
bucket and specify a bucket name and AWS Region. Then, you upload your data
to that bucket as objects in Amazon S3. Each object has a key (or key name),
which is the unique identifier for the object within the bucket.

S3 provides features that you can configure to support your specific use case.
For example, you can use S3 Versioning to keep multiple versions of an object
in the same bucket, which allows you to restore objects that are accidentally
deleted or overwritten. Buckets and the objects in them are private and can be
accessed only if you explicitly grant access permissions. You can use bucket
policies, AWS Identity and Access Management (IAM) policies, access control
lists (ACLs), and S3 Access Points to manage access.
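The following sketch walks through the bucket-and-object workflow described above using the boto3 SDK; the bucket name, Region, and file names are placeholders, and bucket names must be globally unique in practice.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # Region chosen for the example

bucket_name = "example-data-engineering-bucket"    # placeholder; must be globally unique

# Create the bucket (us-east-1 requires no LocationConstraint).
s3.create_bucket(Bucket=bucket_name)

# Upload a local file as an object; the key uniquely identifies it in the bucket.
s3.upload_file("sales.csv", bucket_name, "raw/sales.csv")

# List the objects under the "raw/" prefix to confirm the upload.
response = s3.list_objects_v2(Bucket=bucket_name, Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])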

LAB 2

AMAZON ATHENA:

Amazon Athena is an interactive query service that makes it easy to analyze data
in Amazon S3 using standard SQL. Athena is serverless, so there is no
infrastructure to manage, and you pay only for the queries that you run. Athena
is easy to use. Simply point to your data in Amazon S3, define the schema, and
start querying using standard SQL. Most results are delivered within seconds.
With Athena, there’s no need for complex ETL jobs to prepare your data for
analysis. This makes it easy for anyone with SQL skills to quickly analyze large-
scale datasets. [3]

Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you
to create a unified metadata repository across various services, crawl data
sources to discover schemas and populate your Catalog with new and modified
table and partition definitions, and maintain schema versioning.
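As a minimal sketch of "point Athena at data in S3 and query with SQL", the snippet below starts a query through boto3 and prints the results; the database, table, and S3 output location are assumptions for illustration.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database, table, and S3 output location are placeholders for this sketch.
response = athena.start_query_execution(
    QueryString="SELECT product, SUM(amount) AS total FROM sales GROUP BY product",
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])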

Fig-2.1 Amazon Athena

LAB-3
AWS GLUE:
AWS Glue is a serverless data integration service that makes it easier to discover,
prepare, move, and integrate data from multiple sources for analytics, machine
learning (ML), and application development.[4]
Preparing your data to obtain quality results is the first step in an analytics or ML
project. AWS Glue is a serverless data integration service that makes data
preparation simpler, faster, and cheaper. You can discover and connect to over 70
diverse data sources, manage your data in a centralized data catalog, and visually
create, run, and monitor ETL pipelines to load data into your data lakes.
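A minimal sketch of the discover-and-catalog step with boto3 follows; the crawler name, IAM role ARN, database, and S3 path are placeholders and would need to exist in your account.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawler name, IAM role ARN, database, and S3 path are placeholders.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="demo_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-engineering-bucket/raw/"}]},
)

# Run the crawler; it infers schemas and registers tables in the Data Catalog.
glue.start_crawler(Name="sales-crawler")

# After the crawler finishes, the discovered tables can be listed.
for table in glue.get_tables(DatabaseName="demo_db")["TableList"]:
    print(table["Name"])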

Fig-3.1 AWS Glue

LAB-4
AMAZON REDSHIFT
Amazon Redshift uses SQL to analyze structured and semi-structured data across
data warehouses, operational databases, and data lakes, using AWS-designed
hardware and machine learning to deliver the best price performance at any scale.
[5]
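As a hedged example of querying Redshift with SQL from Python, the sketch below uses the Redshift Data API via boto3; the cluster identifier, database, user, and table are placeholders.

import time
import boto3

# The Redshift Data API runs SQL without managing database connections.
client = boto3.client("redshift-data", region_name="us-east-1")

# Cluster identifier, database, user, and table are placeholders.
response = client.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT region, COUNT(*) AS order_count FROM orders GROUP BY region",
)
statement_id = response["Id"]

# Wait for the statement to finish, then print each result row.
while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

if client.describe_statement(Id=statement_id)["Status"] == "FINISHED":
    for row in client.get_statement_result(Id=statement_id)["Records"]:
        print([list(field.values())[0] for field in row])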

Fig-4.1 Amazon Redshift

LAB-5
ANALYSE DATA WITH AMAZON SAGEMAKER:
Amazon SageMaker is a fully managed machine learning service. With
SageMaker, data scientists and developers can quickly and easily build and train
machine learning models, and then directly deploy them into a production-ready
hosted environment. It provides an integrated Jupyter authoring notebook
instance for easy access to your data sources for exploration and analysis, so you
don't have to manage servers. It also provides common machine learning
algorithms that are optimized to run efficiently against extremely large data in a
distributed environment. With native support for bring-your-own-algorithms and
frameworks, SageMaker offers flexible distributed training options that adjust to
your specific workflows. Deploy a model into a secure and scalable environment
by launching it with a few clicks from SageMaker Studio or the SageMaker
console. Training and hosting are billed by minutes of usage, with no minimum
fees and no upfront commitments. [6]
Creating a Jupyter notebook with Amazon SageMaker
1. Open the notebook instance as follows:
a. Sign in to the SageMaker console at https://console.aws.amazon.com/sagemaker/.
b. On the Notebook instances page, open your notebook instance by choosing
either Open JupyterLab for the JupyterLab interface or Open Jupyter for the
classic Jupyter view.
2. Create a notebook as follows:
a. If you opened the notebook in the JupyterLab view, on the File menu, choose
New, and then choose Notebook. For Select Kernel, choose conda_python3. This
preinstalled environment includes the default Anaconda installation and Python
3.
b. If you opened the notebook in the classic Jupyter view, on the Files tab, choose
New, and then choose conda_python3. This preinstalled environment includes the
default Anaconda installation and Python 3.
3. Save the notebooks as follows:
a. In the JupyterLab view, choose File, choose Save Notebook As..., and then rename the notebook.
b. In the Jupyter classic view, choose File, choose Save as..., and then rename the notebook.
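Once the conda_python3 notebook is open, a first analysis cell might look like the sketch below, which pulls a CSV object from S3 into pandas for exploration; the bucket and key are placeholders.

import boto3
import pandas as pd
from io import BytesIO

# Bucket and key are placeholders; point these at your own dataset in S3.
bucket = "example-data-engineering-bucket"
key = "raw/sales.csv"

obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
df = pd.read_csv(BytesIO(obj["Body"].read()))

# Quick exploratory summary of the dataset.
print(df.shape)
print(df.describe())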

LAB-6
LOAD DATA USING PIPELINE:
AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations, and AWS Data Pipeline enforces the logic that you've set up.
The following components of AWS Data Pipeline work together to manage your data:
• A pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition File Syntax.
• A pipeline schedules and runs tasks by creating Amazon EC2 instances to perform the defined work activities. You upload your pipeline definition to the pipeline, and then activate the pipeline. You can edit the pipeline definition for a running pipeline and activate the pipeline again for it to take effect. You can deactivate the pipeline, modify a data source, and then activate the pipeline again. When you are finished with your pipeline, you can delete it.
• Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runners.
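A hedged boto3 sketch of the create-define-activate flow described above follows; the pipeline name and the skeletal definition are placeholders, and a real definition would add data nodes and activities.

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline shell; the name and unique ID are placeholders.
pipeline = dp.create_pipeline(name="demo-copy-pipeline", uniqueId="demo-copy-pipeline-001")
pipeline_id = pipeline["pipelineId"]

# A skeletal definition with only a Default object; a real definition would
# add data nodes and activities (for example, an S3-to-Redshift copy).
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            ],
        }
    ],
)

# Activate the pipeline so Task Runner can start polling for work.
dp.activate_pipeline(pipelineId=pipeline_id)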

Fig-6.1 Loading Data

LAB-7
ANALYSE STREAMING DATA:
Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
Amazon Kinesis is a suite of services for processing streaming data. With Amazon Kinesis, you can ingest real-time data such as video, audio, website clickstreams, or application logs. You can process and analyze the data as it arrives, instead of capturing it all to storage before you begin analysis.
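As a minimal sketch of sending a streaming record into Kinesis Data Firehose with boto3, the snippet below writes one clickstream-style event; the delivery stream name and event fields are assumptions.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# The delivery stream name is a placeholder; the stream itself is configured
# separately to deliver into S3, Redshift, or another destination.
event = {"user_id": 42, "page": "/checkout", "timestamp": "2024-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="example-clickstream-firehose",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)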

Fig-7.1 Analysing data

LAB-8
ANALYSE IOT DATA WITH AWS IOT ANALYTICS:
AWS IoT Analytics automates the steps required for analyzing IoT data. You can filter, transform, and enrich the data before storing it in a time-series data store. AWS IoT Core provides connectivity between IoT devices and AWS services. IoT Core is fully integrated with IoT Analytics.
IoT data is highly unstructured, which makes it difficult to analyze with traditional analytics and business intelligence tools that are designed to process structured data. IoT data comes from devices that often record fairly noisy processes (such as temperature, motion, or sound). The data from these devices can frequently have significant gaps, corrupted messages, and false readings that must be cleaned up before analysis can occur. Also, IoT data is often only meaningful in the context of additional, third-party data inputs. For example, to help farmers determine when to water their crops, vineyard irrigation systems often enrich moisture sensor data with rainfall data from the vineyard, allowing for more efficient water usage while maximizing harvest yield. [7]
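A hedged boto3 sketch of the channel-pipeline-datastore setup described above follows; the resource names are placeholders, and the activity structure shown is a minimal pass-through without the filter or enrich steps a real deployment would add.

import boto3

iot_analytics = boto3.client("iotanalytics", region_name="us-east-1")

# Resource names are placeholders; AWS IoT Core rules would route device
# messages into the channel in a real deployment.
iot_analytics.create_channel(channelName="sensor_channel")
iot_analytics.create_datastore(datastoreName="sensor_datastore")

# A minimal pass-through pipeline from channel to datastore; filter,
# transform, and enrich activities could be inserted between the two.
iot_analytics.create_pipeline(
    pipelineName="sensor_pipeline",
    pipelineActivities=[
        {"channel": {"name": "read_channel", "channelName": "sensor_channel", "next": "store"}},
        {"datastore": {"name": "store", "datastoreName": "sensor_datastore"}},
    ],
)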

Fig-8.1 IoT Data

CASE STUDY

Transforming Retail Intelligence: Backcountry.com's Journey to Modernize Data Architecture with Google Cloud and Looker

Abstract: Starting in early 2020, Backcountry and Ternary Data embarked on a mission to modernize Backcountry's data architecture. Originally on AWS, Backcountry evaluated many options and settled on Google Cloud and Looker. This involved a complete overhaul of Backcountry's data warehouse and machine learning infrastructure and practices. Ternary Data helped Backcountry's data team evaluate several architectures on Google Cloud Platform, before proving out a final configuration for modernization and building this workload in production.

The Challenges: When Backcountry engaged Ternary Data for data architecture
strategy, it was clear that they needed an overhaul of their legacy data
infrastructure. Some challenges included:

• Analytics and Reporting - The business and data scientists struggled to access
data within their oversubscribed and cost-prohibitive Oracle on-premises data
warehouse. Queries were slow at an average return time of more than 97
seconds, often timing out after several hours. Storage was extremely limited
due the cost of scaling databases that combine storage with compute, forcing
pruning of valuable data. As a result, Backcountry could only store 1 year's
worth of web data, alongside its historical transaction data. This meant over
90% of the company's available useful data like Google Ads, Facebook,
historical clickstream, machine learning outputs, or large cost centers like
hourly employee costs were not available or accessible by analysts and data
scientists. Additionally, Backcountry used OBIEE for its business intelligence,
which was cumbersome and lacked both basic and advanced functionalities the
team needed to improve data governance and reporting automation.
• Overly complex infrastructure - Backcountry was maintaining both a data lake
(Amazon S3 with Databricks) and a data warehouse (Oracle). Managing data
and schemas in the data lake was a significant labor cost, and keeping data in sync across Oracle, S3, and other systems was a constant struggle. They also still use some legacy Microsoft SQL Server and Postgres databases for reporting, which will eventually be deprecated.
• Data pipelines - Backcountry used Talend, which functioned in a purely chronological, on-premises stack. As the company grew, they increasingly required event-based (Airflow) pipelines and cloud-native partners.

• Machine learning - Data science initiatives were slow to deliver value due to
the limitations of the Oracle data warehouse and incoherent data pipelines.
While the team had shipped demonstrably high value machine learning models,
scaling beyond basic machine learning, or scaling beyond a single department
was made impossible by the stack. Even a 10% downsampled training dataset
often took over 10 hours to pull from the database, and most of the time these
larger analytic queries would simply fail after a few hours of running.
Additionally, the lack of data governance and clear business-
defined/democratized data definitions and metrics meant that even when
models could be built, they could be optimizing toward incorrect metrics.

Requirements:
• Analytics and Reporting - The business must get fast and actionable insights,
with an SLA of 30 seconds for query response. 95% of queries must return within the SLA.
• Simple Infrastructure - The system should be cost-effective and serverless wherever possible, avoiding undifferentiated heavy lifting and allowing data engineers to focus on data rather than operations.
• Data pipelines - The new data pipelines must scale as data and analytics
requirements grow in the future. The tool must seamlessly integrate with
BigQuery.
• Machine learning - Data scientists need a platform for rapid experimentation
and the ability to put their models into production.
• Marketing - Backcountry wants to integrate Google Analytics, Ads and

Solution: Given the requirements, Ternary Data architected a cost-effective and low-maintenance solution.

1. Analytics and Reporting


BigQuery increased the speed of Backcountry’s decisions and reporting. As a
benchmark, querying all of their historical sales used to take over 10 hours.
Now, the same queries take under 3 minutes.
Actionable data delivered through BigQuery is readily available for the
business and machine learning projects. Backcountry leverages BigQuery to
quickly analyze and model customer lifetime value, personalization,
clickstream, marketing data, and much more.
Looker is used for actionable analytics and visualization. Looker’s LookML
provides sound data governance and consistency.

2. Simple Infrastructure
o The Google Cloud Platform Marketplace significantly reduces the
complexity and cost of Backcountry’s data stack. Backcountry now
interfaces with one main vendor, and pays one bill.
o Backcountry’s data team builds expertise in one cloud platform, on
Google Cloud’s best of breed data technologies.
3. Data Pipelines
o Cloud Composer is based on the open source Apache Airflow project and
the Python programming language, making it a good fit for the team’s
skillset. It provides deep integration with the entire GCP ecosystem,
combined with the power to connect to nearly any data source through
custom code.
o Fivetran’s hands-off and seamless data pipelines into BigQuery greatly
reduced the operational burden on Backcountry’s data team.
4. Machine learning
o Data scientists are able to experiment and iterate very quickly using Google Cloud's AI Platform, BigQuery ML, and AutoML.
o Backcountry can rapidly incorporate machine learning models into
production, generating a lot of value for both Backcountry and its
customers.

Result:
Data is available and useful across the business. Actionable data is readily
available for the business and machine learning projects, using BigQuery and
Looker.
Reduction in query times for key data. With BigQuery as the data warehouse, query times for historical sales dropped from more than 10 hours to under 3 minutes.

CONCLUSION

The collaboration between Backcountry.com and Ternary Data represents a paradigm shift in modernizing data architecture for businesses operating in the digital landscape.
By transitioning from legacy on-premises systems to Google Cloud and Looker,
Backcountry.com unlocked the full potential of its data assets, facilitating faster
decision-making and empowering data-driven insights across the organization.

The challenges of slow query times, limited data accessibility, and complex
infrastructure were effectively addressed through strategic implementation of Google
Cloud Platform services. BigQuery emerged as a game-changer, reducing query times
from hours to minutes and providing actionable data for both business operations and
machine learning initiatives.

Simplified infrastructure, enabled by the Google Cloud Platform Marketplace, streamlined operations and reduced costs, while empowering Backcountry.com's data
team to focus on innovation rather than maintenance tasks. The adoption of Cloud
Composer and Fivetran for data pipelines further enhanced efficiency, enabling
seamless integration with BigQuery and other data sources.

Machine learning initiatives thrived on Google Cloud's AI Platform, BigQuery ML, and AutoML, enabling data scientists to experiment and iterate rapidly, ultimately
delivering tangible value to Backcountry.com and its customers.

Overall, the successful modernization of Backcountry.com's data architecture underscores the importance of agility, scalability, and innovation in today's data-driven business landscape. Through strategic partnerships and adoption of cutting-edge
technologies, Backcountry.com positioned itself for continued success in leveraging
data as a strategic asset for driving growth and customer satisfaction.

REFERENCE LINKS

• Data Engineering on AWS
https://awsacademy.instructure.com/courses/68245/modules#module_824165

• Amazon S3
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html

• Amazon Athena
https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

• AWS Glue
https://aws.amazon.com/glue/

• Amazon Redshift
https://aws.amazon.com/redshift/

• Amazon SageMaker
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

• AWS IoT Analytics
https://aws.amazon.com/iot-analytics/

• Case Study
https://www.ternarydata.com/case-study-backcountrycom