AWS Data Engineer
by
SAMMATHAMU SESHANK (20131A05L9)
Under the esteemed guidance of
Course Coordinator: Dr. Ch. Sita Kumari, Associate Professor
Internship Mentor: Mr. R. Siva Kumar, Assistant Professor
Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam
CERTIFICATE
This report on
“AWS DATA ENGINEERING VIRTUAL INTERNSHIP”
is a bona fide record of the internship work submitted
by
SAMMATHAMU SESHANK (20131A05L9)
in their VIII semester, in partial fulfilment of the requirements for the award
of the degree of
Bachelor of Technology in Computer Science and Engineering
Internship Mentor
Mr. R. Siva Kumar
Assistant Professor
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude to our esteemed institute,
Gayatri Vidya Parishad College of Engineering (Autonomous), which has provided us
with the opportunity to fulfil our cherished desire.
Finally, we are indebted to the teaching and non-teaching staff of the Computer
Science and Engineering Department for all their support in the completion of this
internship work.
ABSTRACT
Data engineering is the practice of collecting, storing, and preparing raw data so that
conclusions can be drawn from it. It relies on a variety of software tools, ranging from
spreadsheets, data visualization and reporting tools, and data mining programs to
open-source languages, for effective data manipulation. Data analysis mainly deals with
data collection, data storage, data preprocessing, and data visualisation.
In Course-1 we learnt about cloud computing. Cloud computing is the on-demand
delivery of compute power, database, storage, applications, and other IT resources via
the internet with pay-as-you-go pricing. These resources run on server computers that
are located in large data centers in different locations around the world. When you use
a cloud service provider like AWS, that service provider owns the computers that you
are using. This course deals with the following main concepts: compute services,
storage services, management services, database services, compliance services, and
AWS cost management services.
As part of Course-2 we learnt data engineering, which deals with turning raw data into
solutions. In this course we learnt about big data, which is foundational for data
analysis and therefore very important for data engineering. Storing such data is a
challenge in itself, so we learnt about the different tools used for data storage and about
how to analyse and preprocess big data. The main concepts include storage and
analytics with AWS services such as Amazon S3, Amazon Athena, Amazon Redshift,
AWS Glue, Amazon SageMaker, and AWS IoT Analytics.
Index
COURSE-1: CLOUD FOUNDATIONS
8 Databases
8.1 Relational Database Service
8.2 Cloud Architecture
8.3 AWS Trusted Advisor
8.4 Automatic Scaling and Monitoring
9 Cloud Architecture
9.1 Cloud Architects
9.2 Reliability and Availability
10 Automatic Scaling and Monitoring
10.1 Elastic Load Balancing
10.2 Amazon CloudWatch
10.3 Amazon EC2 Auto Scaling
COURSE-2: DATA ENGINEERING
22 Automating the Pipeline
23 Labs
25 Conclusion
26 Reference Links
CLOUD FOUNDATIONS
Cloud computing offers three main service models: IaaS (Infrastructure as a Service),
PaaS (Platform as a Service), and SaaS (Software as a Service), as shown in Fig-1.2.1.
In the traditional model, infrastructure is treated as hardware; in the cloud,
infrastructure is treated as software.
1. Hardware solutions are physical, so changing them takes time, effort, and money.
2. Software solutions:
• Are flexible
• Can change more quickly, easily, and cost-effectively than hardware solutions
Fig-1.2.1 Cloud Service models
Advantages of cloud computing:
• Trade capital expense for variable expense
• Benefit from massive economies of scale
• Stop guessing capacity
• Increase speed and agility
• Go global in minutes
• Stop spending money on running and maintaining data centers
Amazon Web Services (AWS) is a secure cloud platform that offers a broad set of global
cloud-based products. Because these products are delivered over the internet, you have
on-demand access to the compute, storage, network, database, and other IT resources
that you might need for your projects, as well as the tools to manage them. AWS offers
flexibility. Your AWS environment can be reconfigured and updated on demand, scaled
up or down automatically to meet usage patterns and optimize spending, or shut down
temporarily or permanently.
Fig-1.3.1 Services covered in the course
The AWS Cloud Adoption Framework (AWS CAF) provides guidance and best practices
to help organizations build a comprehensive approach to cloud computing across the
organization and throughout the IT lifecycle, so that they can accelerate successful
cloud adoption.
2. CLOUD ECONOMICS AND BILLING
AWS realizes that every customer has different needs. If none of the AWS pricing
models work for your project, custom pricing is available for high-volume projects with
unique requirements. There are some fundamental rules about how you pay for AWS:
• There is no charge (with some exceptions) for:
o Inbound data transfer.
o Data transfer between services within the same AWS Region.
• Pay for what you use.
• Start and stop anytime.
• No long-term contracts are required.
• Some services are free, but the other AWS services that they provision might not be
free.
2.3 What is TCO?
Total Cost of Ownership (TCO) is a financial estimate that helps you identify both the
direct and the indirect costs of a system, so that on-premises and cloud costs can be
compared fairly.
Use the AWS Pricing Calculator to estimate the monthly cost of the AWS services in a
planned architecture and to compare different configuration and pricing options before
you build.
3. AWS GLOBAL INFRASTRUCTURE
The AWS Global Infrastructure is designed and built to deliver a flexible, reliable,
scalable, and secure cloud computing environment with high-quality global network
performance. The AWS Cloud infrastructure is built around Regions and Availability
Zones, which are designed to provide high availability.
4. AWS CLOUD SECURITY
4.3 IAM:
With AWS Identity and Access Management (IAM), you can specify who or what can
access services and resources in AWS. IAM is a web service that helps you securely
control access to AWS resources. We use IAM to control who is authenticated (signed
in) and authorized (has permissions) to use resources.
There are two types of IAM policies:
1. AWS managed policies
2. Customer managed policies
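To make the distinction concrete, the following is a minimal sketch, using the AWS SDK
for Python (boto3), of creating a customer managed policy and attaching it to an IAM
user; the bucket name, policy name, and user name are hypothetical placeholders, not
values from the course labs.

import json
import boto3

iam = boto3.client("iam")

# A customer managed policy: read-only access to one (hypothetical) bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-reports-bucket",
            "arn:aws:s3:::example-reports-bucket/*",
        ],
    }],
}

response = iam.create_policy(
    PolicyName="ExampleS3ReadOnly",
    PolicyDocument=json.dumps(policy_document),
)

# Attach the new policy to an existing IAM user (placeholder name).
iam.attach_user_policy(
    UserName="data-engineer-intern",
    PolicyArn=response["Policy"]["Arn"],
)

AWS managed policies, by contrast, already exist in every account and only need to be
attached by their ARN.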
Your AWS account holds personal data. If someone obtains access to that information,
you could become a victim of identity theft. Make your account more secure by
requiring multi-factor authentication (MFA), using a strong password, and avoiding use
of the account root user for everyday tasks.
5. NETWORKING AND CONTENT DELIVERY
Amazon Route 53 supports several types of routing policies, which determine how
Amazon Route 53 responds to DNS queries: simple routing, weighted round robin
routing, latency-based routing, geolocation routing, geoproximity routing, failover
routing, and multivalue answer routing.
6. COMPUTE
6.2 Amazon EC2:
• Gives you full control over the guest operating system (Windows or Linux)
on each instance.
• You can launch instances of any size into an Availability Zone anywhere in the
world.
• Launch instances with a few clicks or a line of code, and they are ready in
minutes.
• Repeatable.
• Self-contained environments.
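The "line of code" launch mentioned above can be sketched with boto3 as follows; the
AMI ID, key pair name, and tag value are hypothetical and would differ per account and
Region.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single small instance; ImageId and KeyName are placeholders.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical Amazon Linux AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "internship-demo"}],
    }],
)

print(response["Instances"][0]["InstanceId"])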
7. STORAGE
7.1 Amazon EBS:
Amazon EBS provides persistent block storage volumes for use with Amazon
EC2 instances. Persistent storage is any data storage device that retains data after
power to that device is shut off; it is also sometimes called non-volatile storage.
Each Amazon EBS volume is automatically replicated within its Availability
Zone to protect you from component failure. It is designed for high availability
and durability. Amazon EBS volumes provide the consistent and low-latency
performance that is needed to run your workloads.
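As an illustrative sketch (the Region, Availability Zone, and instance ID are
placeholders), a volume can be created and attached to a running instance with boto3:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a 20 GiB gp3 volume in the same Availability Zone as the instance.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=20,
    VolumeType="gp3",
)

# Wait until the volume is available, then attach it to a (hypothetical) instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)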
8. DATABASES
8.1 Relational Database Service:
Amazon Relational Database Service (Amazon RDS) is a managed service that makes it
easy to set up, operate, and scale a relational database in the cloud, without having to
manage the underlying database infrastructure yourself.
8.3 AWS Trusted Advisor:
AWS Trusted Advisor is an online tool that provides real-time guidance to help
you provision your resources following AWS best practices.
1. Cost Optimization: AWS Trusted Advisor looks at your resource use and
makes recommendations to help you optimize cost by eliminating unused and
idle resources, or by making commitments to reserved capacity.
2. Performance: Improve the performance of your service by checking your
service limits, ensuring that you take advantage of provisioned throughput, and
monitoring for overutilized instances.
3. Security: Improve the security of your application by closing gaps, enabling
various AWS security features, and examining your permissions.
4. Fault Tolerance: Increase the availability and redundancy of your AWS
application by taking advantage of automatic scaling, health checks, multi-AZ
deployments, and backup capabilities.
9. CLOUD ARCHITECTURE
9.1 Cloud Architects:
• Engage with decision makers to identify the business goal and the capabilities
that need improvement.
• Ensure alignment between the technology deliverables of a solution and the
business goals.
• Work with delivery teams that are implementing the solution to ensure that the
technology features are appropriate.
The AWS Well-Architected Framework is a guide for designing infrastructures that are:
✓ Secure
✓ High performing
✓ Resilient
✓ Efficient
10. AUTOMATIC SCALING AND MONITORING
Elastic Load Balancing automatically distributes incoming application traffic across
multiple targets, such as Amazon EC2 instances, in one or more Availability Zones.
Amazon CloudWatch monitors your AWS resources and applications in near-real time,
and Amazon EC2 Auto Scaling helps maintain application availability by automatically
adding or removing EC2 instances according to conditions that you define.
Fig-10.1.2 Load balancers
DATA ENGINEERING
11. Introduction to Data Engineering
Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats. These
systems empower people to find practical applications of the data, which
businesses can use to thrive.
Data engineering is a skill that is in increasing demand. Data engineers are the
people who design the system that unifies data and can help you navigate it.
Data engineers perform many different tasks including:
• Acquisition: Finding all the different data sets around the business
• Cleansing: Finding and cleaning any errors in the data
• Conversion: Giving all the data a common format
• Disambiguation: Interpreting data that could be interpreted in multiple
ways
• Deduplication: Removing duplicate copies of data
Once this is done, data may be stored in a central repository such as a data
lake or data lakehouse. Data engineers may also copy and move subsets of data
into a data warehouse.
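A small, self-contained Python (pandas) example of the cleansing, conversion, and
deduplication tasks listed above might look like the following; the column names and
values are invented purely for illustration.

import pandas as pd

# Hypothetical raw extract with the kinds of problems listed above.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-02-17", None],
    "country":     ["US", "US", "usa", "IN"],
})

df = raw.copy()

# Cleansing: drop rows that are missing critical fields.
df = df.dropna(subset=["signup_date"])

# Conversion: give the dates a proper datetime type and the country codes a common format.
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["country"] = df["country"].str.upper().replace({"USA": "US"})

# Deduplication: remove duplicate copies of the same customer record.
df = df.drop_duplicates(subset=["customer_id"])

print(df)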
12. Data Driven Organizations
The Genesis of Data-Driven Organizations:
The genesis of data-driven organizations can be traced back to the dawn of
the information age, where the advent of technology ushered in an era of
unprecedented data proliferation. Initially viewed as mere byproducts of digital
transactions, data soon evolved into a strategic asset, offering insights into
consumer behavior, market trends, and operational efficiency. Organizations began
to recognize the intrinsic value of data, laying the foundation for a paradigm shift
in decision-making – from intuition-driven to data-driven.
13. The Elements of Data
Structured Data:
Structured data represents the foundation of traditional databases,
characterized by its organized format, predictable schema, and tabular structure.
This form of data lends itself well to relational databases, where information is
stored in rows and columns, enabling efficient storage, retrieval, and analysis.
Examples of structured data include transaction records, customer profiles, and
inventory lists, each meticulously organized to facilitate easy access and
manipulation.
Unstructured Data:
In contrast to structured data, unstructured data defies conventional
categorization, encompassing a wide array of formats, including text, images,
videos, and audio recordings. This form of data lacks a predefined schema, making
it inherently more challenging to analyze and interpret. However, within the realm
of unstructured data lies a treasure trove of insights, waiting to be unearthed
through advanced analytics, natural language processing, and machine learning
algorithms.
Semi-Structured Data:
Semi-structured data sits between these two extremes: it does not conform to a
rigid tabular schema, yet it carries tags or markers that give it some organization.
Common examples include JSON and XML documents, and log files, each offering a
flexible framework for storing and exchanging information across disparate systems.
14. Data Principles and Patterns for Data Pipelines
This exploration delves into the intricacies of data pipeline design, unveiling
the principles and patterns that enable organizations to orchestrate data flows
efficiently, reliably, and at scale.
4. Microservices Architecture: In the realm of distributed systems, microservices
architecture offers a modular approach to building data pipelines, where individual
components or services are decoupled and independently deployable. This pattern
enables greater agility, scalability, and fault isolation, albeit at the cost of increased
complexity.
15. Securing and Scaling the data pipeline
Scaling the data pipeline involves expanding its capacity and capabilities to
accommodate growing volumes of data, users, and workloads. This requires a
strategic approach that encompasses both vertical and horizontal scaling
techniques. Vertical scaling involves adding more resources, such as CPU,
memory, or storage, to existing infrastructure to handle increased demand.
Horizontal scaling, on the other hand, involves distributing workloads across
multiple nodes or instances, enabling parallel processing and improved
performance.
Scalability Considerations:
While scalability unlocks new opportunities for growth and innovation, it
also introduces a host of challenges and considerations. Organizations must
carefully evaluate factors such as data volume, velocity, variety, and veracity, as
well as the underlying infrastructure, architecture, and technology stack. Moreover,
as data pipelines scale in complexity and size, managing dependencies, optimizing
performance, and ensuring fault tolerance become increasingly critical tasks.
16. Ingesting and preparing data
Data Preparation:
Once data is ingested into the organizational ecosystem, it must undergo a process
of preparation to make it suitable for analysis, modeling, or visualization. Data
preparation encompasses a range of activities, including cleaning, transforming,
enriching, and aggregating data to ensure its quality, consistency, and relevance.
This process is often iterative and involves collaboration between data engineers,
data scientists, and domain experts to identify, understand, and address data quality
issues and inconsistencies.
17. Ingesting by batch or by stream
Understanding Batch Data Ingestion:
Batch data ingestion involves collecting and processing data in discrete
chunks or batches at scheduled intervals. This approach is well-suited for scenarios
where data latency is acceptable, such as historical analysis, batch reporting, or
periodic updates. Batch data ingestion offers several advantages, including
simplicity, scalability, and fault tolerance. By processing data in predefined
batches, organizations can optimize resource utilization, minimize processing
overhead, and ensure consistent performance even in the face of failures or
disruptions.
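A hedged sketch of a daily batch ingestion job in Python is shown below; the file paths,
bucket name, and S3 prefix are hypothetical, and a scheduler such as cron, Amazon
EventBridge, or Airflow would be responsible for invoking it at the chosen interval.

import datetime as dt
import boto3
import pandas as pd

s3 = boto3.client("s3")

def ingest_daily_batch(run_date: dt.date) -> None:
    """Read one day's exported CSV, clean it, and land it in S3 as Parquet."""
    source_file = f"/data/exports/orders_{run_date:%Y%m%d}.csv"   # hypothetical path
    df = pd.read_csv(source_file)
    df = df.drop_duplicates()

    staging_file = f"/tmp/orders_{run_date:%Y%m%d}.parquet"
    df.to_parquet(staging_file, index=False)

    # Partition the landing zone by date so downstream queries can prune data.
    key = f"raw/orders/dt={run_date:%Y-%m-%d}/orders.parquet"
    s3.upload_file(staging_file, "example-data-lake-bucket", key)

if __name__ == "__main__":
    # The scheduler would call this once per day for the previous day's data.
    ingest_daily_batch(dt.date.today() - dt.timedelta(days=1))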
Strategies for Stream Data Ingestion:
To implement stream data ingestion effectively, organizations leverage a
variety of strategies and technologies that enable real-time data processing and
analysis. This includes:
1. Event-Driven Architectures: Event-driven architectures enable organizations
to ingest and process data in real-time in response to events or triggers. This
approach is well-suited for scenarios where immediate action is required, such as
IoT applications, real-time monitoring, or financial transactions processing.
2. Stream Processing Frameworks: Stream processing frameworks, such as
Apache Kafka or Apache Flink, provide distributed computing capabilities for
processing continuous streams of data in real-time. These frameworks offer
features such as fault tolerance, event time processing, and windowing semantics,
making them well-suited for stream data ingestion tasks.
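As a minimal illustration of stream ingestion on AWS (using Amazon Kinesis rather
than the Kafka or Flink frameworks named above), the following Python producer
pushes small JSON events into a hypothetical data stream named "clickstream":

import json
import time
import random
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Continuously push small JSON events into the (hypothetical) "clickstream" stream.
for _ in range(100):
    event = {
        "user_id": random.randint(1, 1000),
        "page": "/checkout",
        "timestamp": time.time(),
    }
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),  # controls shard assignment
    )
    time.sleep(0.1)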
18. Storing and Organizing Data
Technologies for Data Storage and Organization:
To facilitate data storage and organization effectively, organizations leverage a
variety of technologies and tools that offer scalability, reliability, and flexibility.
This includes:
1. Cloud Storage Services: Cloud storage services, such as Amazon S3, Google
Cloud Storage, or Microsoft Azure Blob Storage, provide scalable and durable
storage solutions for storing and managing data in the cloud. These services offer
features such as encryption, versioning, and lifecycle management, making them
well-suited for a wide range of use cases.
2. Data Lakes: Data lakes provide a centralized repository for storing and
managing large volumes of structured, semi-structured, and unstructured data in its
native format. This approach enables organizations to ingest, store, and analyze
diverse datasets without the need for predefined schemas or data transformations.
19. Processing Big Data
Best Practices for Big Data Processing:
To optimize big data processing workflows, organizations should adhere to several
best practices, including:
1. Data Partitioning and Sharding: Partitioning large datasets into smaller, more
manageable chunks enables parallel processing and improves scalability and
performance. By dividing data based on certain criteria, such as time, geography,
or key ranges, organizations can distribute processing and storage resources more
evenly, minimizing bottlenecks and contention.
2. Data Compression and Serialization: Compressing and serializing data before
processing reduces storage and bandwidth requirements, improves data transfer
speeds, and accelerates query execution. By using efficient compression
algorithms and serialization formats, such as Apache Avro or Protocol Buffers,
organizations can minimize data footprint and optimize resource utilization.
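A short PySpark sketch of both practices is given below; it writes partitioned,
compressed Parquet (a columnar alternative to the Avro and Protocol Buffers formats
named above), and the S3 paths and column names are assumptions for illustration
only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-and-compress").getOrCreate()

# Hypothetical raw events already landed in the data lake as JSON.
events = spark.read.json("s3://example-data-lake-bucket/raw/events/")

# Derive partition columns so downstream queries only scan relevant data.
events = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .withColumn("country", F.upper(F.col("country")))
)

# Write columnar, compressed output, partitioned by date and country.
(
    events.write
    .mode("overwrite")
    .partitionBy("event_date", "country")
    .option("compression", "snappy")
    .parquet("s3://example-data-lake-bucket/curated/events/")
)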
20. Processing Data for ML
2. Model Serving Infrastructure: Model serving infrastructure enables
organizations to deploy trained machine learning models into production
environments, where they can serve real-time predictions or batch inference
requests. By decoupling model inference from model training, organizations can
achieve greater flexibility, scalability, and reliability in deploying machine learning
solutions.
21. Analyzing and Visualizing data
Best Practices for Data Analysis and Visualization:
To optimize data analysis and visualization workflows, organizations should
adhere to several best practices, including:
1. Audience-Centric Design: Designing visualizations with the end-user in mind
ensures that insights are communicated effectively and resonate with the intended
audience. Organizations should consider factors such as audience demographics,
preferences, and prior knowledge when designing visualizations.
22. Automating the pipeline
Pipeline automation encompasses the techniques and technologies involved in
orchestrating, scheduling, and monitoring data processing tasks across the data
pipeline lifecycle. At its core, pipeline automation serves several critical functions,
including:
1. Workflow Orchestration: Automating the sequencing and dependencies of
data processing tasks ensures that they are executed in the correct order and at the
appropriate times, minimizing delays, errors, and resource contention.
Best Practices for Pipeline Automation:
1. Modular Design: Breaking down data processing tasks into smaller, modular
components enables organizations to build reusable, composable workflows that
can be easily scaled, extended, and maintained over time. This promotes flexibility,
agility, and maintainability in data pipeline development.
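To illustrate both workflow orchestration and modular design, here is a minimal sketch
of a DAG for recent Apache Airflow (2.4 or later); the DAG name and the three task
bodies are placeholders, and in a real pipeline each task would call its own ingestion,
transformation, or loading code.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    print("pull yesterday's files from the source system")

def transform(**context):
    print("clean, deduplicate, and convert the data")

def load(**context):
    print("load the curated data into the warehouse")

# Workflow orchestration: each modular task runs only after its upstream task succeeds.
with DAG(
    dag_id="daily_sales_pipeline",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task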
LABS
LAB 1
AMAZON S3:
Amazon Simple Storage Service (Amazon S3) is an object storage service that
offers industry-leading scalability, data availability, security, and performance.
Customers of all sizes and industries can use Amazon S3 to store and protect any
amount of data for a range of use cases, such as data lakes, websites, mobile
applications, backup and restore, archive, enterprise applications, IoT devices,
and big data analytics. Amazon S3 provides management features so that you can
optimize, organize, and configure access to your data to meet your specific
business, organizational, and compliance requirements. [2]
Amazon S3 is an object storage service that stores data as objects within
buckets. An object is a file and any metadata that describes the file. A bucket is
a container for objects. To store your data in Amazon S3, you first create a
bucket and specify a bucket name and AWS Region. Then, you upload your data
to that bucket as objects in Amazon S3. Each object has a key (or key name),
which is the unique identifier for the object within the bucket.
S3 provides features that you can configure to support your specific use case.
For example, you can use S3 Versioning to keep multiple versions of an object
in the same bucket, which allows you to restore objects that are accidentally
deleted or overwritten. Buckets and the objects in them are private and can be
accessed only if you explicitly grant access permissions. You can use bucket
policies, AWS Identity and Access Management (IAM) policies, access control
lists (ACLs), and S3 Access Points to manage access.
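A hedged boto3 sketch of the core lab steps (create a bucket, enable versioning, upload
and list objects) follows; the bucket name, Region, and object keys are placeholders and
a real bucket name must be globally unique.

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
bucket = "seshank-internship-demo-bucket"   # placeholder; must be globally unique

# Create the bucket in the chosen Region.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)

# Turn on S3 Versioning so overwritten or deleted objects can be restored.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Upload a local file as an object; the key is its unique identifier in the bucket.
s3.upload_file("sales_2024.csv", bucket, "raw/sales_2024.csv")

# List the objects under the "raw/" prefix.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])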
LAB 2
AMAZON ATHENA:
Amazon Athena is an interactive query service that makes it easy to analyze data
in Amazon S3 using standard SQL. Athena is serverless, so there is no
infrastructure to manage, and you pay only for the queries that you run. Athena
is easy to use. Simply point to your data in Amazon S3, define the schema, and
start querying using standard SQL. Most results are delivered within seconds.
With Athena, there’s no need for complex ETL jobs to prepare your data for
analysis. This makes it easy for anyone with SQL skills to quickly analyze large-
scale datasets. [3]
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you
to create a unified metadata repository across various services, crawl data
sources to discover schemas and populate your Catalog with new and modified
table and partition definitions, and maintain schema versioning.
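The following boto3 sketch shows the query lifecycle described above; the database,
table, and results bucket are hypothetical, and in the lab the same SQL could simply be
typed into the Athena console.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run a standard SQL query against data catalogued by AWS Glue (names are placeholders).
query_id = athena.start_query_execution(
    QueryString="SELECT product_id, SUM(amount) AS revenue "
                "FROM sales GROUP BY product_id ORDER BY revenue DESC LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results-bucket/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])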
LAB-3
AWS GLUE:
AWS Glue is a serverless data integration service that makes it easier to discover,
prepare, move, and integrate data from multiple sources for analytics, machine
learning (ML), and application development.[4]
Preparing your data to obtain quality results is the first step in an analytics or ML
project. AWS Glue is a serverless data integration service that makes data
preparation simpler, faster, and cheaper. You can discover and connect to over 70
diverse data sources, manage your data in a centralized data catalog, and visually
create, run, and monitor ETL pipelines to load data into your data lakes.
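As an illustrative boto3 sketch of this workflow, a crawler can populate the Data Catalog
and a separately authored ETL job can then be started; the crawler name, IAM role
ARN, database, S3 path, and job name are all placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans the raw S3 prefix and populates the Data Catalog.
glue.create_crawler(
    Name="raw-sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",   # hypothetical role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="raw-sales-crawler")

# Once tables exist in the catalog, an ETL job (authored separately) can be started.
run = glue.start_job_run(JobName="sales-to-parquet-job")
print(run["JobRunId"])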
LAB-4
AMAZON REDSHIFT
Amazon Redshift uses SQL to analyze structured and semi-structured data across
data warehouses, operational databases, and data lakes, using AWS-designed
hardware and machine learning to deliver the best price performance at any scale.
[5]
Fig-4.1 Amazon Redshift
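A minimal sketch using the Amazon Redshift Data API via boto3 is shown below; the
cluster identifier, database, user, and table are assumptions, and the same SQL could
equally be run from the Redshift query editor used in the lab.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Submit a SQL statement to a (hypothetical) cluster without managing connections.
stmt = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
)

# Wait for the statement to finish, then read the result set.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for record in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print([list(field.values())[0] for field in record])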
LAB-5
ANALYSE DATA WITH AMAZON SAGEMAKER:
Amazon SageMaker is a fully managed machine learning service. With
SageMaker, data scientists and developers can quickly and easily build and train
machine learning models, and then directly deploy them into a production-ready
hosted environment. It provides an integrated Jupyter authoring notebook
instance for easy access to your data sources for exploration and analysis, so you
don't have to manage servers. It also provides common machine learning
algorithms that are optimized to run efficiently against extremely large data in a
distributed environment. With native support for bring-your-own-algorithms and
frameworks, SageMaker offers flexible distributed training options that adjust to
your specific workflows. Deploy a model into a secure and scalable environment
by launching it with a few clicks from SageMaker Studio or the SageMaker
console. Training and hosting are billed by minutes of usage, with no minimum
fees and no upfront commitments. [6]
Creating a Jupyter notebook with Amazon SageMaker:
1. Open the notebook instance as follows:
a. Sign in to the SageMaker console at
https://console.aws.amazon.com/sagemaker/.
b. On the Notebook instances page, open your notebook instance by choosing
either Open JupyterLab for the JupyterLab interface or Open Jupyter for the
classic Jupyter view.
2. Create a notebook as follows:
a. If you opened the notebook in the JupyterLab view, on the File menu, choose
New, and then choose Notebook. For Select Kernel, choose conda_python3. This
preinstalled environment includes the default Anaconda installation and Python
3.
b. If you opened the notebook in the classic Jupyter view, on the Files tab, choose
New, and then choose conda_python3. This preinstalled environment includes the
default Anaconda installation and Python 3.
3. Save the notebook as follows:
a. In the JupyterLab view, choose File, choose Save Notebook As..., and then rename
the notebook.
b. In the Jupyter classic view, choose File, choose Save as..., and then rename the
notebook.
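Inside such a conda_python3 notebook, an exploratory cell might look like the sketch
below; the bucket, object key, and column name are hypothetical and stand in for
whatever dataset the lab provides.

import boto3
import pandas as pd

# Pull a (hypothetical) dataset from S3 into the notebook and explore it.
s3 = boto3.client("s3")
s3.download_file("example-data-lake-bucket", "curated/churn.csv", "churn.csv")

df = pd.read_csv("churn.csv")
print(df.shape)
print(df.describe())
print(df["churned"].value_counts(normalize=True))   # class balance of a target column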
LAB-6
LOAD DATA USING PIPELINE:
AWS Data Pipeline is a web service that you can use to automate the movement and
transformation of data. With AWS Data Pipeline, you can define data-driven workflows,
so that tasks can be dependent on the successful completion of previous tasks. You
define the parameters of your data transformations, and AWS Data Pipeline enforces
the logic that you've set up.
The following components of AWS Data Pipeline work together to manage your data:
• A pipeline definition specifies the business logic of your data management. For more
information, see Pipeline Definition File Syntax.
• A pipeline schedules and runs tasks by creating Amazon EC2 instances to perform
the defined work activities. You upload your pipeline definition to the pipeline, and
then activate the pipeline. You can edit the pipeline definition for a running pipeline
and activate the pipeline again for it to take effect. You can deactivate the pipeline,
modify a data source, and then activate the pipeline again. When you are finished
with your pipeline, you can delete it.
• Task Runner polls for tasks and then performs those tasks. For example, Task Runner
could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is
installed and runs automatically on resources created by your pipeline definitions.
You can write a custom task runner application, or you can use the Task Runner
application that is provided by AWS Data Pipeline. For more information, see Task
Runners.
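The create-define-activate flow described above can be sketched with boto3 as follows;
the pipeline name and the local definition file are hypothetical, and the file is assumed
to already contain objects in the low-level API format (id / name / fields) described in
Pipeline Definition File Syntax.

import json
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# 1. Create an empty pipeline; uniqueId makes the call idempotent.
pipeline_id = dp.create_pipeline(
    name="daily-log-copy",
    uniqueId="daily-log-copy-v1",
)["pipelineId"]

# 2. Upload a pipeline definition prepared separately (hypothetical local file,
#    assumed to hold objects already in the low-level id/name/fields format).
with open("pipeline_definition.json") as f:
    definition = json.load(f)

result = dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=definition["objects"],
)
print(result.get("validationErrors", []))

# 3. Activate the pipeline so Task Runner can start polling for tasks.
dp.activate_pipeline(pipelineId=pipeline_id)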
LAB-7
ANALYZE STREAMING DATA:
Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that
reliably captures, transforms, and delivers streaming data to data lakes, data stores,
and analytics services.
Amazon Kinesis is a suite of services for processing streaming data. With Amazon
Kinesis, you can ingest real-time data such as video, audio, website clickstreams, or
application logs. You can process and analyze the data as it arrives, instead of capturing
it all to storage before you begin analysis.
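As a minimal sketch of sending a record into Kinesis Data Firehose with boto3 (the
delivery stream name and event fields are placeholders), assuming a delivery stream
that lands data in S3 has already been created:

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Send one clickstream event to a (hypothetical) delivery stream that lands data in S3.
event = {"user_id": 42, "page": "/cart", "action": "add_item"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)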
LAB-8
ANALYSE IOT DATA WITH AWS IOT ANALYTICS:
AWS IoT Analytics automates the steps required for analyzing IoT data. You can filter,
transform, and enrich the data before storing it in a time-series data store. AWS IoT
Core provides connectivity between IoT devices and AWS services, and it is fully
integrated with AWS IoT Analytics.
IoT data is highly unstructured, which makes it difficult to analyze with traditional
analytics and business intelligence tools that are designed to process structured data.
IoT data comes from devices that often record fairly noisy processes (such as
temperature, motion, or sound). The data from these devices can frequently have
significant gaps, corrupted messages, and false readings that must be cleaned up before
analysis can occur. Also, IoT data is often only meaningful in the context of additional,
third-party data inputs. For example, to help farmers determine when to water their
crops, vineyard irrigation systems often enrich moisture sensor data with rainfall data
from the vineyard, allowing for more efficient water usage while maximizing harvest
yield. [7]
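A hedged boto3 sketch of the basic IoT Analytics ingestion chain (channel, pipeline, and
data store) is shown below, reusing the vineyard example; all resource names and the
sensor payload are invented for illustration.

import json
import boto3

iota = boto3.client("iotanalytics", region_name="us-east-1")

# Channel -> pipeline -> data store: the minimal IoT Analytics ingestion chain.
iota.create_channel(channelName="vineyard_channel")
iota.create_datastore(datastoreName="vineyard_datastore")
iota.create_pipeline(
    pipelineName="vineyard_pipeline",
    pipelineActivities=[
        {"channel": {"name": "ingest", "channelName": "vineyard_channel", "next": "store"}},
        {"datastore": {"name": "store", "datastoreName": "vineyard_datastore"}},
    ],
)

# Send a (hypothetical) moisture reading into the channel for processing.
reading = {"sensor_id": "row-12", "moisture": 0.31, "temperature_c": 24.5}
iota.batch_put_message(
    channelName="vineyard_channel",
    messages=[{"messageId": "msg-001", "payload": json.dumps(reading).encode("utf-8")}],
)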
CASE STUDY
The Challenges: When Backcountry engaged Ternary Data for data architecture
strategy, it was clear that they needed an overhaul of their legacy data
infrastructure. Some challenges included:
• Analytics and Reporting - The business and data scientists struggled to access
data within their oversubscribed and cost-prohibitive Oracle on-premises data
warehouse. Queries were slow at an average return time of more than 97
seconds, often timing out after several hours. Storage was extremely limited
due to the cost of scaling databases that combine storage with compute, forcing
pruning of valuable data. As a result, Backcountry could only store 1 year's
worth of web data, alongside its historical transaction data. This meant over
90% of the company's available useful data like Google Ads, Facebook,
historical clickstream, machine learning outputs, or large cost centers like
hourly employee costs were not available or accessible by analysts and data
scientists. Additionally, Backcountry used OBIEE for its business intelligence,
which was cumbersome and lacked both basic and advanced functionalities the
team needed to improve data governance and reporting automation.
• Overly complex infrastructure - Backcountry was maintaining both a data lake
(Amazon S3 with Databricks) and a data warehouse (Oracle). Managing data
and schemas in the data lake was a significant labor cost, and keeping data in
sync across Oracle, S3 and other systems was a constant struggle. They also
use some legacy Microsoft SQL Server and Postgres databases for reporting
right now, which will eventually be deprecated.
• Data pipelines - Backcountry used Talend, which functioned in a pure
chronological and on-prem stack. As the company grew they increasingly
required event-based (Airflow) pipelines, and cloud-native partners.
• Machine learning - Data science initiatives were slow to deliver value due to
the limitations of the Oracle data warehouse and incoherent data pipelines.
While the team had shipped demonstrably high value machine learning models,
scaling beyond basic machine learning, or scaling beyond a single department
was made impossible by the stack. Even a 10% downsampled training dataset
often took over 10 hours to pull from the database, and most of the time these
larger analytic queries would simply fail after a few hours of running.
Additionally the lack of data governance and clear business-
defined/democratized data definitions and metrics meant that even when
models could be built, they could be optimizing toward incorrect metrics.
Requirements:
• Analytics and Reporting - The business must get fast and actionable insights,
with an SLA of 30 seconds for query response. 95% of queries must return
within the SLA.
• Simple Infrastructure - The system should be cost effective and serverless
wherever possible - Avoid undifferentiated heavy lifting. Allow data
engineers to focus on data rather than operations.
• Data pipelines - The new data pipelines must scale as data and analytics
requirements grow in the future. The tool must seamlessly integrate with
BigQuery.
• Machine learning - Data scientists need a platform for rapid experimentation
and the ability to put their models into production.
• Marketing- Backcountry wants to integrate Google Analytics, Ads and
2. Simple Infrastructure
o The Google Cloud Platform Marketplace significantly reduces the
complexity and cost of Backcountry’s data stack. Backcountry now
interfaces with one main vendor, and pays one bill.
o Backcountry’s data team builds expertise in one cloud platform, on
Google Cloud’s best-of-breed data technologies.
3. Data Pipelines
o Cloud Composer is based on the open source Apache Airflow project and
the Python programming language, making it a good fit for the team’s
skillset. It provides deep integration with the entire GCP ecosystem,
combined with the power to connect to nearly any data source through
custom code.
o Fivetran’s hands-off and seamless data pipelines into BigQuery greatly
reduced the operational burden on Backcountry’s data team.
4. Machine learning
o Data scientists are able to experiment and iterate very quickly using Google
Cloud’s AI Platform, BigQuery ML, and AutoML.
o Backcountry can rapidly incorporate machine learning models into
production, generating a lot of value for both Backcountry and its
customers.
Result:
Data is available and useful across the business. Actionable data is readily
available for the business and machine learning projects, using BigQuery and
Looker.
Reduction in query times for key data. With BigQuery as the data
warehouse, query times for historical sales went from taking 10 hours to
under 3 minutes.
CONCLUSION
The challenges of slow query times, limited data accessibility, and complex
infrastructure were effectively addressed through strategic implementation of Google
Cloud Platform services. BigQuery emerged as a game-changer, reducing query times
from hours to minutes and providing actionable data for both business operations and
machine learning initiatives.
REFERENCE LINKS
• Amazon S3: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
• Amazon Athena: https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
• AWS Glue: https://aws.amazon.com/glue/
• Amazon Redshift: https://aws.amazon.com/redshift/
• Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
• Case Study: https://www.ternarydata.com/case-study-backcountrycom