Lecture 11 Google Architecture Design

Google Architecture Design

Main Content

❖ Introducing the case study: Google
❖ Overall architecture and design
philosophy
❖ Underlying communication paradigms
❖ Data storage and coordination
services
❖ Distributed computation services
❖ Summary
Google
Google case study

❖ Google is a US-based corporation
with its headquarters in Mountain View
❖ It offers Internet search and broader
web applications, earning revenue
largely from advertising
❖ The name Google is a play on googol (10^100)
❖ Google was born out of a research
project at Stanford University (1998)
➢ It is a fascinating case study with
extremely demanding requirements,
particularly in terms of scalability,
reliability, performance and openness
Google as Search Engine

❖ Crawling
▪ To locate and retrieve the contents of the
Web and pass the contents onto the
indexing subsystem.
▪ Performed by Googlebot using the deep
searching technique: it recursively follows
the links harvested from each page it reads.
▪ Caffeine (2010): a continuous process of
crawling intended to offer fresher
search results
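Crawling can be pictured as a graph traversal that starts from a seed page and follows harvested links. The sketch below is only an illustration over a hypothetical in-memory "web" (`TOY_WEB`, `crawl` and `fetch_links` are invented names); it says nothing about how Googlebot schedules or prioritises its work.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Visit pages reachable from `seed`, following each page's outgoing links."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)               # stands in for "retrieve and pass to the indexer"
        for link in fetch_links(url):   # links harvested from the page
            if link not in seen:        # never schedule the same page twice
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical four-page web: page -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl("a", lambda url: TOY_WEB.get(url, []))
```

The `seen` set is what keeps the traversal from looping on the cycle a -> c -> a.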
Google as Search Engine

❖ Indexing
▪ To produce an index for the contents of
the Web (like the index at the back of a book)
▪ An inverted index maps words appearing
in web pages and other textual web
resources onto the positions where they
occur in documents
▪ Also keeps track of which pages link to a
given site
▪ The index is used to narrow down the set of
candidate web pages from billions to
perhaps tens of thousands
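A minimal sketch of an inverted index, assuming whitespace tokenisation and a tiny hypothetical document set (`DOCS`); the function names are illustrative. It shows the two ideas on the slide: words map to positions within documents, and a query narrows the candidate set by intersecting the documents that contain each word.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def candidates(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # .get avoids creating empty entries in the defaultdict for unknown words.
    doc_sets = [set(index.get(word, {})) for word in words]
    return set.intersection(*doc_sets)

# Hypothetical documents, for illustration only.
DOCS = {
    "p1": "distributed systems concepts and design",
    "p2": "google file system design",
    "p3": "distributed file system",
}
index = build_inverted_index(DOCS)
hits = candidates(index, "distributed system")
```

Ranking is a separate step: the index only says which pages match, not which of them matter most.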
Google as Search Engine

❖ Ranking
▪ Indexing provides no information about
the relative importance of the web pages
containing a particular set of keywords
▪ Ranking assigns each page a rank, where
a higher rank indicates greater
importance of the page
▪ Google uses the PageRank algorithm
▪ Ranking in Google is affected by many
factors (up to 200 factors)
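The core idea of PageRank, that a page is important if important pages link to it, can be sketched with the simplified textbook iteration below. The damping factor of 0.85, the iteration count and the toy link graph are assumptions for illustration; this is not Google's production ranking, which combines many other factors.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to;
    every link target must itself be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "c", none links to "d".
TOY_LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(TOY_LINKS)
```

In this toy graph "c" ends up with the highest rank because it has the most incoming links, and "d" with the lowest because nothing links to it.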
Google as Search Engine

❖ Anatomy of a search engine


Google as Cloud Provider

❖ Offering other web-based applications


❖ Google is a major player in the area
of cloud computing
❖ Software as a service:
▪ Offering application-level software over
the Internet as web applications.
▪ Google Apps: Docs, Calendar, Sites…
❖ Platform as a service
▪ Offering distributed system APIs as
services across the Internet
▪ APIs used to support the development
and hosting of web applications
Overall Architecture and
Design Philosophy
• Physical model
• Overall system architecture
Physical Model

❖ Key philosophy
▪ Use very large numbers of commodity PCs to
produce a cost-effective environment for
distributed storage and computation
▪ Design the infrastructure using a range of
strategies to tolerate the failures that such
commodity hardware will inevitably suffer
❖ Physical structures
▪ Commodity PCs are organized in racks with
between 40 and 80 PCs in a given rack.
▪ Racks are organized into clusters
▪ Each rack is connected to both of its cluster's
switches for redundancy
▪ Clusters are housed in Google data centres that
are spread around the world
Overall system architecture

❖ Requirements
▪ Scalability
• Being able to deal with more data
• Being able to deal with more queries
• Seeking better results
▪ Reliability:
• Stringent reliability requirements, especially with
regard to availability of services.
• A 99.9% service level agreement (effectively, a system
guarantee) is offered to paying customers of Google Apps,
covering Gmail, Google Calendar, Google Docs,
Google Sites and Google Talk.
▪ Performance
• Overall performance of the system is critical
• Target of completing web search operations in 0.2
seconds
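To get a feel for what a 99.9% availability figure allows, the arithmetic can be sketched as below. It assumes availability is simply uptime divided by total time over the measurement period; the actual terms of any service level agreement define how downtime is measured, so treat the numbers as illustrative.

```python
def allowed_downtime_hours(availability, period_hours):
    """Hours of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours

HOURS_PER_YEAR = 365 * 24

# At 99.9% availability: roughly 8.76 hours per year,
# or roughly 43.2 minutes in a 30-day month.
yearly_hours = allowed_downtime_hours(0.999, HOURS_PER_YEAR)
monthly_minutes = allowed_downtime_hours(0.999, 30 * 24) * 60
```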
Overall system architecture

❖ Requirements (continued)
▪ Openness
• Support further development in the
range of web applications
• Provide an infrastructure that is extensible
and supports the development
of new applications
Overall system architecture

❖ Google infrastructure
▪ Communication paradigms
• Protocol buffers component
• Google publish-subscribe service
▪ Data and coordination services: GFS,
Chubby, Bigtable
▪ Distributed computation services
• MapReduce
• Sawzall
Overall system architecture

❖ Associated design principles


▪ Simplicity: do one thing and do it well
▪ Performance: every millisecond counts
▪ Stringent testing regimes on software
Underlying
Communication
Paradigms
• Remote invocation
• Publish-subscribe
• Summary of key design choices for communication
Remote invocation

❖ Protocol buffers are used for a substantial
majority of interactions within the
infrastructure
❖ They provide a language- and platform-neutral
way to specify and serialize data
❖ A language is provided for the specification
of messages
❖ The most common use of protocol buffers is to
specify RPC exchanges across the network
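To make "serialize data in a language- and platform-neutral way" concrete, the sketch below hand-encodes two fields following the publicly documented protocol buffers wire format (base-128 varints, with each field prefixed by a key of `(field_number << 3) | wire_type`). It is an illustration only: real applications use code generated from a message specification rather than a hand-written encoder, and the function names here are invented.

```python
WIRE_VARINT = 0            # wire type for integer fields
WIRE_LENGTH_DELIMITED = 2  # wire type for strings, bytes and nested messages

def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)
            return bytes(out)

def encode_int_field(field_number, value):
    key = (field_number << 3) | WIRE_VARINT
    return encode_varint(key) + encode_varint(value)

def encode_string_field(field_number, text):
    data = text.encode("utf-8")
    key = (field_number << 3) | WIRE_LENGTH_DELIMITED
    return encode_varint(key) + encode_varint(len(data)) + data

# A message with integer field 1 = 150 and string field 2 = "testing".
message = encode_int_field(1, 150) + encode_string_field(2, "testing")
```

Because the bytes carry only field numbers and wire types, a receiver written in any language can decode them given the same message specification.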
Remote invocation
Publish-subscribe

❖ A publish-subscribe system is intended to
be used where distributed events need to
be disseminated in real time and with
reliability guarantees to potentially large
numbers of recipients.
❖ Google adopts a topic-based
publish-subscribe system, providing several
channels for event streams
❖ A strong emphasis on both reliable and
timely delivery
▪ Reliability: the system maintains redundant
trees;
▪ Timely delivery: a quality of service management
technique to control message flows
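The topic-based model can be sketched as a single in-process broker, shown below. `TopicBroker` and the topic name are hypothetical; the sketch covers only the subscribe/publish interaction and deliberately omits everything that makes the real service hard (distribution, redundant delivery trees, quality of service control).

```python
from collections import defaultdict

class TopicBroker:
    """Toy topic-based publish-subscribe broker (single process, no persistence)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver `event` to every subscriber of `topic`; return the delivery count."""
        delivered = 0
        for callback in list(self._subscribers.get(topic, [])):
            callback(event)
            delivered += 1
        return delivered

broker = TopicBroker()
inbox_a, inbox_b = [], []
broker.subscribe("ads/clicks", inbox_a.append)   # hypothetical topic name
broker.subscribe("ads/clicks", inbox_b.append)
count = broker.publish("ads/clicks", {"ad_id": 7})
ignored = broker.publish("ads/views", {"ad_id": 7})  # nobody subscribed
```

Publishers and subscribers never refer to each other, only to the topic, which is what lets the set of recipients grow without changes to the publisher.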
Key design choices
Data storage and
coordination services
• Google File System (GFS)
• Chubby
• Bigtable
Google File System (GFS)

❖ GFS is a kind of distributed file system
❖ It offers abstractions similar to those of
general-purpose distributed file systems but is
specialized for Google's very particular
requirements
❖ GFS Requirements
▪ Run reliably on the physical architecture
▪ Be optimized for the patterns of usage within
Google (the types of files stored and the
patterns of access to those files)
▪ Meet all the requirements for the Google
infrastructure as a whole
Google File System (GFS)

❖ GFS interface:
▪ Provides a conventional file system
interface offering a hierarchical
namespace with individual files identified
by pathnames
▪ Many of the operations will be familiar to
users of such file systems
Google File System (GFS)

❖ GFS architecture:
▪ Files are stored in fixed-size chunks
▪ The job of GFS is to provide a mapping
from files to chunks and then to support
standard operations on files
▪ Each GFS cluster has a single master
and multiple chunkservers
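The mapping from files to chunks can be sketched as below. The 64-megabyte chunk size is the figure given in the published GFS paper; `ToyMaster`, the chunk handle strings and the chunkserver names are hypothetical and stand in for the master's metadata, not its actual data structures.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size reported for GFS

def chunk_index(byte_offset):
    """Index of the chunk that holds the given byte offset of a file."""
    return byte_offset // CHUNK_SIZE

def chunks_for_range(offset, length):
    """Indices of all chunks touched by reading `length` bytes from `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

class ToyMaster:
    """Keeps (path, chunk index) -> (chunk handle, chunkservers holding replicas)."""

    def __init__(self):
        self._chunks = {}

    def register(self, path, index, handle, servers):
        self._chunks[(path, index)] = (handle, list(servers))

    def lookup(self, path, byte_offset):
        return self._chunks[(path, chunk_index(byte_offset))]

master = ToyMaster()
master.register("/logs/crawl.dat", 0, "handle-0", ["cs1", "cs2", "cs3"])
master.register("/logs/crawl.dat", 1, "handle-1", ["cs2", "cs4", "cs5"])
handle, replicas = master.lookup("/logs/crawl.dat", CHUNK_SIZE + 10)
```

In this picture a client asks the master only for the lookup and then talks to a chunkserver for the data itself.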
Chubby

❖ Chubby is a crucial service at the heart of
the Google infrastructure offering storage
and coordination services for other
infrastructure services
❖ Four distinct capabilities
▪ Provides coarse-grained distributed locks to
synchronize distributed activities
▪ Provides a file system offering the reliable
storage of small files
▪ Supports the election of a primary in a set of
replicas
▪ Used as a name service within Google
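Three of these capabilities combine naturally: replicas compete for a lock, the winner records its identity in the small file associated with the lock, and everyone else reads that file to find the primary. The sketch below imitates this with an in-memory stand-in. `ToyLockService`, its methods and the lock path are hypothetical and are not the Chubby API.

```python
class ToyLockService:
    """In-memory stand-in for a lock service with small files (not the Chubby API)."""

    def __init__(self):
        self._holders = {}  # lock path -> client currently holding the lock
        self._files = {}    # path -> small file contents

    def try_acquire(self, path, client):
        if path in self._holders:
            return False
        self._holders[path] = client
        return True

    def release(self, path, client):
        if self._holders.get(path) == client:
            del self._holders[path]

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files.get(path)

def elect_primary(service, lock_path, candidates):
    """Every candidate tries the lock; the one that gets it advertises itself."""
    for candidate in candidates:
        if service.try_acquire(lock_path, candidate):
            service.write(lock_path, candidate)
    return service.read(lock_path)

service = ToyLockService()
LOCK = "/ls/toy-cell/index-service/primary"  # illustrative path only
primary = elect_primary(service, LOCK, ["replica-1", "replica-2", "replica-3"])
```

The lock is coarse-grained in the sense that it is expected to be held for a long time (the whole tenure of the primary), not for the duration of a single operation.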
Chubby

❖ Chubby interface
▪ Provides an abstraction based on a file system
▪ Files are organized into a hierarchical
namespace using directory structures
Chubby

❖ Chubby architecture:
▪ A single instance of a Chubby system is known
as a cell;
▪ Each cell consists of a relatively small number
of replicas (typically five) with one designated
as the master
▪ Client applications access this set of replicas
via a Chubby library
▪ Each replica maintains a small database whose
elements are entities in the Chubby namespace
▪ A Chubby session is a relationship between a
client and a Chubby cell
Chubby

❖ Chubby architecture:
Bigtable

❖ One option for Google would be to implement
(or reuse) a distributed database with a full
set of relational operators
❖ The achievement of good performance and
scalability in such distributed databases is
recognized as a difficult problem
❖ Google therefore has introduced Bigtable,
which retains the table model offered by
relational databases but with a much
simpler interface designed to support the
efficient storage and retrieval of quite
massive structured datasets
Bigtable

❖ Bigtable interface
▪ Distributed storage system that supports the
storage of potentially vast volumes of structured
data
▪ A given table is a three-dimensional structure
containing cells indexed by a row key, a column
key and a timestamp
▪ Bigtable API:
• creation, deletion of tables, column families within
tables;
• accessing data from given rows;
• writing or deleting cell values
• carrying out atomic row mutations
• iterating over different column families…
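The three-dimensional cell model can be sketched with nested dictionaries, as below. `ToyTable` and its method names are hypothetical and do not reproduce the Bigtable API; the row and column names are illustrative. The point is only that a cell is addressed by (row key, column key, timestamp) and that several timestamped versions of a cell can coexist.

```python
class ToyTable:
    """Cells addressed by (row key, column key, timestamp); not the Bigtable API."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one not later than `timestamp`."""
        versions = self._cells.get((row, column), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        if not eligible:
            return None
        return versions[max(eligible)]

    def read_row(self, row):
        """Latest value of every column stored for the given row."""
        return {
            column: versions[max(versions)]
            for (r, column), versions in self._cells.items()
            if r == row
        }

table = ToyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 3, "<html>v3</html>")
table.put("com.example.www", "anchor:other.example", 2, "Example")
latest = table.get("com.example.www", "contents:")
as_of_2 = table.get("com.example.www", "contents:", timestamp=2)
```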
Bigtable

❖ Bigtable interface
Bigtable

❖ Bigtable architecture
▪ A Bigtable is broken up into tablets, with a given
tablet being approximately 100–200 megabytes
in size
▪ The job of Bigtable is to manage tablets and to
support the operations described above for accessing
and changing the associated structured data
▪ A single instance of a Bigtable implementation
is known as a cluster
▪ Each cluster can store a number of tables
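Since rows are kept in sorted order and each tablet covers a contiguous range of row keys, finding the tablet for a row is a search over the split points. The sketch below shows that idea with a binary search; `ToyTabletMap`, the split keys and the server names are hypothetical, and the real system locates tablets through its own metadata hierarchy rather than a single in-memory list.

```python
import bisect

class ToyTabletMap:
    """Maps row keys to tablets, each tablet being a contiguous range of row keys."""

    def __init__(self, split_keys, servers):
        # Tablet i holds rows below split_keys[i]; the last tablet holds the rest.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per tablet")
        self._split_keys = sorted(split_keys)
        self._servers = list(servers)

    def tablet_for(self, row_key):
        return bisect.bisect_right(self._split_keys, row_key)

    def server_for(self, row_key):
        return self._servers[self.tablet_for(row_key)]

# Three tablets: rows < "g", rows from "g" up to (not including) "p", rows >= "p".
tablets = ToyTabletMap(["g", "p"], ["server-1", "server-2", "server-3"])
```

For example, `tablets.server_for("google")` falls in the middle range. When a tablet grows too large it can be split by adding a split key, without touching the other tablets.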
Bigtable

❖ Bigtable architecture
Design choices
Distributed computation
services
• MapReduce
• Sawzall
MapReduce

❖ MapReduce is a simple programming
model to support the development of
distributed computations over very large
datasets, hiding underlying
detail from the programmer including:
▪ details related to the parallelization of the
computation
▪ monitoring and recovery from failure
▪ data management and load balancing
onto the underlying physical
infrastructure
MapReduce

❖ MapReduce interface: the key principle
behind MapReduce is the recognition
that many parallel computations share
the same overall pattern:
▪ break the input data into a number of
chunks;
▪ carry out initial processing on these
chunks of data to produce intermediary
results;
▪ combine the intermediary results to
produce the final output.
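The three-step pattern can be sketched with the classic word-count example. The sketch runs sequentially in a single process and uses invented names (`run_mapreduce`, `map_fn`, `reduce_fn`); it shows only the shape of the two user-supplied functions, not the parallel execution, failure recovery or load balancing that the real library hides.

```python
from collections import defaultdict

def map_fn(_chunk_name, text):
    """Initial processing of one chunk: emit (word, 1) for every word."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Combine all intermediary values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(chunks, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map each chunk, group by key, reduce."""
    intermediate = defaultdict(list)
    for name, contents in chunks.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(intermediate.items()))

# The input is already broken into two chunks (hypothetical data).
counts = run_mapreduce(
    {"chunk-1": "the cat sat", "chunk-2": "the dog sat"},
    map_fn,
    reduce_fn,
)
```

Because each call to `map_fn` looks at one chunk only and each call to `reduce_fn` looks at one key only, both steps could run on many machines at once.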
MapReduce

❖ MapReduce interface
MapReduce

❖ MapReduce architecture
▪ Implemented as a library, so that the programmer can focus
on specifying the map and reduce functions. Key phases:
• The first stage is to split the input file into M pieces
• The MapReduce library then starts a set of worker machines
(workers) from the pool available in the cluster
• A worker assigned a map task first reads the contents of the
input piece allocated to that task, extracts the key-value pairs
and supplies them as input to the map function
• The intermediary results are buffered in memory and periodically
written to a file local to the map computation, partitioned
into one region per reduce task
• When a worker is assigned to carry out a reduce function, it
reads its corresponding partition from the local disk of the
map workers using RPC
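The link between the map and reduce phases is the partitioning of intermediary pairs, so that every pair with the same key reaches the same reduce task. The sketch below assumes the commonly described default of hashing the key modulo the number of reduce tasks; the use of `zlib.crc32` as the hash, and the names `split_input`, `partition` and `shuffle`, are choices made for this illustration only.

```python
import zlib

def split_input(records, num_map_tasks):
    """Split the input into M pieces, one per map task (round-robin here)."""
    return [records[i::num_map_tasks] for i in range(num_map_tasks)]

def partition(key, num_reduce_tasks):
    """Choose the reduce task for a key: hash(key) mod R, with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def shuffle(map_outputs, num_reduce_tasks):
    """Group the (key, value) pairs from all map tasks into R regions."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for pairs in map_outputs:
        for key, value in pairs:
            regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions

pieces = split_input(list(range(10)), 3)                   # M = 3
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]  # output of two map tasks
regions = shuffle(map_outputs, 2)                           # R = 2
```

A stable hash matters here: every map worker must agree on which region a key belongs to, since reduce workers later fetch "their" region from each map worker.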
MapReduce

❖ MapReduce architecture
Sawzall

❖ Sawzall is an interpreted
programming language for performing
parallel data analysis over very large
datasets in highly distributed
environments such as that provided by
the Google physical infrastructure.
Design choices
Thank You !
