Lecture 11 Google Architecture Design
Main Content
❖ Crawling
▪ To locate and retrieve the contents of the
Web and pass the contents onto the
indexing subsystem.
▪ Performed by Googlebot, which recursively
follows the links harvested from each page
(the deep searching technique).
▪ Caffeine, introduced in 2010, made crawling
and indexing a continuous process in order
to keep search results fresher
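As a rough illustration of link-following crawling, the sketch below walks a tiny in-memory link graph. The `TOY_WEB` dictionary, the `crawl` function and the `fetch_links` callback are hypothetical names invented for this example; nothing here reflects how Googlebot is actually implemented.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(seed, fetch_links):
    """Visit every page reachable from seed by following harvested links."""
    visited = []
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        # A real crawler would hand the page contents to the indexer here.
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # schedule each page only once
                seen.add(link)
                frontier.append(link)
    return visited

pages = crawl("a.example", lambda url: TOY_WEB.get(url, []))
print(pages)
```

The `seen` set is what keeps the crawl from looping forever on cyclic links such as `c.example` pointing back to `a.example`.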
Google as Search Engine
❖ Indexing
▪ To produce an index for the contents of
the Web (like index at the back of books)
▪ An inverted index maps words appearing
in web pages and other textual web
resources onto the positions where they
occur in documents
▪ Also keeps track of which pages link to a
given site
▪ Using the index to narrow down the set of
candidate web pages from billions to
perhaps tens of thousands
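A minimal sketch of an inverted index with word positions, and of using it to narrow a query down to candidate documents. The function names and the three sample documents are made up for illustration; a production index is far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def candidates(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result

docs = {
    "doc1": "distributed systems concepts and design",
    "doc2": "design of distributed file systems",
    "doc3": "cooking for beginners",
}
index = build_inverted_index(docs)
print(candidates(index, "distributed design"))
```

Only the documents returned by `candidates` would then need to be ranked, which is the narrowing step described above.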
❖ Ranking
▪ Indexing provides no information about
the relative importance of the web pages
containing a particular set of keywords
▪ Ranking assigns each page a rank, where
a higher rank indicates greater
importance of the page
▪ Google uses the PageRank algorithm
▪ Ranking in Google is affected by many
factors (up to 200 factors)
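The slide names PageRank without giving a formula. The sketch below is the commonly taught simplified version (power iteration with a damping factor), not Google's production ranking; the damping value 0.85, the iteration count and the four-page link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a {page: [outlinks]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked from three pages and ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.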
❖ Key philosophy
▪ Use very large numbers of commodity PCs to
produce a cost-effective environment for
distributed storage and computation
▪ Individual commodity PCs are prone to failure,
so the infrastructure is designed with a range
of strategies to tolerate such failures
❖ Physical structures
▪ Commodity PCs are organized in racks with
between 40 and 80 PCs in a given rack.
▪ Racks are organized into clusters, each with
two high-bandwidth switches
▪ Each rack is connected to both switches for
redundancy
▪ Clusters are housed in Google data centres that
are spread around the world
Overall system architecture
❖ Requirements
▪ Scalability
• Being able to deal with more data
• Being able to deal with more queries
• Seeking better results
▪ Reliability
• Stringent reliability requirements, especially with
regard to availability of services.
• A 99.9% service level agreement (effectively, a system
guarantee) to paying customers of Google Apps
covering Gmail, Google Calendar, Google Docs,
Google Sites and Google Talk.
▪ Performance
• Overall performance of the system is critical
• Target of completing web search operations in 0.2
seconds
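To make the 99.9% availability figure concrete, the short calculation below converts an availability target into the downtime it permits. The function name and the choice of a 365-day year and a 30-day month are assumptions for the example; the actual terms of any service level agreement may measure downtime differently.

```python
def allowed_downtime_hours(availability, period_hours):
    """Hours of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours

HOURS_PER_YEAR = 365 * 24   # assumption: 365-day year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month

per_year = allowed_downtime_hours(0.999, HOURS_PER_YEAR)
per_month_minutes = allowed_downtime_hours(0.999, HOURS_PER_MONTH) * 60

print(f"per year:  {per_year:.2f} hours")
print(f"per month: {per_month_minutes:.1f} minutes")
```

Under these assumptions, 99.9% leaves a budget of roughly 8.76 hours per year, or about 43 minutes per month.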
❖ Requirements (continued)
▪ Openness
• Support further development in the
range of web applications
• an infrastructure that is extensible and
provides support for the development
of new applications
❖ Google infrastructure
▪ Communication paradigms
• Protocol buffers component
• Google publish-subscribe service
▪ Data and coordination services: GFS,
Chubby, Bigtable
▪ Distributed computation services
• MapReduce
• Sawzall
Google File System (GFS)
❖ GFS interface:
▪ Provides a conventional file system
interface offering a hierarchical
namespace with individual files identified
by pathnames
▪ Many of the operations will be familiar to
users of such file systems (create, delete,
open, close, read, write); GFS adds
snapshot and record append
❖ GFS architecture:
▪ Files are stored in fixed-size chunks
▪ The job of GFS is to provide a mapping
from files to chunks and then to support
standard operations on files
▪ Each GFS cluster has a single master
and multiple chunkservers
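The file-to-chunk mapping can be sketched as simple offset arithmetic. The 64 MB chunk size comes from the published GFS design and is not stated on the slide, so treat it as an assumption; `chunk_span` is a hypothetical helper, not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: 64 MB chunks, not given on the slide

def chunk_span(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range of a file into (chunk_index, start_in_chunk, nbytes)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, start = divmod(offset, chunk_size)
        nbytes = min(chunk_size - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces

# A 10 MB read starting 60 MB into a file straddles chunks 0 and 1.
MB = 1024 * 1024
print(chunk_span(60 * MB, 10 * MB))
```

In this picture a client would ask the master which chunkservers hold each chunk index and then read the bytes from those chunkservers.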
Chubby
❖ Chubby interface
▪ Provides an abstraction based on a file system
▪ Files are organized into a hierarchical
namespace using directory structures
❖ Chubby architecture:
▪ A single instance of a Chubby system is known
as a cell;
▪ Each cell consists of a relatively small number
of replicas (typically five) with one designated
as the master
▪ Client applications access this set of replicas
via a Chubby library
▪ Each replica maintains a small database whose
elements are entities in the Chubby namespace
▪ A Chubby session is a relationship between a
client and a Chubby cell
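One way to see why a cell typically has five replicas is to count how many failures a majority-based scheme can survive. The sketch assumes that a cell can make progress only while a majority of its replicas is reachable, as in majority-based consensus protocols; the function names are invented for the example.

```python
def majority(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas):
    """Replicas that may fail while a majority is still available."""
    return replicas - majority(replicas)

for n in (3, 5, 7):
    print(n, "replicas: majority", majority(n),
          "tolerates", tolerated_failures(n), "failures")
```

Under this assumption a five-replica cell keeps working with two replicas down, and an even number of replicas adds cost without tolerating more failures.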
Bigtable
❖ Bigtable interface
▪ Distributed storage system that supports the
storage of potentially vast volumes of structured
data
▪ A given table is a three-dimensional structure
containing cells indexed by a row key, a column
key and a timestamp
▪ Bigtable API:
• creation, deletion of tables, column families within
tables;
• accessing data from given rows;
• writing or deleting cell values
• carrying out atomic row mutations
• iterating over different column families…
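The three-dimensional cell model can be mimicked with nested dictionaries. `ToyTable` and its methods are hypothetical and do not correspond to the real Bigtable client API; the row and column names are sample data.

```python
class ToyTable:
    """Toy table whose cells are indexed by (row key, column key, timestamp)."""

    def __init__(self):
        self._cells = {}  # row key -> column key -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self._cells.get(row, {}).get(column, {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        if not eligible:
            return None
        return versions[max(eligible)]

    def delete(self, row, column):
        """Remove all versions of one cell."""
        self._cells.get(row, {}).pop(column, None)

table = ToyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 3, "<html>v3</html>")
print(table.get("com.example.www", "contents:"))
print(table.get("com.example.www", "contents:", timestamp=2))
```

Keeping several timestamped versions per cell is what allows a reader to ask for the state of a cell as of an earlier time.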
❖ Bigtable architecture
▪ A Bigtable is broken up into tablets, with a given
tablet being approximately 100–200 megabytes
in size
▪ The role of the Bigtable infrastructure is to
manage tablets and to support the operations
described above for accessing and changing
the associated structured data
▪ A single instance of a Bigtable implementation
is known as a cluster
▪ Each cluster can store a number of tables
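A sketch of how a row key might be routed to a tablet, assuming each tablet holds a contiguous, sorted range of row keys (this comes from the published Bigtable design, not from the slide). `TabletMap` and the split keys are invented for the example.

```python
import bisect

class TabletMap:
    """Route row keys to tablets that each cover a contiguous key range."""

    def __init__(self, split_keys):
        # Tablet 0 holds keys below split_keys[0]; tablet i holds keys in
        # [split_keys[i-1], split_keys[i]); the last tablet holds the rest.
        self._splits = sorted(split_keys)

    def tablet_for(self, row_key):
        return bisect.bisect_right(self._splits, row_key)

    def tablet_count(self):
        return len(self._splits) + 1

tablets = TabletMap(["g", "n", "t"])
for key in ("apple", "g", "monkey", "zebra"):
    print(key, "-> tablet", tablets.tablet_for(key))
```

Because the split keys are sorted, the lookup is a binary search, and a tablet that grows too large could be divided by adding one more split key.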
Design choices
Distributed computation
services
• MapReduce
• Sawzall
MapReduce
❖ MapReduce interface
❖ MapReduce architecture
▪ Implemented as a library, so the programmer focuses
on specifying the map and reduce functions. Key phases:
• The first stage is to split the input file into M pieces
• The MapReduce library then starts a set of worker machines
(workers) from the pool available in the cluster
• A worker assigned a map task first reads the contents of the
input piece allocated to that task, extracts the key-value pairs
and supplies them as input to the map function
• The output of the map function is held in intermediary buffers,
which are periodically written to a file local to the map
computation
• When a worker is assigned to carry out a reduce function, it
reads its corresponding partition from the local disk of the
map workers using RPC
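The phases above can be imitated in a single process with a word count. This is a sketch only: `run_mapreduce`, `map_fn`, `reduce_fn` and `partition` are hypothetical names, R (the number of reduce partitions) is an assumption not given on the slide, and in-memory lists stand in for worker machines, local files and RPC.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)

def partition(key, r):
    """Deterministically assign an intermediate key to one of R regions."""
    return sum(key.encode("utf-8")) % r

def run_mapreduce(records, m, r):
    # Phase 1: split the input into M pieces.
    pieces = [records[i::m] for i in range(m)]
    # Phase 2: each map task writes its output into R local regions.
    local_files = []
    for piece in pieces:
        regions = [defaultdict(list) for _ in range(r)]
        for key, value in piece:
            for out_key, out_value in map_fn(key, value):
                regions[partition(out_key, r)][out_key].append(out_value)
        local_files.append(regions)
    # Phase 3: reduce task j reads region j from every map task.
    results = {}
    for j in range(r):
        merged = defaultdict(list)
        for regions in local_files:
            for out_key, values in regions[j].items():
                merged[out_key].extend(values)
        for out_key in sorted(merged):
            word, total = reduce_fn(out_key, merged[out_key])
            results[word] = total
    return results

records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
counts = run_mapreduce(records, m=2, r=3)
print(counts)
```

The partition function must be deterministic so that every map task sends a given word to the same reduce task, whichever input piece the word appeared in.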
Sawzall
❖ Sawzall is an interpreted
programming language for performing
parallel data analysis over very large
datasets in highly distributed
environments such as that provided by
the Google physical infrastructure.