DS Lecture 5

The document discusses distributed file systems, focusing on client-server architectures and the Google File System (GFS), which is designed for large data-intensive applications and fault tolerance. It also introduces parallel file systems, which separate metadata and data storage across multiple servers, and highlights the Hadoop Distributed File System (HDFS) as an open-source implementation inspired by GFS, emphasizing its scalability and fault tolerance. Key design goals for both GFS and HDFS include handling large files, high throughput, and the ability to run on commodity hardware.


Distributed Systems

Dr. Eman Monir

Faculty of Computers and Artificial Intelligence


Benha University

Fall 2022-2023
File System
Client-server file systems

Consist of:
• Central servers
– A point of congestion and a single point of failure
– Alleviated somewhat with replication and client caching
• E.g., Coda uses tokens (aka leases, oplocks)
– Limited replication can still lead to congestion
• File data is still centralized
– A file server stores all of a file's data; it is not split across servers
– Even if replication is in place, a client downloads all data for a file from one server
• File sizes are limited to the capacity available on a single server
– What if you need a 1,000 TB file?
Google File System (GFS)
GFS Goals

• Scalable distributed file system
• Designed for large, data-intensive applications
• Fault-tolerant; runs on commodity hardware
• Delivers high performance to a large number of clients
Design Assumptions

• Assumptions made for conventional file systems don't hold
– E.g., "most files are small", "most have short lifetimes"
• Component failures are the norm, not an exception
– The file system runs on thousands of storage machines
– Some percentage is not working at any given time
• Files are huge; multi-TB files are the norm
– It doesn't make sense to work with billions of small, KB-sized files
– I/O operation and block size choices are also affected
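To make the block-size point concrete, here is a quick back-of-the-envelope sketch (my own, not from the lecture) counting how many block records a file system must track for a 10 TB dataset at a conventional 4 KB block size versus a GFS-style 64 MB one:

```python
# Back-of-the-envelope sketch (assumptions mine): count the block records
# that must be tracked for a 10 TB dataset under two block sizes.
TB = 10 ** 12

def num_blocks(dataset_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed, rounding up for a partial last block."""
    return -(-dataset_bytes // block_bytes)  # ceiling division

small = num_blocks(10 * TB, 4 * 1024)          # conventional 4 KB blocks
large = num_blocks(10 * TB, 64 * 1024 * 1024)  # GFS-style 64 MB blocks

print(small)  # ~2.4 billion block records
print(large)  # ~150 thousand block records
```

The huge blocks shrink the bookkeeping by roughly four orders of magnitude, which is why billions of KB-sized files would be unmanageable at this scale.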
Design Assumptions
• File access:
– Most files are appended to, not overwritten
• Random writes within a file are almost never done
• Once created, files are mostly read, often sequentially
– The workload is mostly:
• Reads: large streaming reads and small random reads – these dominate
• Large appends
• Hundreds of processes may append to a file concurrently
• GFS will store a modest number of files for its scale – approximately a few million
• Co-designing the GFS API together with the applications benefits the system
• Applications can handle a relaxed consistency model
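The append-dominated workload can be illustrated with a toy sketch (mine, not GFS's actual record-append protocol): many writer threads append records to one shared log, and each record lands intact even though the interleaving across writers is unspecified.

```python
# Toy model (assumptions mine) of concurrent atomic appends: the lock
# stands in for the server serializing appends, so no record is lost or
# torn, but the final ordering is not guaranteed.
import threading

log = []                  # stands in for one append-only file
lock = threading.Lock()   # models the server serializing appends

def record_append(record: str) -> None:
    with lock:            # each append is atomic; global order is not
        log.append(record)

threads = [threading.Thread(target=record_append, args=(f"rec-{i}",))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(log) == 100                            # nothing was lost
assert set(log) == {f"rec-{i}" for i in range(100)}
```

This is the relaxed contract the slide describes: applications tolerate unspecified ordering in exchange for cheap, highly concurrent appends.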
What is a parallel file system?

Definition
• Conventional file systems
– Store data & metadata on the same storage device
– Example:
• Linux directories are just files that contain lists of names & inodes
• inodes are data structures, placed in well-defined areas of the disk, that contain information about a file
• The lists of block numbers holding a file's data are allocated from the same set of disk blocks used for file data
• Parallel file systems:
– File data can span multiple servers
– Metadata can be on separate servers
– Metadata = information about the file
• Includes the name, access permissions, timestamps, file size, & locations of data blocks
– Data = actual file contents
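The metadata/data split above can be modeled with a toy in-memory sketch (my own, not from the lecture): a metadata server maps each file name to a list of (server, block_number) locations, while the data servers hold only raw block contents.

```python
# Toy parallel-file-system model (assumptions mine): metadata lives apart
# from the data blocks, which are spread round-robin across data servers.
metadata = {}                                   # name -> [(server, block_no), ...]
data_servers = {"s1": {}, "s2": {}, "s3": {}}   # server -> {block_no: bytes}
BLOCK_SIZE = 4                                  # tiny, for illustration only

def write_file(name: str, content: bytes) -> None:
    locations = []
    for i in range(0, len(content), BLOCK_SIZE):
        server = f"s{(i // BLOCK_SIZE) % 3 + 1}"      # round-robin placement
        block_no = len(data_servers[server])          # next free slot there
        data_servers[server][block_no] = content[i:i + BLOCK_SIZE]
        locations.append((server, block_no))
    metadata[name] = locations                        # stored separately from data

def read_file(name: str) -> bytes:
    # One small metadata lookup, then bulk reads spread over the data servers.
    return b"".join(data_servers[s][b] for s, b in metadata[name])

write_file("big.log", b"hello parallel file systems")
assert read_file("big.log") == b"hello parallel file systems"
```

Because the blocks of one file land on several servers, its size is no longer bounded by any single server's capacity, which answers the 1,000 TB question from the client-server slide.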
Basic Design Idea
• Use separate servers to store metadata
– Metadata includes lists of (server, block_number) pairs that identify which blocks on which servers hold file data
– We need more bandwidth for data access than for metadata access
• Metadata is small; file data can be huge
• Use large logical blocks
– Most "normal" file systems are optimized for small files
• A block size is typically 4 KB
– Expect huge files, so use huge blocks … >1,000x larger
• The list of blocks that makes up a file becomes easier to manage
• Replicate data
– Expect some servers to be down
– Store copies of data blocks on multiple servers
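The replication idea can be sketched in a few lines (my own model, not any real system's protocol): each block is stored on several servers, so a read can fall back to another replica when one server is down.

```python
# Toy replication model (assumptions mine): place 3 copies of each block
# on distinct servers and read from the first replica that is still up.
import random

REPLICAS = 3
servers = {f"s{i}": {} for i in range(5)}   # server -> {block_id: bytes}
block_map = {}                              # block_id -> [server names]

def store_block(block_id: int, payload: bytes) -> None:
    placed = random.sample(sorted(servers), REPLICAS)  # 3 distinct servers
    for name in placed:
        servers[name][block_id] = payload
    block_map[block_id] = placed

def read_block(block_id: int, down: set) -> bytes:
    for name in block_map[block_id]:
        if name not in down:               # skip failed servers
            return servers[name][block_id]
    raise IOError("all replicas unavailable")

store_block(0, b"chunk-0")
# Even with two of the three replicas down, the block is still readable.
assert read_block(0, down=set(block_map[0][:2])) == b"chunk-0"
```

With three replicas, the block stays readable through any two simultaneous server failures, which matches the assumption that some fraction of machines is always down.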
HDFS: Hadoop Distributed File System
Primary storage system for Hadoop applications
• Hadoop
– Software library: a framework that allows for the distributed processing of large data sets across clusters of computers
• Hadoop includes:
1 – Hadoop Distributed File System (HDFS)
2 – MapReduce™: a software framework for distributed processing of large data sets on compute clusters
• Examples of related Hadoop projects:
1 – Avro™: a data serialization system
2 – Cassandra™: a scalable multi-master database with no single points of failure
3 – Chukwa™: a data collection system for managing large distributed systems
4 – HBase™: a scalable, distributed database that supports structured data storage for large tables
5 – Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying
6 – Mahout™: a scalable machine learning and data mining library
7 – Pig™: a high-level data-flow language and execution framework for parallel computation
8 – ZooKeeper™: a high-performance coordination service for distributed applications
HDFS Design Goals & Assumptions
Definition
• HDFS is an open-source (Apache) implementation inspired by the GFS design
• Similar goals and the same basic design as GFS
– Runs on commodity hardware
– Highly fault tolerant
– High throughput
– Designed for large data sets
– OK to relax some POSIX requirements
– Large-scale deployments
• An instance of HDFS may comprise 1000s of servers
• Each server stores part of the file system's data
• But
– No support for concurrent appends
HDFS Architecture

• Written in Java
• Master/slave architecture
• Single NameNode
– Master server responsible for the namespace & access control
• Multiple DataNodes
– Each is responsible for managing the storage attached to its node
• A file is split into one or more blocks
– Typical block size = 128 MB (vs. 64 MB for GFS)
– Blocks are stored on a set of DataNodes
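As a quick worked example (mine, not from the slides), here is how a 1 TB file splits into blocks under HDFS's typical 128 MB block size versus GFS's 64 MB:

```python
# Block-count comparison (assumptions mine): a 1 TB file under the two
# block sizes mentioned on the slide.
MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int) -> int:
    """Number of blocks for a file, rounding up for a partial last block."""
    return -(-file_size // block_size)  # ceiling division

one_tb = 1024 * 1024 * MB
print(split_into_blocks(one_tb, 128 * MB))  # 8192 blocks under HDFS
print(split_into_blocks(one_tb, 64 * MB))   # 16384 blocks under GFS
```

Either way, the NameNode tracks only thousands of block locations per huge file rather than the hundreds of millions a 4 KB block size would require.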
