DS Lecture 5

The document discusses distributed file systems, focusing on client-server architectures and the Google File System (GFS), which is designed for large data-intensive applications and fault tolerance. It also introduces parallel file systems, which separate metadata and data storage across multiple servers, and highlights the Hadoop Distributed File System (HDFS) as an open-source implementation inspired by GFS, emphasizing its scalability and fault tolerance. Key design goals for both GFS and HDFS include handling large files, high throughput, and the ability to run on commodity hardware.


Distributed Systems

Dr. Eman Monir

Faculty of Computers and Artificial Intelligence


Benha University

Fall 2022-2023
File System
Client-server file systems

Consist of:
• Central servers
– A point of congestion and a single point of failure
– Alleviated somewhat with replication and client caching
• E.g., Coda uses tokens (aka leases, oplocks)
– Limited replication can still lead to congestion
• File data is still centralized
– A file server stores all of a file's data; it is not split across servers
– Even if replication is in place, a client downloads all data for a file from one server
• File sizes are limited to the capacity available on a single server
– What if you need a 1,000 TB file?
Google File System (GFS)
GFS Goals

• Scalable distributed file system
• Designed for large, data-intensive applications
• Fault-tolerant; runs on commodity hardware
• Delivers high performance to a large number of clients
Design Assumptions

• Assumptions made for conventional file systems don't hold
– E.g., "most files are small", "most have short lifetimes"
• Component failures are the norm, not an exception
– The file system runs on thousands of storage machines
– Some percentage is not working at any given time
• Files are huge; multi-TB files are the norm
– It doesn't make sense to work with billions of small, KB-sized files
– I/O operation and block size choices are also affected
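To make the block-size point concrete, here is a quick back-of-the-envelope sketch (my own, not from the lecture) counting how many block records a file system must track for a 10 TB dataset at a conventional 4 KB block size versus a GFS-style 64 MB one:

```python
# Back-of-the-envelope sketch (assumptions mine): count the block records
# that must be tracked for a 10 TB dataset under two block sizes.
TB = 10 ** 12

def num_blocks(dataset_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed, rounding up for a partial last block."""
    return -(-dataset_bytes // block_bytes)  # ceiling division

small = num_blocks(10 * TB, 4 * 1024)          # conventional 4 KB blocks
large = num_blocks(10 * TB, 64 * 1024 * 1024)  # GFS-style 64 MB blocks

print(small)  # ~2.4 billion block records
print(large)  # ~150 thousand block records
```

The huge blocks shrink the bookkeeping by roughly four orders of magnitude, which is why billions of KB-sized files would be unmanageable at this scale.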
Design Assumptions
• File access:
– Most files are appended to, not overwritten
• Random writes within a file are almost never done
• Once created, files are mostly read, often sequentially
– The workload is mostly:
• Reads: large streaming reads and small random reads – these dominate
• Large appends
• Hundreds of processes may append to a file concurrently
• GFS will store a modest number of files for its scale – approximately a few million
• Co-designing the GFS API together with the applications benefits the system
• Applications can handle a relaxed consistency model
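The append-dominated workload can be illustrated with a toy sketch (mine, not GFS's actual record-append protocol): many writer threads append records to one shared log, and each record lands intact even though the interleaving across writers is unspecified.

```python
# Toy model (assumptions mine) of concurrent atomic appends: the lock
# stands in for the server serializing appends, so no record is lost or
# torn, but the final ordering is not guaranteed.
import threading

log = []                  # stands in for one append-only file
lock = threading.Lock()   # models the server serializing appends

def record_append(record: str) -> None:
    with lock:            # each append is atomic; global order is not
        log.append(record)

threads = [threading.Thread(target=record_append, args=(f"rec-{i}",))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(log) == 100                            # nothing was lost
assert set(log) == {f"rec-{i}" for i in range(100)}
```

This is the relaxed contract the slide describes: applications tolerate unspecified ordering in exchange for cheap, highly concurrent appends.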
What is a parallel file system?

Definition
• Conventional file systems
– Store data & metadata on the same storage device
– Example:
• Linux directories are just files that contain lists of names & inodes
• inodes are data structures, placed in well-defined areas of the disk, that contain information about a file
• The lists of block numbers holding a file's data are allocated from the same set of disk blocks used for file data
• Parallel file systems:
– File data can span multiple servers
– Metadata can be on separate servers
– Metadata = information about the file
• Includes the name, access permissions, timestamps, file size, & locations of data blocks
– Data = actual file contents
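The metadata/data split above can be modeled with a toy in-memory sketch (my own, not from the lecture): a metadata server maps each file name to a list of (server, block_number) locations, while the data servers hold only raw block contents.

```python
# Toy parallel-file-system model (assumptions mine): metadata lives apart
# from the data blocks, which are spread round-robin across data servers.
metadata = {}                                   # name -> [(server, block_no), ...]
data_servers = {"s1": {}, "s2": {}, "s3": {}}   # server -> {block_no: bytes}
BLOCK_SIZE = 4                                  # tiny, for illustration only

def write_file(name: str, content: bytes) -> None:
    locations = []
    for i in range(0, len(content), BLOCK_SIZE):
        server = f"s{(i // BLOCK_SIZE) % 3 + 1}"      # round-robin placement
        block_no = len(data_servers[server])          # next free slot there
        data_servers[server][block_no] = content[i:i + BLOCK_SIZE]
        locations.append((server, block_no))
    metadata[name] = locations                        # stored separately from data

def read_file(name: str) -> bytes:
    # One small metadata lookup, then bulk reads spread over the data servers.
    return b"".join(data_servers[s][b] for s, b in metadata[name])

write_file("big.log", b"hello parallel file systems")
assert read_file("big.log") == b"hello parallel file systems"
```

Because the blocks of one file land on several servers, its size is no longer bounded by any single server's capacity, which answers the 1,000 TB question from the client-server slide.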
Basic Design Idea
• Use separate servers to store metadata
– Metadata includes lists of (server, block_number) pairs that identify which blocks on which servers hold file data
– We need more bandwidth for data access than for metadata access
• Metadata is small; file data can be huge
• Use large logical blocks
– Most "normal" file systems are optimized for small files
• A block size is typically 4 KB
– Expect huge files, so use huge blocks … >1,000x larger
• The list of blocks that makes up a file becomes easier to manage
• Replicate data
– Expect some servers to be down
– Store copies of data blocks on multiple servers
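The replication idea can be sketched in a few lines (my own model, not any real system's protocol): each block is stored on several servers, so a read can fall back to another replica when one server is down.

```python
# Toy replication model (assumptions mine): place 3 copies of each block
# on distinct servers and read from the first replica that is still up.
import random

REPLICAS = 3
servers = {f"s{i}": {} for i in range(5)}   # server -> {block_id: bytes}
block_map = {}                              # block_id -> [server names]

def store_block(block_id: int, payload: bytes) -> None:
    placed = random.sample(sorted(servers), REPLICAS)  # 3 distinct servers
    for name in placed:
        servers[name][block_id] = payload
    block_map[block_id] = placed

def read_block(block_id: int, down: set) -> bytes:
    for name in block_map[block_id]:
        if name not in down:               # skip failed servers
            return servers[name][block_id]
    raise IOError("all replicas unavailable")

store_block(0, b"chunk-0")
# Even with two of the three replicas down, the block is still readable.
assert read_block(0, down=set(block_map[0][:2])) == b"chunk-0"
```

With three replicas, the block stays readable through any two simultaneous server failures, which matches the assumption that some fraction of machines is always down.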
HDFS: Hadoop Distributed File System
Primary storage system for Hadoop applications
• Hadoop
– Software library: a framework that allows for the distributed processing of large data sets across clusters of computers
• Hadoop includes:
1 – Hadoop Distributed File System (HDFS)
2 – MapReduce™: a software framework for distributed processing of large data sets on compute clusters
• Examples of related Hadoop projects:
1 – Avro™: a data serialization system
2 – Cassandra™: a scalable multi-master database with no single points of failure
3 – Chukwa™: a data collection system for managing large distributed systems
4 – HBase™: a scalable, distributed database that supports structured data storage for large tables
5 – Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying
6 – Mahout™: a scalable machine learning and data mining library
7 – Pig™: a high-level data-flow language and execution framework for parallel computation
8 – ZooKeeper™: a high-performance coordination service for distributed applications
HDFS Design Goals & Assumptions
Definition
• HDFS is an open-source (Apache) implementation inspired by the GFS design
• Similar goals and the same basic design as GFS
– Runs on commodity hardware
– Highly fault tolerant
– High throughput
– Designed for large data sets
– OK to relax some POSIX requirements
– Large-scale deployments
• An instance of HDFS may comprise 1000s of servers
• Each server stores part of the file system's data
• But
– No support for concurrent appends
HDFS Architecture

• Written in Java
• Master/slave architecture
• Single NameNode
– Master server responsible for the namespace & access control
• Multiple DataNodes
– Each is responsible for managing the storage attached to its node
• A file is split into one or more blocks
– Typical block size = 128 MB (vs. 64 MB for GFS)
– Blocks are stored on a set of DataNodes
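As a quick worked example (mine, not from the slides), here is how a 1 TB file splits into blocks under HDFS's typical 128 MB block size versus GFS's 64 MB:

```python
# Block-count comparison (assumptions mine): a 1 TB file under the two
# block sizes mentioned on the slide.
MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int) -> int:
    """Number of blocks for a file, rounding up for a partial last block."""
    return -(-file_size // block_size)  # ceiling division

one_tb = 1024 * 1024 * MB
print(split_into_blocks(one_tb, 128 * MB))  # 8192 blocks under HDFS
print(split_into_blocks(one_tb, 64 * MB))   # 16384 blocks under GFS
```

Either way, the NameNode tracks only thousands of block locations per huge file rather than the hundreds of millions a 4 KB block size would require.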
