Rev. Lecture 1 PPT2

Uploaded by

Pro Fessor

Big Data Analysis (CP 420)

UNIT 1
Introduction – distributed file system
Lecture 1

Presented By:
Pooja Varshney
Assistant Professor, CEIT
Learning Objective
• Understand the architecture, purpose, and functionality of
distributed file systems in managing large-scale data.
• Learn the concept of Big Data, its significance, and how it
influences decision-making across industries.
• Identify the Four Vs (Volume, Velocity, Variety, Veracity) and
key drivers propelling the growth of Big Data.
• Explore analytics techniques and real-world applications of
Big Data in domains like healthcare, finance, and social media.
• Understand the MapReduce framework and implement
algorithms like Matrix-Vector Multiplication to process large
datasets efficiently.
Learning Outcome
• Understanding Distributed File Systems: Learners will be able to
explain the role and functionality of distributed file systems in
managing large datasets.
• Comprehending Big Data Concepts: Learners will articulate the
importance of Big Data and its impact on modern decision-making
processes.
• Analyzing the Four Vs and Drivers: Learners will evaluate the Four Vs of
Big Data and identify the key factors driving its growth and adoption.
• Applying Big Data Analytics: Learners will demonstrate the ability to
analyze Big Data and recognize its applications across various industries.
• Implementing MapReduce Algorithms: Learners will design and
execute algorithms, such as Matrix-Vector Multiplication, using the
MapReduce paradigm to solve complex data problems.
Distributed File System (DFS)

• A Distributed File System (DFS), as the name suggests, is a
file system that is distributed across multiple file servers
or multiple locations.
• It allows programs to access and store remote files just as
they do local ones, so users can reach files from any
computer on the network.
Purpose
• The main purpose of a DFS is to allow users of physically
distributed systems to share their data and resources
through a common file system.
• A typical configuration is a collection of workstations and
mainframes connected by a Local Area Network (LAN).
• A DFS is implemented as part of the operating system.
• In a DFS, a namespace is created, and this process is
transparent to the clients.
Components

• DFS has two components:
– Location transparency –
Location transparency is achieved through the
namespace component.
• DFS Namespaces enables you to group shared folders that
are located on different servers into one or more logically
structured namespaces.
– Each namespace appears to users as a single shared folder with a
series of subfolders.
– Redundancy –
Redundancy is achieved through the file replication
component.
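The namespace idea above can be sketched in a few lines of Python. This is only an illustrative toy model, not how any real DFS implements namespaces: the class name `Namespace`, its methods, and the server and share names in the usage note are all hypothetical.

```python
from typing import Dict, List


class Namespace:
    """Toy model of a DFS namespace: one logical root, many physical targets.

    Hypothetical illustration only; real DFS namespaces are managed by the
    operating system and not by a class like this.
    """

    def __init__(self, root: str) -> None:
        # Logical root that clients see, e.g. \\example.local\public
        self.root = root.rstrip("\\")
        # Logical folder name -> list of shared folders on different servers
        self.links: Dict[str, List[str]] = {}

    def add_folder(self, name: str, targets: List[str]) -> None:
        """Publish a logical folder backed by one or more server shares."""
        self.links[name.lower()] = list(targets)

    def resolve(self, logical_path: str) -> List[str]:
        """Translate a logical path into the physical paths that can serve it."""
        prefix = self.root + "\\"
        if not logical_path.lower().startswith(prefix.lower()):
            raise ValueError("path is outside this namespace")
        rest = logical_path[len(prefix):]
        folder, _, tail = rest.partition("\\")
        targets = self.links.get(folder.lower())
        if targets is None:
            raise KeyError(folder)
        # Every target is a candidate; more than one target gives redundancy.
        return [t + ("\\" + tail if tail else "") for t in targets]
```

For example, after `ns = Namespace(r"\\example.local\public")` and `ns.add_folder("Reports", [r"\\srv1\reports", r"\\srv2\reports_copy"])`, a client asks only for `\\example.local\public\Reports\...` and never needs to know which server holds the file.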
File system replication
• Early iterations of DFS made use of Microsoft’s File
Replication Service (FRS), which allowed for
straightforward file replication between servers. FRS
recognises new or updated files and distributes the most
recent version of the whole file to all servers.
• DFS Replication (DFSR) was introduced with Windows
Server 2003 R2. It improves on FRS by copying only the
portions of files that have changed and by minimising
network traffic with data compression. Additionally, it
provides users with flexible configuration options to
manage network traffic on a configurable schedule.
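To make the difference between whole-file and changed-portion replication concrete, here is a minimal sketch that compares two versions of a file in fixed-size blocks and ships only the blocks that differ. It is a simplification under stated assumptions and not the algorithm DFSR actually uses; the function names and the tiny `BLOCK_SIZE` are chosen purely for illustration.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to trace


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes,
                   block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block index: new block} for blocks that differ from the old copy."""
    old_hashes = block_hashes(old, block_size)
    delta: Dict[int, bytes] = {}
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        is_new = index >= len(old_hashes)
        if is_new or hashlib.sha256(block).hexdigest() != old_hashes[index]:
            delta[index] = block
    return delta


def apply_delta(old: bytes, delta: Dict[int, bytes], new_length: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new version on a replica from its old copy plus the delta."""
    blocks = [old[i:i + block_size] for i in range(0, new_length, block_size)]
    for index, block in delta.items():
        blocks[index] = block
    return b"".join(blocks)[:new_length]
```

With an FRS-style approach the whole new file would cross the network; here only the entries of `changed_blocks(old, new)` do, and the replica reconstructs the file with `apply_delta`.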
Features of DFS
• Transparency :
– Structure transparency –
The client does not need to know the number or locations of file
servers and storage devices. Multiple file servers should be
provided for performance, adaptability, and dependability.
– Access transparency –
Both local and remote files should be accessible in the same manner.
The file system should automatically locate the accessed file and
deliver it to the client’s side.
– Naming transparency –
The name of a file should give no hint of the file’s location.
Once a name is given to the file, it should not change when the file
is moved from one node to another.
– Replication transparency –
If a file is copied on multiple nodes, both the existence of the copies
and their locations should be hidden from clients on other nodes.
Features of DFS
• User mobility :
It will automatically bring the user’s home directory to the node where the user
logs in.
• Performance :
Performance is based on the average amount of time needed to satisfy
client requests.
– time = CPU time + time taken to access secondary storage + network access time
• Simplicity and ease of use :
The user interface of a file system should be simple and the number of
commands in the file system should be small.
• High availability :
A Distributed File System should be able to continue in case of any partial
failures like a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have different and
independent file servers for controlling different and independent storage
devices.
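The performance formula given above (time = CPU time + secondary storage access time + network access time) can be worked through with a short sketch. The function names and the sample timings in the usage note are assumptions made up for illustration, not measurements.

```python
from typing import Iterable, Tuple

# One request's timing: (cpu seconds, storage seconds, network seconds)
Timing = Tuple[float, float, float]


def response_time(cpu_s: float, storage_s: float, network_s: float) -> float:
    """Time to satisfy one client request, as the sum of its three parts."""
    return cpu_s + storage_s + network_s


def average_response_time(requests: Iterable[Timing]) -> float:
    """Average time needed to satisfy a batch of client requests."""
    timings = list(requests)
    return sum(response_time(*t) for t in timings) / len(timings)
```

For instance, a request with 2 ms of CPU time, 8 ms of disk access, and 15 ms of network time takes 25 ms in total. In this made-up case the network term dominates, which is why a DFS tries to keep remote accesses to a minimum.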
Features of DFS
• Scalability :
Service should not be substantially disrupted as the number of nodes and users grows.
• High reliability :
The likelihood of data loss should be minimized as much as feasible in a suitable distributed
file system.
• Data integrity :
Multiple users frequently share a file system. The integrity of data saved in a shared file
must be guaranteed by the file system. That is, concurrent access requests from many users
who are competing for access to the same file must be correctly synchronized using a
concurrency control method. Atomic transactions are a high-level concurrency management
mechanism for data integrity that is frequently offered to users by a file system.
• Security :
A distributed file system should be secure so that its users may trust that their data will be
kept private. To safeguard the information contained in the file system from unwanted &
unauthorized access, security mechanisms must be implemented.
• Heterogeneity :
Heterogeneity in distributed systems is unavoidable as a result of huge scale. Users of
heterogeneous distributed systems have the option of using multiple computer platforms
for different purposes.
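The data integrity requirement above, that concurrent requests to the same file must be synchronized, can be illustrated with a single-process sketch. It assumes a plain `threading.Lock` as a stand-in for a concurrency control mechanism; a real DFS would need distributed locking or transactions. The names `SharedFile` and `run_clients` are hypothetical.

```python
import threading
from typing import List


class SharedFile:
    """In-memory stand-in for a file shared by many clients."""

    def __init__(self) -> None:
        self._lines: List[str] = []
        self._lock = threading.Lock()

    def append_record(self, record: str) -> None:
        # The lock makes "read current length, then append" one atomic step,
        # so two clients can never be handed the same record number.
        with self._lock:
            next_id = len(self._lines) + 1
            self._lines.append(f"{next_id}:{record}")

    def snapshot(self) -> List[str]:
        with self._lock:
            return list(self._lines)


def run_clients(shared: SharedFile, n_clients: int = 8,
                writes_each: int = 100) -> None:
    """Simulate several clients writing to the same file at once."""

    def client(cid: int) -> None:
        for i in range(writes_each):
            shared.append_record(f"client{cid}-write{i}")

    threads = [threading.Thread(target=client, args=(c,))
               for c in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

After `run_clients(shared)` finishes, every record in `shared.snapshot()` carries a unique, consecutive number, which is the property the lock protects.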
Applications :
• NFS –
NFS stands for Network File System. It is a client-server architecture
that allows a computer user to view, store, and update files remotely.
The protocol of NFS is one of the several distributed file system
standards for Network-Attached Storage (NAS).
• CIFS –
CIFS stands for Common Internet File System. CIFS is a dialect of SMB.
That is, CIFS is an implementation of the SMB protocol, designed by Microsoft.
• SMB –
SMB stands for Server Message Block. It is a protocol for sharing files
and was invented by IBM. The SMB protocol was created to allow
computers to perform read and write operations on files on a remote
host over a Local Area Network (LAN). The directories on the
remote host that can be accessed via SMB are called “shares”.
• Hadoop –
Hadoop is a collection of open-source software utilities. It provides
a software framework for distributed storage and processing
of big data using the MapReduce programming model. The
core of Hadoop consists of a storage part, known as the Hadoop
Distributed File System (HDFS), and a processing part, which
is the MapReduce programming model.
• NetWare –
NetWare is a discontinued computer network operating system
developed by Novell, Inc. It primarily used cooperative
multitasking to run different services on a personal
computer, using the IPX network protocol.
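The MapReduce programming model mentioned under Hadoop can be sketched in plain Python using matrix-vector multiplication, the algorithm named in the learning objectives. This is not Hadoop code and uses no Hadoop API; the function names are illustrative, and the sketch assumes the vector `v` is small enough to be available to every map task.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Entry = Tuple[int, int, float]  # (row i, column j, value m_ij)


def map_phase(entries: Iterable[Entry],
              v: List[float]) -> Iterator[Tuple[int, float]]:
    """Map: each matrix entry m_ij emits the pair (i, m_ij * v_j)."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]


def shuffle(pairs: Iterable[Tuple[int, float]]) -> Dict[int, List[float]]:
    """Shuffle: group all emitted values by their key (the row index)."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[int, List[float]]) -> Dict[int, float]:
    """Reduce: sum the values for each row to get one output component."""
    return {i: sum(values) for i, values in groups.items()}


def matrix_vector(entries: Iterable[Entry], v: List[float]) -> Dict[int, float]:
    """Compute x = M v, where M is given as a list of (i, j, m_ij) entries."""
    return reduce_phase(shuffle(map_phase(entries, v)))
```

For the matrix [[1, 2], [3, 4]] and the vector [10, 1], the map phase emits (0, 10), (0, 2), (1, 30), (1, 4), and the reduce phase sums them per row to give 12 and 34. In a real cluster the entries would be read from blocks stored in HDFS and the map tasks would run in parallel on different nodes.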
Working of DFS :
There are two ways in which DFS can be implemented:
• Standalone DFS namespace – It allows only those DFS
roots that exist on the local computer and do not use
Active Directory.
– A standalone DFS can only be accessed on the computer on
which it is created.
– It does not provide any fault tolerance and cannot be linked to any
other DFS.
– Standalone DFS roots are rarely encountered because of their
limited advantage.
• Domain-based DFS namespace – It stores the configuration
of DFS in Active Directory, making the DFS namespace root
accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.
• Advantages :
– DFS allows multiple users to access or store the data.
– It allows the data to be shared remotely.
– It improves file availability, access time, and
network efficiency.
– It improves the capacity to change the size of the data
and the ability to exchange data.
– A Distributed File System provides transparency of data
even if a server or disk fails.
• Disadvantages :
– In a Distributed File System, nodes and connections need to
be secured, so security is a concern.
– Messages and data may be lost in the network while
moving from one node to another.
– Database connection in a Distributed File System is
complicated.
– Handling the database is also harder in a Distributed
File System than in a single-user system.
– Overloading may occur if all nodes try to send data
at once.
MCQs
• What is a distributed file system (DFS)?
A. A file system that manages data on a single server
B. A file system that distributes and manages data across multiple servers
C. A file system limited to local file storage
D. A system for real-time data processing

• Which of the following is a primary advantage of a DFS?

A. Reduced data redundancy
B. Centralized control
C. Fault tolerance and scalability
D. Faster single-node access

• Which of the following is an example of a distributed file system?

A. NTFS
B. HDFS
C. FAT32
D. ext4

• What is the main role of a NameNode in HDFS?

A. Storing all the data
B. Managing metadata and file system namespace
C. Processing client requests
D. Reducing data duplication
• What ensures data availability in a distributed file system?
A. Metadata storage
B. Replication of data across nodes
C. Single-node backups
D. Virtual memory management

• Which of the following best describes the term "fault tolerance" in a DFS?
A. Ability to detect errors in file transfers
B. Ability to continue functioning despite node failures
C. The process of replicating data
D. Ensuring fast write operations

• What is data locality in a DFS?

A. Storing data on a single server
B. Keeping computation close to the data location to reduce latency
C. Distributing data evenly across all nodes
D. Centralizing data in a specific region

• Which protocol is commonly used for communication in a DFS?

A. HTTP
B. TCP/IP
C. SMTP
D. FTP
• How does a DFS achieve scalability?
A. By using centralized control
B. By adding more nodes to the system
C. By limiting the number of users
D. By optimizing single-node performance

• What is the role of a DataNode in HDFS?

A. Managing metadata and namespace
B. Storing actual data blocks
C. Monitoring the health of the file system
D. Scheduling tasks across nodes

• Which of the following challenges does a DFS address?

A. Limited storage capacity of individual machines
B. High latency in data processing
C. Ensuring consistent backups
D. Restricting concurrent user access

• In a DFS, what is metadata used for?

A. Storing the actual data
B. Providing information about data locations and structure
C. Managing user permissions
D. Compressing the data for storage
• What is a "block" in the context of a DFS?
A. A complete file stored on a single node
B. A small unit of data stored across nodes
C. A virtual storage unit on the cloud
D. An encrypted data packet

• Which component in HDFS monitors the health of DataNodes?

A. DataNode
B. NameNode
C. Secondary NameNode
D. Client Node

• What does the term "striping" mean in a DFS?

A. Encrypting data across nodes
B. Storing parts of a file across multiple nodes
C. Compressing data for efficient storage
D. Assigning file permissions to users
Answers
1. Answer: B
2. Answer: C
3. Answer: B
4. Answer: B
5. Answer: B
6. Answer: B
7. Answer: B
8. Answer: B
9. Answer: B
10. Answer: B
11. Answer: A
12. Answer: B
13. Answer: B
14. Answer: B
15. Answer: B
Reference Links

• https://www.youtube.com/watch?v=c3loR2znLDI
• https://userweb.ucs.louisiana.edu/~vvr3254/CMPS598/Notes/Matrix-Vector%20Multiplication%20by%20MapReduce-v2.pdf?utm_source=chatgpt.com
• https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs
Reference Books
• "Hadoop: The Definitive Guide" by Tom White
• "Big Data: Principles and Best Practices of Scalable
Real-Time Data Systems" by Nathan Marz and James
Warren
• "Mining of Massive Datasets" by Jure Leskovec, Anand
Rajaraman, and Jeffrey Ullman
• "Big Data Analytics: From Strategic Planning to
Enterprise Integration with Tools, Techniques, NoSQL,
and Graph" by David Loshin
• "Data-Intensive Text Processing with MapReduce" by
Jimmy Lin and Chris Dyer
THANK YOU
