Rev. Lecture 1 PPT2

Uploaded by

Pro Fessor
Big Data Analysis (CP 420)

UNIT 1
Introduction – distributed file system
Lecture 1

Presented By:
Pooja Varshney
Assistant Professor, CEIT
Learning Objective
• Understand the architecture, purpose, and functionality of
distributed file systems in managing large-scale data.
• Learn the concept of Big Data, its significance, and how it
influences decision-making across industries.
• Identify the Four Vs (Volume, Velocity, Variety, Veracity) and
key drivers propelling the growth of Big Data.
• Explore analytics techniques and real-world applications of
Big Data in domains like healthcare, finance, and social media.
• Understand the MapReduce framework and implement
algorithms like Matrix-Vector Multiplication to process large
datasets efficiently.
Learning Outcome
• Understanding Distributed File Systems: Learners will be able to
explain the role and functionality of distributed file systems in
managing large datasets.
• Comprehending Big Data Concepts: Learners will articulate the
importance of Big Data and its impact on modern decision-making
processes.
• Analyzing the Four Vs and Drivers: Learners will evaluate the Four Vs of
Big Data and identify the key factors driving its growth and adoption.
• Applying Big Data Analytics: Learners will demonstrate the ability to
analyze Big Data and recognize its applications across various industries.
• Implementing MapReduce Algorithms: Learners will design and
execute algorithms, such as Matrix-Vector Multiplication, using the
MapReduce paradigm to solve complex data problems.
Distributed File System (DFS)

• A Distributed File System (DFS), as the name suggests, is a
file system that is distributed across multiple file servers or
multiple locations.
• It allows programs to access or store remote files as they do
local ones, so users can reach their files from any computer
on the network.
Purpose
• The main purpose of a DFS is to allow users of physically
distributed systems to share their data and resources by
using a common file system.
• A typical configuration is a collection of workstations and
mainframes connected by a Local Area Network (LAN).
• A DFS is implemented as part of the operating system.
• In a DFS, a namespace is created, and this process is
transparent to the clients.
Components

• DFS has two components:
– Location transparency –
Location transparency is achieved through the
namespace component.
• DFS Namespaces enables you to group shared folders that
are located on different servers into one or more logically
structured namespaces.
– Each namespace appears to users as a single shared folder with a
series of subfolders.
– Redundancy –
Redundancy is achieved through the file replication
component.
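The namespace idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `NAMESPACE`, the function `resolve`, and every server and folder name are hypothetical, and a real DFS namespace service does considerably more.

```python
# Hypothetical namespace table: one logical folder maps to one or more
# physical shared folders (targets) on different servers.
NAMESPACE = {
    r"\\corp\public\reports": [r"\\server1\reports", r"\\server2\reports"],
    r"\\corp\public\tools": [r"\\server3\tools"],
}


def resolve(logical_path, is_up=lambda target: True):
    """Translate a logical path into the first reachable physical path."""
    # Check the longest namespace folders first so nested folders win.
    for folder in sorted(NAMESPACE, key=len, reverse=True):
        if logical_path == folder or logical_path.startswith(folder + "\\"):
            remainder = logical_path[len(folder):]
            for target in NAMESPACE[folder]:
                if is_up(target):  # skip targets that are unavailable
                    return target + remainder
            raise OSError("no reachable target for " + folder)
    raise FileNotFoundError(logical_path)
```

The user only ever sees the logical path. Having two targets behind the reports folder is what lets the namespace component work together with the replication component: if the first server is down, the same logical path resolves to the second.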
File system replication
• Early versions of DFS used Microsoft’s File
Replication Service (FRS), which provided
straightforward file replication between servers. FRS
identifies new or updated files and distributes the
latest version of the whole file to all servers.
• “DFS Replication” (DFSR) was introduced in Windows Server
2003 R2. It improves on FRS by copying only the portions
of files that have changed and by reducing network traffic
with data compression. It also gives users flexible
configuration options to manage network traffic on a
configurable schedule.
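The difference between copying a whole file and copying only its changed portions can be illustrated with a simplified sketch. It compares fixed-size blocks by hash, which is an assumption made for readability and not the algorithm DFSR actually uses. The names `block_hashes`, `changed_blocks`, and `apply_delta` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return {block index: new bytes} for blocks that differ from the old copy."""
    old_hashes = block_hashes(old, block_size)
    delta = {}
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_hashes) or old_hashes[index] != digest:
            delta[index] = block
    return delta


def apply_delta(old, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new file on the receiver from its old copy plus the delta."""
    blocks = [old[i:i + block_size] for i in range(0, len(old), block_size)]
    for index, block in delta.items():
        while len(blocks) <= index:
            blocks.append(b"")
        blocks[index] = block
    return b"".join(blocks)[:new_length]
```

For a 12-byte file in which one byte changes, only one 4-byte block is sent instead of the whole file. A whole-file scheme such as the one described for FRS would send all 12 bytes.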
Features of DFS
• Transparency :
– Structure transparency –
There is no need for the client to know the number or locations
of file servers and storage devices. Multiple file servers should be
provided for performance, adaptability, and dependability.
– Access transparency –
Both local and remote files should be accessible in the same manner.
The file system should automatically locate the accessed file and
deliver it to the client.
– Naming transparency –
The name of a file should give no hint of the file's location.
Once a name is given to a file, it should not change when the file
is transferred from one node to another.
– Replication transparency –
If a file is copied to multiple nodes, the existence of the copies
and their locations should be hidden from clients.
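Replication transparency can be illustrated with a toy in-memory store. The class `ReplicatedStore` and the node names are hypothetical, and placing copies on the first few nodes is an arbitrary choice made only for this sketch.

```python
class ReplicatedStore:
    """Toy store: each file is copied to several nodes behind one interface."""

    def __init__(self, nodes):
        self.nodes = {name: {} for name in nodes}  # node -> {file name: bytes}
        self.down = set()                          # nodes currently unavailable

    def write(self, filename, data, copies=2):
        # Place copies on the first `copies` nodes; the caller never sees which.
        for node in list(self.nodes)[:copies]:
            self.nodes[node][filename] = data

    def read(self, filename):
        # Try each node in turn, skipping failed ones.
        for node, files in self.nodes.items():
            if node not in self.down and filename in files:
                return files[filename]
        raise FileNotFoundError(filename)
```

A client calls `read("a.txt")` with only the file name. It cannot tell how many copies exist or which node served the request, and the call still succeeds if one of the nodes holding a copy is marked as down.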
Features of DFS
• User mobility :
The system automatically brings the user’s home directory to the node where the user
logs in.
• Performance :
Performance is measured by the average amount of time needed to satisfy
client requests.
– time = CPU time + time taken to access secondary storage + network access time
• Simplicity and ease of use :
The user interface of a file system should be simple, and the number of
commands should be small.
• High availability :
A Distributed File System should be able to continue operating in case of partial
failures such as a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have different and
independent file servers controlling different and independent storage
devices.
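The response-time formula listed under Performance can be worked through with made-up numbers. The millisecond values below are purely illustrative assumptions, not measurements, and `response_time_ms` is a hypothetical helper.

```python
def response_time_ms(cpu_ms, storage_ms, network_ms):
    """time = CPU time + secondary storage access time + network access time."""
    return cpu_ms + storage_ms + network_ms


# Illustrative values only: the same request served locally and remotely.
local = response_time_ms(cpu_ms=2, storage_ms=8, network_ms=0)
remote = response_time_ms(cpu_ms=2, storage_ms=8, network_ms=5)
print(local, remote)  # 10 15
```

The only term a distributed file system adds over a local one is the network access time, so that is the term a design has to keep small for remote access to feel comparable to local access.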
Features of DFS
• Scalability :
Service should not be substantially disrupted as the number of nodes and users grows.
• High reliability :
The likelihood of data loss should be minimized as much as feasible in a suitable distributed
file system.
• Data integrity :
Multiple users frequently share a file system. The integrity of data saved in a shared file
must be guaranteed by the file system. That is, concurrent access requests from many users
who are competing for access to the same file must be correctly synchronized using a
concurrency control method. Atomic transactions are a high-level concurrency management
mechanism for data integrity that is frequently offered to users by a file system.
• Security :
A distributed file system should be secure so that its users can trust that their data will be
kept private. Security mechanisms must be implemented to safeguard the information in the
file system from unwanted and unauthorized access.
• Heterogeneity :
Heterogeneity in distributed systems is unavoidable as a result of their large scale. Users of
heterogeneous distributed systems have the option of using different computer platforms
for different purposes.
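The synchronization requirement described under Data integrity can be sketched with threads standing in for competing users. This is a single-process illustration only: a `threading.Lock` is assumed here in place of whatever distributed concurrency control a real file system provides, and `SharedFile` is a hypothetical class.

```python
import threading


class SharedFile:
    """Toy shared file whose appends are serialized by a lock."""

    def __init__(self):
        self.lines = []
        self._lock = threading.Lock()

    def append_record(self, record):
        # Reading the current length and appending must happen as one step,
        # otherwise two users could be handed the same position.
        with self._lock:
            position = len(self.lines)
            self.lines.append(f"{position}:{record}")


def worker(shared, user, count):
    for i in range(count):
        shared.append_record(f"{user}-{i}")


shared = SharedFile()
threads = [threading.Thread(target=worker, args=(shared, f"user{n}", 100))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After all four workers finish, the file holds 400 records with positions 0 to 399 in order and no duplicates. Atomic transactions, mentioned above, offer a higher-level guarantee than this single lock and are not shown here.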
Applications :
• NFS –
NFS stands for Network File System. It is a client-server architecture
that allows a computer user to view, store, and update files remotely.
The NFS protocol is one of several distributed file system
standards for Network-Attached Storage (NAS).
• CIFS –
CIFS stands for Common Internet File System. CIFS is a dialect of SMB.
That is, CIFS is an implementation of the SMB protocol, designed by Microsoft.
• SMB –
SMB stands for Server Message Block. It is a file-sharing protocol
that was invented by IBM. The SMB protocol was created to allow
computers to perform read and write operations on files on a remote
host over a Local Area Network (LAN). The directories on the
remote host that can be accessed via SMB are called “shares”.
• Hadoop –
Hadoop is a collection of open-source software utilities. It provides
a software framework for distributed storage and processing
of big data using the MapReduce programming model. The
core of Hadoop consists of a storage part, known as the Hadoop
Distributed File System (HDFS), and a processing part, which
is the MapReduce programming model.
• NetWare –
NetWare is a discontinued computer network operating system
developed by Novell, Inc. It primarily used cooperative
multitasking to run different services on a personal
computer, using the IPX network protocol.
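The MapReduce model mentioned under Hadoop can be illustrated with the matrix-vector multiplication named in the learning objectives. The sketch below runs in a single Python process and only imitates the map, shuffle, and reduce steps; it is not Hadoop code. It assumes the vector is small enough for every mapper to hold in memory, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """For each matrix entry (i, j, m_ij), emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * vector[j]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the partial products for each row to get one output element."""
    return {row: sum(values) for row, values in groups.items()}


# The 2x2 matrix [[1, 2], [3, 4]] stored as (row, column, value) triples.
matrix = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
vector = [5, 6]
result = reduce_phase(shuffle(map_phase(matrix, vector)))
print(result)  # {0: 17, 1: 39}
```

Each map call needs only one matrix entry and the vector, so the matrix can be split across many machines. The reduce step for a row needs only the partial products that share that row's key.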
Working of DFS :
There are two ways in which DFS can be implemented:
• Standalone DFS namespace – It allows only DFS
roots that exist on the local computer and do not use
Active Directory.
– A standalone DFS can only be accessed on the computer on
which it is created.
– It does not provide any fault tolerance and cannot be linked to any
other DFS.
– Standalone DFS roots are rarely encountered because of their
limited advantages.
• Domain-based DFS namespace – It stores the configuration
of DFS in Active Directory, making the DFS namespace root
accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>
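The two path forms for a domain-based namespace root can be put together as follows. The helper `dfs_root_path` and the domain and root names in the usage note are hypothetical examples, not values from any real deployment.

```python
def dfs_root_path(domain, dfs_root):
    """Build the UNC path \\\\<domain>\\<dfsroot> for a domain-based namespace."""
    # "\\\\" is two literal backslashes and "\\" is one.
    return "\\\\" + domain + "\\" + dfs_root
```

Calling `dfs_root_path("CORP", "Public")` gives the short domain-name form, and `dfs_root_path("corp.example.com", "Public")` gives the FQDN form. Both name the same namespace root.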
• Advantages :
– DFS allows multiple users to access or store data.
– It allows data to be shared remotely.
– It improves file availability, access time, and
network efficiency.
– It improves the capacity to change the size of the data
and the ability to exchange data.
– A Distributed File System keeps data transparently
available even if a server or disk fails.
• Disadvantages :
– In a Distributed File System, nodes and connections need to
be secured, so security is a concern.
– Messages and data may be lost in the
network while moving from one node to another.
– Connecting to a database is complicated in a
Distributed File System.
– Handling the database is also harder in a Distributed
File System than in a single-user system.
– Overloading may occur if all
nodes try to send data at once.
MCQs
• What is a distributed file system (DFS)?
A. A file system that manages data on a single server
B. A file system that distributes and manages data across multiple servers
C. A file system limited to local file storage
D. A system for real-time data processing

• Which of the following is a primary advantage of a DFS?
A. Reduced data redundancy
B. Centralized control
C. Fault tolerance and scalability
D. Faster single-node access

• Which of the following is an example of a distributed file system?
A. NTFS
B. HDFS
C. FAT32
D. ext4

• What is the main role of a NameNode in HDFS?
A. Storing all the data
B. Managing metadata and file system namespace
C. Processing client requests
D. Reducing data duplication
• What ensures data availability in a distributed file system?
A. Metadata storage
B. Replication of data across nodes
C. Single-node backups
D. Virtual memory management

• Which of the following best describes the term "fault tolerance" in a DFS?
A. Ability to detect errors in file transfers
B. Ability to continue functioning despite node failures
C. The process of replicating data
D. Ensuring fast write operations

• What is data locality in a DFS?
A. Storing data on a single server
B. Keeping computation close to the data location to reduce latency
C. Distributing data evenly across all nodes
D. Centralizing data in a specific region

• Which protocol is commonly used for communication in a DFS?
A. HTTP
B. TCP/IP
C. SMTP
D. FTP
• How does a DFS achieve scalability?
A. By using centralized control
B. By adding more nodes to the system
C. By limiting the number of users
D. By optimizing single-node performance

• What is the role of a DataNode in HDFS?
A. Managing metadata and namespace
B. Storing actual data blocks
C. Monitoring the health of the file system
D. Scheduling tasks across nodes

• Which of the following challenges does a DFS address?
A. Limited storage capacity of individual machines
B. High latency in data processing
C. Ensuring consistent backups
D. Restricting concurrent user access

• In a DFS, what is metadata used for?
A. Storing the actual data
B. Providing information about data locations and structure
C. Managing user permissions
D. Compressing the data for storage
• What is a "block" in the context of a DFS?
A. A complete file stored on a single node
B. A small unit of data stored across nodes
C. A virtual storage unit on the cloud
D. An encrypted data packet

• Which component in HDFS monitors the health of DataNodes?
A. DataNode
B. NameNode
C. Secondary NameNode
D. Client Node

• What does the term "striping" mean in a DFS?
A. Encrypting data across nodes
B. Storing parts of a file across multiple nodes
C. Compressing data for efficient storage
D. Assigning file permissions to users
Answers
1. Answer: B
2. Answer: C
3. Answer: B
4. Answer: B
5. Answer: B
6. Answer: B
7. Answer: B
8. Answer: B
9. Answer: B
10. Answer: B
11. Answer: A
12. Answer: B
13. Answer: B
14. Answer: B
15. Answer: B
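Several of the questions above refer to blocks, metadata, the NameNode, and DataNodes. The toy model below shows how those pieces relate. It is a sketch under stated assumptions: `ToyCluster` is hypothetical, the round-robin placement is chosen only for simplicity and is not HDFS's actual placement policy, and real block sizes are far larger than the 4 bytes used here.

```python
def split_into_blocks(data, block_size):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyCluster:
    def __init__(self, datanodes, replication=2):
        # DataNode role: each node stores the actual block contents.
        self.datanodes = {name: {} for name in datanodes}
        # NameNode role: metadata only, file -> [(block id, [holding nodes])].
        self.metadata = {}
        self.replication = replication

    def put(self, filename, data, block_size=4):
        names = list(self.datanodes)
        entries = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            block_id = f"{filename}#{index}"
            # Assumed round-robin placement with `replication` copies per block.
            holders = [names[(index + r) % len(names)]
                       for r in range(self.replication)]
            for node in holders:
                self.datanodes[node][block_id] = block
            entries.append((block_id, holders))
        self.metadata[filename] = entries

    def get(self, filename):
        # Look up block locations in the metadata, then read from a holder.
        return b"".join(self.datanodes[holders[0]][block_id]
                        for block_id, holders in self.metadata[filename])
```

Storing a 10-byte file on three nodes produces three blocks, each held by two nodes. The metadata never contains file contents, only block identifiers and locations, which is the division of labor that questions 4, 10, and 12 describe.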
Reference Links

• https://www.youtube.com/watch?v=c3loR2znLDI
• https://userweb.ucs.louisiana.edu/~vvr3254/CMPS598/Notes/Matrix-Vector%20Multiplication%20by%20MapReduce-v2.pdf?utm_source=chatgpt.com
• https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs
Reference Books
• "Hadoop: The Definitive Guide" by Tom White
• "Big Data: Principles and Best Practices of Scalable
Real-Time Data Systems" by Nathan Marz and James
Warren
• "Mining of Massive Datasets" by Jure Leskovec, Anand
Rajaraman, and Jeffrey Ullman
• "Big Data Analytics: From Strategic Planning to
Enterprise Integration with Tools, Techniques, NoSQL,
and Graph" by David Loshin
• "Data-Intensive Text Processing with MapReduce" by
Jimmy Lin and Chris Dyer
THANK YOU
