
DHVSU LUBAO CAMPUS

Wireframe Documentation: Critical Study of Performance Parameters on Distributed File Systems using MapReduce

In Partial Fulfillment
of the Requirements for the Course of
Human Computer Interaction and SIA

John Bernard C. Tungol

April 10, 2021



Abstract

Enormous volumes of data are generated over the network every day. MapReduce is a parallel programming model well suited to processing such large data sets. The paper under review surveys several distributed file systems and distributed storage systems with respect to a number of parameters: fault tolerance, replication, checkpointing, security, and optimization of small-file access using MapReduce. The distributed file systems reviewed were Ceph, GlusterFS, HDFS, and Lustre, all of which are open source. Distributed file systems also play an important role in protecting application data in cloud computing. The authors argue that their approach is efficient and scalable because it is built on MapReduce.

Introduction and Discussions

The amount of data generated over the network grows every day, and the inability of traditional databases to process and store data at that scale is well documented. A distributed file system stores data on multiple nodes; to remove the single-server bottleneck, clients are allowed to access data in parallel from the storage nodes. Distributed file systems are therefore sometimes called network file systems. They provide persistent storage of unstructured data, organized in a hierarchical namespace of files that is shared among networked nodes. This data model and its interface to applications distinguish distributed file systems from other storage types such as databases. To applications, it should be transparent whether data resides on a distributed file system or on a local file system.

Motivation and Context



Even though the file system interface is general and fits a broad spectrum of applications, most distributed file system implementations are optimized for particular classes of applications, so there is often a need to survey and study the available systems. Use cases differ both qualitatively and quantitatively, and each one poses high requirements in only some of the dimensions. Taken together, however, they would require a distributed file system whose performance is outstanding in every dimension. Some requirements even contradict each other: a high level of redundancy (e.g., for recorded experiment data) inevitably reduces write throughput in cases where redundancy is not needed (e.g., for a scratch area). Moreover, the file system interface provides no standard way to specify quality-of-service properties for particular directories or files.

Related Work

Several published works are closely related to the paper under review. Among them are: (1) "Cassandra File System Over Hadoop Distributed File System" by Mr. Ashish A. Mutha and Miss Vaishali M. Deshmukh, both from PRMIT&R College, Amravati, India; and (2) "A Survey on Distributed File System Technology" by J. Blomer of Switzerland. These studies are well regarded and have been used as references by subsequent researchers.

Overview of Modelling Method

Fault Tolerance and Replication



Data is distributed over multiple machines, where network failures can occur, so a simple distributed file system with integrated fault tolerance is needed for efficient handling of small records of data. Fault-tolerant data storage is becoming more important as data moves to the cloud. In practice, fault tolerance is achieved by dividing a file into smaller chunks or fragments, which are processed and managed by a set of servers. Fault tolerance is an important aspect of cloud storage, where the robustness of data is a major concern.
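The sketch below is a generic illustration of this idea, not the method of the reviewed paper: a file is split into fixed-size chunks and each chunk is assigned to several servers. The chunk size, replication factor, and server names are assumptions made here for illustration.

    import os

    CHUNK_SIZE = 64 * 1024 * 1024   # assumed 64 MB chunks, as in many DFS designs
    REPLICAS = 3                    # assumed replication factor

    def split_into_chunks(path, chunk_size=CHUNK_SIZE):
        """Yield (index, bytes) pairs for each fixed-size chunk of the file."""
        with open(path, "rb") as f:
            index = 0
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                yield index, data
                index += 1

    def place_replicas(chunk_index, servers, replicas=REPLICAS):
        """Pick `replicas` distinct servers for a chunk (simple round-robin placement)."""
        return [servers[(chunk_index + i) % len(servers)] for i in range(replicas)]

    servers = ["node1", "node2", "node3", "node4"]   # hypothetical storage nodes
    if os.path.exists("example.dat"):                # hypothetical input file
        for idx, _ in split_into_chunks("example.dat"):
            print(idx, place_replicas(idx, servers))

Because every chunk exists on several servers, the loss of any single node leaves all chunks readable from the remaining replicas.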

Checkpointing

Checkpointing is an essential fault-tolerance mechanism adopted by long-running, data-intensive applications. At regular intervals, an application alternates between compute operations and checkpoint operations. Periodic checkpointing saves the state of the application and is combined with frequent checks of the cluster's health and the job's progress. Checkpoints can be written in various places: to free disk space in the cloud, to a central file server or parallel file system, or to a temporary buffer.
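A minimal sketch of this compute/checkpoint alternation follows; the interval, the compute_step() callback, and the local checkpoint directory are assumptions for illustration, not details from the reviewed paper.

    import json
    import os
    import time

    CHECKPOINT_INTERVAL = 60  # assumed interval in seconds between checkpoints

    def save_checkpoint(state, directory="checkpoints"):
        """Write the application state to a timestamped file."""
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "state-%d.json" % int(time.time()))
        with open(path, "w") as f:
            json.dump(state, f)
        return path

    def run(compute_step, state):
        """Alternate between compute operations and periodic checkpoint operations."""
        last_checkpoint = time.time()
        while not state.get("done"):
            compute_step(state)                        # hypothetical unit of work
            if time.time() - last_checkpoint >= CHECKPOINT_INTERVAL:
                save_checkpoint(state)                 # save state so the job can restart here
                last_checkpoint = time.time()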

Dealing with Small Files

In work on log-structured file systems, the authors claim that such a system improves small-file write performance by an order of magnitude while matching the performance of non-log-structured file systems for large files. Managing free space is an important issue: the problem is how to keep large extents of free space available for writing after many overwrites and deletions of small files. Object-based storage systems such as Ceph, in which data is organized and accessed as objects, struggle with workloads that access large numbers of small files, such as those produced by user workspaces and software. There are two reasons for this: loss of namespace locality at the storage devices and a per-file interaction with the metadata server.
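One common way to mitigate the small-file problem, sketched here as a generic technique rather than the approach of any specific system named above, is to pack many small files into one larger container file plus an index of offsets.

    import json

    def pack(small_files, container_path, index_path):
        """Concatenate small files into one container and record (offset, length) per name."""
        index = {}
        offset = 0
        with open(container_path, "wb") as out:
            for name, data in small_files.items():
                out.write(data)
                index[name] = (offset, len(data))
                offset += len(data)
        with open(index_path, "w") as f:
            json.dump(index, f)

    def read_packed(name, container_path, index_path):
        """Read one logical small file back out of the container."""
        with open(index_path) as f:
            index = json.load(f)
        offset, length = index[name]
        with open(container_path, "rb") as f:
            f.seek(offset)
            return f.read(length)

Packing trades per-file metadata operations for a single large object plus a cheap index lookup, which restores namespace locality on the storage devices.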

Security

Recent efforts recognize the importance of self-protection in big data management systems, but they focus mainly on correctness and data privacy. The growing popularity of storing analytics data and big data creates a need for efficient and secure data management mechanisms. One of the most relevant security topics for such big data is preventing users from damaging the stored data or from breaking data-access protocols and security policies. Securing large data sets requires techniques such as logging, encryption, and privacy protection. Because distributed file systems underpin cloud computing systems, their security issues and technologies are directly applicable to cloud computing. IBM researchers, for instance, recommend using Kerberos to secure the data environment.

Tool Support

The following tools support the study of distributed file systems using the MapReduce framework.

Ceph

Ceph is free, easy to use, and reliable, and it can manage vast amounts of data. Ceph delivers extraordinary scalability: thousands of clients can access petabytes to exabytes of data. A Ceph node runs smoothly on commodity hardware, and the system accommodates large numbers of nodes that communicate with each other to dynamically redistribute and replicate data. Placement policies can separate object replicas across different failure domains while still maintaining the desired distribution, using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm.
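The sketch below is not the CRUSH algorithm itself, only a simplified illustration of the idea it implements: replicas of an object are placed pseudo-randomly and deterministically, but never in the same failure domain. The cluster map and rack labels are hypothetical.

    import hashlib

    # Hypothetical cluster map: storage node -> failure domain (rack)
    CLUSTER = {
        "osd1": "rackA", "osd2": "rackA",
        "osd3": "rackB", "osd4": "rackB",
        "osd5": "rackC", "osd6": "rackC",
    }

    def place(object_name, replicas=3):
        """Deterministically choose `replicas` nodes in distinct failure domains."""
        ranked = sorted(
            CLUSTER,
            key=lambda node: hashlib.sha256((object_name + node).encode()).hexdigest(),
        )
        chosen, used_domains = [], set()
        for node in ranked:
            if CLUSTER[node] not in used_domains:
                chosen.append(node)
                used_domains.add(CLUSTER[node])
            if len(chosen) == replicas:
                break
        return chosen

    print(place("my-object"))   # the same object name always maps to the same placement

Because placement is computed from the object name and the cluster map, clients can locate replicas without consulting a central lookup table.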

GlusterFS

In GlusterFS, the elementary storage units are called bricks. A server can host more than one brick, and bricks store data through translators on top of lower-level file systems. GlusterFS distributes load using a distributed hash translator (DHT) that maps filenames to its subvolumes; subvolumes are replicated to provide fault tolerance and load handling in a scale-out distributed file system that supports thousands of clients.
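A minimal sketch of this filename-hashing idea, under the simplifying assumption of a fixed list of subvolumes (the real translator hashes into per-directory ranges, which is omitted here):

    import zlib

    SUBVOLUMES = ["replica-0", "replica-1", "replica-2"]   # hypothetical subvolumes

    def subvolume_for(filename):
        """Map a filename to a subvolume by hashing the name, not the file contents."""
        return SUBVOLUMES[zlib.crc32(filename.encode()) % len(SUBVOLUMES)]

    print(subvolume_for("results.csv"))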

Lustre

Lustre is a file system built for high-performance computing (HPC) and capable of processing big data. It is a cluster file system based on the client/server model. Lustre achieves high performance and scalability by separating metadata operations from normal data operations: data is stored on Object Storage Servers (OSSs), while metadata is stored on Metadata Servers (MDSs). Lustre and the Hadoop Distributed File System (HDFS) are similar in terms of performance and storage capabilities.

Industrial Case Study and Lessons Learned



Application Integrity

Some distributed file systems are implemented as an extension of the operating system kernel (e.g., NFS, AFS, Lustre). This can provide better performance than interposition systems, but deployment is difficult and implementation errors typically crash the operating system kernel. Distributed file systems do not fully comply with the POSIX file system standard, so each distributed file system needs to be tested with real applications. From the applications' point of view, there are different levels of integration a distributed file system can provide.

Decomposition

There is a tendency towards decomposition and modularization in distributed file systems. The grid, for instance, federates globally distributed cluster file systems, with the namespace controlled by the experiments' file catalogs in combination with grid middleware. Other examples are the offloading of authorization to Kerberos in AFS, the offloading of distributed consensus to Chubby in GFS (or ZooKeeper in HDFS), and the layered implementation of Ceph, with the independent RADOS key-value store as the building block beneath the file system.

File System Integrity

Cryptographic hashes of file content are often used to ensure data integrity. A cryptographic hash provides a short, constant-length, practically unique identifier for data of any size. Collisions are virtually impossible to produce, whether by chance or by clever crafting, which makes cryptographic hashes a means of protection against data tampering. Content addressing also results in immutable data, which simplifies cache consistency and eliminates the problem of detecting stale cache entries. Furthermore, redundant data and duplicated files are automatically de-duplicated, which in some use cases (backups, scientific software binaries) reduces the actual storage space used by a large factor.
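The following sketch shows, in a generic way not tied to any particular file system named above, how content hashes yield both integrity checking and de-duplication; the in-memory dictionary stands in for an object store.

    import hashlib

    store = {}  # maps content hash -> data (an in-memory stand-in for object storage)

    def put(data):
        """Store data under its SHA-256 digest; identical content is stored only once."""
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        return digest

    def get(digest):
        """Fetch data and verify it still matches its identifier (tamper detection)."""
        data = store[digest]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("data integrity violation")
        return data

    h1 = put(b"same bytes")
    h2 = put(b"same bytes")
    assert h1 == h2          # duplicate content de-duplicates to a single object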

Efficiency

Caching and file striping are standard techniques for improving the speed of distributed file systems. Caches can be located in memory, in flash memory, or on hard disks, and they are most often managed per file system node. Cache sizes need to be tuned manually according to the working-set size of applications. Co-operative caches between nodes in a local network have been discussed, but they are not implemented in today's production file systems. Dynamic workload adaptation is a technique used in the Ceph file system to change the metadata-to-metadata-server mapping based on server load.
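A minimal per-node cache sketch follows; as the text notes, the capacity is a manual assumption chosen to match the application's working set, and the block-level interface is illustrative only.

    from collections import OrderedDict

    class NodeCache:
        """Least-recently-used cache for file blocks, managed per file system node."""

        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks   # manually tuned to the working-set size
            self.blocks = OrderedDict()

        def get(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)   # mark as recently used
                return self.blocks[block_id]
            return None                             # caller must read from storage

        def put(self, block_id, data):
            self.blocks[block_id] = data
            self.blocks.move_to_end(block_id)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)     # evict the least recently used block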

Highlights

Title and Content

Critical Study of Performance Parameters on Distributed File Systems using MapReduce.

Handling Large Data - Both MapReduce and parallel DBMSs provide a means to process large volumes of data. As the volume of data captured continues to rise, questions have been asked about whether the parallel DBMS paradigm can scale to meet demand. Parallel DBMSs were developed to improve the performance of database systems, but as processor performance has improved it has outstripped disk throughput, and critics have predicted from time to time that the I/O bottleneck would become a major problem. MapReduce was designed to be inherently fault tolerant and to run on thousands of nodes.

Analytics - Algorithms in which the output of one subprocess is the input to the next are difficult to implement in SQL, and performing these tasks in many steps reduces the performance benefits gained from a parallel DBMS. Both MapReduce and parallel DBMSs can be used to produce analytical results from big data.

Replication - Replication and erasure codes are the techniques used to avoid data loss and to continue operation in the case of hardware failures. An engineering challenge is placing redundant data in such a way that the redundancy crosses multiple failure domains. While replication is simple and fast, it results in a large storage overhead.
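To make the overhead comparison concrete (the figures below are illustrative assumptions, not numbers from the reviewed paper): 3-way replication stores every byte three times, while a Reed-Solomon-style (k, m) erasure code stores k data fragments plus m parity fragments.

    def replication_overhead(copies):
        """Extra storage as a fraction of the raw data size."""
        return copies - 1            # e.g. 3 copies -> 2.0 (200% extra)

    def erasure_code_overhead(k, m):
        """Extra storage for a (k, m) code that tolerates the loss of m fragments."""
        return m / k                 # e.g. (10, 4) -> 0.4 (40% extra)

    print(replication_overhead(3))       # 2.0
    print(erasure_code_overhead(10, 4))  # 0.4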

Impact

• MapReduce is transparently scalable: it has no dependency on the underlying hardware, and the user does not need to manage data placement or the number of nodes used for a job.

• Because processing is independent, failover is trivial: a failed process can simply be restarted, provided the underlying file system is redundant, as HDFS is.

• Data flow is highly defined and moves in one direction, from the map to the reduce, with no communication between independent mapper or reducer processes (a minimal word-count sketch follows this list).

• MapReduce, though powerful, does not fit all problem types.
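The word-count sketch referenced above illustrates this one-directional map-to-reduce data flow. It is a single-process stand-in for the model; a real Hadoop job would distribute the mappers and reducers across nodes, and the input lines here are arbitrary examples.

    from collections import defaultdict

    def map_phase(lines):
        """Mapper: emit a (word, 1) pair for every word, with no shared state."""
        for line in lines:
            for word in line.split():
                yield word, 1

    def reduce_phase(pairs):
        """Reducer: sum the counts per key after an implicit shuffle/group step."""
        grouped = defaultdict(int)
        for word, count in pairs:
            grouped[word] += count
        return dict(grouped)

    lines = ["the quick brown fox", "the lazy dog"]
    print(reduce_phase(map_phase(lines)))   # {'the': 2, 'quick': 1, ...}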



Strengths

Ceph is unable to provide coherent file-chunk replicas and is therefore bandwidth limited. MapReduce is a computational paradigm in which an application is divided into many small fragments of work, each of which may be executed on any node in the cluster. The cluster can be an HPC cluster, or the client can be a node of a distributed file system or of a centralized system. The data is replicated, making it fault tolerant. Many researchers have written that MapReduce can run on GlusterFS and will give better performance than on HDFS.

Weaknesses

In the upcoming years, the computing landscape will move towards the exascale: data sets that routinely sum up to exabytes, and supercomputers that provide computing power in the exaflop range. While the gap between capacity and bandwidth has widened by one to two orders of magnitude over the last 20 years, the bandwidth of Ethernet networks has scaled at a similar pace to the capacity of hard drives. Raicu et al. predict the collapse of exaflop supercomputing applications due to the limited storage bandwidth and the architecture of today's distributed file systems. Experts suggest breaking the segregation between compute networks and storage networks and building distributed file systems accordingly. Integrating MapReduce on the distributed nodes or on an HPC cluster will be part of future study and implementation.

Conclusions

Distributed file systems provide a relatively well-defined and general-purpose interface for applications to use large-scale persistent storage. The implementation of a distributed file system, however, is always tailored to a particular class of applications. To conclude, MapReduce is a programming model for processing large data sets with a distributed, parallel algorithm on a cluster. Whatever the choice of backing file system or HPC cluster, and whatever the placement and data security concerns, MapReduce supports data availability. Comparative testing on a much larger and wider scale should, however, be undertaken in the future.

Storyboard

References

1. Sandberg R, Goldberg D, Kleiman S, Walsh D and Lyon B 1985 Proc. of the Summer USENIX Conference pp 119–130
2. Morris J H, Satyanarayanan M, Conner M H, Howard J H, Rosenthal D S H and Smith F D 1986 Communications of the ACM 29 184–201
3. Carns P H, Ligon III W B, Ross R B and Thakur R 2000 Proc. 4th Annual Linux Showcase and Conference (ALS'00) pp 317–328
4. Schmuck F and Haskin R 2002 Proc. 1st USENIX Conf. on File and Storage Technologies (FAST'02) pp 231–244
5. "Designing performance monitoring tool for NoSQL Cassandra distributed database", IEEE, 2012. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6360579&queryText%3DCASSANDRA
6. "Cassandra: flexible trust management, applied to electronic health records", IEEE. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1310738&queryText%3DCASSANDRA
7. David A. Patterson, Garth A. Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In: Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data; 1988, pp. 109–116.
8. T. Kosar, Data Intensive Distributed Computing: Challenges and Solutions for Large Scale Information Management, IGI Publications; 2011.
9. A. Silberschatz, P. Baer Galvin, G. Gagne, Operating Systems, John Wiley & Sons.
10. L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM; 1978.
11. J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate. PLFS: A Checkpoint Filesystem for Parallel Applications. Proc. SC'09; 2009.
