Critical Study of Performance Parameters on Distributed File Systems
Madhavi Vaidya, Shrinivas Deshpande
Procedia Computer Science 78 (2016) 224 – 232
International Conference on Information Security & Privacy (ICISP2015), 11-12 December 2015,
Nagpur, INDIA
Abstract
The amount of data generated by networked applications is growing every day. MapReduce is a promising parallel programming model for processing such large data sets. In this paper we survey several distributed storage and computation systems. We study parameters such as fault tolerance, replication, checkpointing, security and the optimization of small-file access using MapReduce, and we review the open-source distributed file systems GlusterFS, Lustre, Ceph and HDFS. Cloud computing, which builds on distributed file systems, plays a very important role in protecting applications' data and the related infrastructure with the help of policies, technologies, controls and big data tools. Based on our study we propose that MapReduce is an efficient and scalable programming platform for data processing that provides computational capabilities and distributed storage on clusters of commodity hardware.
© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of organizing committee of the ICISP2015.
Keywords: Lustre; Ceph; MapReduce; Small Files; HDFS; Security; Replication
1. Introduction
The amount of data generated by networked applications is growing every day. In the face of such massive data, the inadequacy of traditional centralized relational databases for processing and storage becomes apparent. A distributed file system (DFS) stores data on multiple nodes and removes this bottleneck by allowing clients to access data in parallel from the storage nodes. Significant challenges for such a distributed file system are scaling to a large number of storage nodes and providing reasonably degraded operation when hardware failures occur1. There is often a need to study and survey such distributed systems; in this paper we survey several distributed storage systems.
Furthermore, DFSs are sometimes called network, cluster, grid, cloud or parallel file systems, among other names; several of them have been studied in this paper. Cloud computing platforms likewise rely on distributed file systems to handle terabytes and even petabytes of data2. The Ceph file system, Gluster file system, Lustre file system and finally HDFS, all of which can use MapReduce for data processing, are elaborated here. The paper also discusses security and privacy issues of the various distributed file systems. Other file systems such as Quantcast and NoFS, which also support MapReduce extensively, are part of our research but are not studied here because of space constraints. The above four file systems were selected after a literature survey, as part of our research on distributed file systems, for their popularity in usage. For all of these file systems we compare Parallel DBMS and MapReduce and discuss which mechanism processes data more efficiently. The paper is organized as follows. First, by reviewing the research, the parameters fault tolerance, replication and checkpointing, dealing with small files, and security are discussed. In the later part of the paper, on the basis of the said parameters, the two mechanisms Parallel DBMS and MapReduce are briefly compared. Towards the end of the paper, the file systems are compared on the basis of the use of MapReduce on each DFS, and finally on the basis of the said parameters.
2. Study of Parameters
2.1. Fault Tolerance and Replication

A simple distributed file system with integrated fault tolerance is needed for the efficient handling of small data records. Data is distributed over multiple machines, where there are chances of network failures. Fault tolerance is achieved through the division of a file into smaller fragments, or chunks, which are managed and processed by a set of servers. Fault-tolerant data storage is becoming popular as data moves to the cloud, where robustness of data is a major concern3. The question, then, is which criteria can be used to compare various fault-tolerant distributed file systems. These criteria include, but are not limited to, (1) the ability to recover from the concurrent loss of multiple machines, (2) the ability to handle interruptions during a read or write operation on a file, and (3) the time taken to recover and to resynchronize a node after a network failure4. Data replication is used to prevent data availability problems. In the face of node failures, techniques such as mirrored disks, data clustering, disk arrays with redundant check information and chained declustering are used5. If one node is down, the table partitions stored on that node can be retrieved from other nodes. Several techniques have been proposed for data availability. File chunks are stored on servers in two groups, write and replica. The former group is composed of servers in charge of receiving or providing fragments (chunks) of a file and is capable of both read and write operations. The latter group is composed of replicas of the first group and allows only read operations on file chunks6. MapReduce avoids the data transfer and checkpointing overhead of traditional MPI applications: it keeps multiple replicas of each data block and re-executes failed tasks when failures occur7.
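As a rough illustration of the replication idea described above, the following minimal Python sketch stores each file chunk on a primary "write" node and on replica nodes, and falls back to a replica when the primary is down. The node names, the replication factor and the Cluster class are hypothetical and serve only to illustrate the technique; they are not part of any of the surveyed systems.

```python
import hashlib

class Cluster:
    """Toy cluster: chunks are written to a primary node and mirrored to replicas."""

    def __init__(self, nodes, replication=3):
        self.nodes = list(nodes)                  # hypothetical node names
        self.replication = replication
        self.storage = {n: {} for n in self.nodes}  # node -> {chunk_id: data}
        self.down = set()                         # nodes currently failed

    def _placement(self, chunk_id):
        # Deterministic placement: hash the chunk id, pick consecutive nodes.
        start = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.replication)]

    def write_chunk(self, chunk_id, data):
        for node in self._placement(chunk_id):    # primary plus replicas
            self.storage[node][chunk_id] = data

    def read_chunk(self, chunk_id):
        for node in self._placement(chunk_id):    # try primary, then replicas
            if node not in self.down and chunk_id in self.storage[node]:
                return self.storage[node][chunk_id]
        raise IOError("all replicas of %s are unavailable" % chunk_id)

cluster = Cluster(["n1", "n2", "n3", "n4"], replication=3)
cluster.write_chunk("file.txt#0", b"hello")
cluster.down.add(cluster._placement("file.txt#0")[0])  # simulate a primary failure
print(cluster.read_chunk("file.txt#0"))                # still served by a replica
```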
2.2. Checkpointing
Checkpointing is an indispensable fault tolerance mechanism adopted by long-running data-intensive applications. At regular intervals, applications alternate between two phases, compute and checkpoint. Checkpointing is a write-intensive I/O operation: checkpoint data is typically written once and read only if a failure occurs8. Checkpointing is generally done either on local-node storage or on a shared file system. Locally stored data, however, is lost when a node crashes. Alternatively, compute nodes can checkpoint to a shared, central file server, but as stated in9, shared file servers become crowded with I/O requests and have limited space; hundreds of nodes in a cluster checkpointing simultaneously can flood the central server with I/O operations. Parallel file systems such as GlusterFS and Lustre can offer high I/O throughput for this purpose. Some authors have observed that checkpointing, though critical for keeping data reliably intact, can become an overhead relative to the time spent on useful computation. Checkpointing is done in various ways: on a central file server, on a parallel file system, or in a cloud of free disk space or temporary buffers. Periodic checkpointing saves the state of the application; the health of the cluster is verified frequently to check progress, and if a failure occurs the process can be restarted from the most recent checkpoint. Because multiple asynchronous checkpoints make it difficult to reconstruct a consistent image, HPC applications checkpoint synchronously (i.e. following a barrier). Synchronous checkpointing, however, merely shifts the burden to the parallel file system, which must coordinate simultaneous access from many compute nodes, and this again increases the chances of failure10. stdchk11 saves checkpoints into a cloud of free disk space or temporary buffers12 aggregated from the workstations that make up a cluster. Diskless checkpointing13 saves checkpoints into compute-node memory rather than transferring them to persistent storage; fault protection is achieved by erasure coding the checkpoint image across distributed nodes. Large parallel HPC applications jealously utilize all available memory and demand a high degree of determinism in order to avoid jitter14, and so are poorly served by techniques that reduce available memory or execute background processes during computation. MapReduce relies on a pre-defined timeout rather than explicit checkpoint coordination to detect failed nodes: if a node does not respond within the specified timeout period, it is asserted to be dead. MapReduce executes in three stages: (1) intermediate results are saved to local storage as soon as a Map task finishes; (2) the local results are transferred to the Reduce tasks in the shuffle stage; (3) the final results are saved in the distributed file system once the Reduce tasks are done15.
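The following minimal Python sketch illustrates the periodic checkpoint-and-restart idea described above: application state is written to stable storage at a fixed interval, and after a crash the computation resumes from the most recent checkpoint instead of from the beginning. The file name, interval and loop structure are illustrative assumptions, not taken from any of the cited systems.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"          # hypothetical checkpoint file
INTERVAL = 1000                    # checkpoint every 1000 iterations (assumed)

def load_checkpoint():
    """Resume from the most recent checkpoint if one exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "total": 0}          # fresh start

def save_checkpoint(state):
    """Write-once checkpoint; read back only after a failure."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)                  # atomic rename avoids torn checkpoints

state = load_checkpoint()
while state["iteration"] < 10_000:               # the "compute" phase
    state["total"] += state["iteration"]
    state["iteration"] += 1
    if state["iteration"] % INTERVAL == 0:       # the "checkpoint" phase
        save_checkpoint(state)

print(state["total"])
```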
2.3. Dealing with Small Files

Object-based storage systems like Ceph, in which data is organized and accessed as objects, struggle with workloads that access large numbers of small files, such as those produced by software builds and user workspaces. There are two reasons for this: the interaction of each file with the metadata server and the loss of namespace locality at the storage devices. For a client accessing the contents of a large file, the metadata interaction is usually a minor overhead amortized over many data accesses. For a small file, on the other hand, there can be as few as one data access; since accessing each file's data requires first interacting with the metadata server, client latency is effectively doubled and the metadata server can be asked to service as many requests as all the storage devices combined16. Accessing a large number of small files on a parallel file system thus shifts the I/O challenge17 from providing high aggregate I/O throughput to supporting a large number of highly concurrent small accesses. The authors of18, a log-structured file system, claim that such a system exhibits a performance increase of an order of magnitude for small-file writes while matching the performance of contemporary non-log-structured file systems for large files. Managing free space is an important issue in such a design, since the problem is how to make large extents of free space available for writing after many deletions and overwrites of small files.
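To make the latency argument concrete, the short sketch below estimates request latency for a large and a small file under the assumption that every open requires one round trip to the metadata server and every data access one round trip to a storage device. The round-trip values are made-up numbers used purely for illustration.

```python
METADATA_RTT = 0.5    # ms per metadata-server interaction (assumed value)
DATA_RTT = 0.5        # ms per data access to a storage device (assumed value)

def access_latency(data_accesses):
    """One metadata interaction plus the given number of data accesses."""
    return METADATA_RTT + data_accesses * DATA_RTT

large = access_latency(1000)   # large file: metadata cost amortized over many accesses
small = access_latency(1)      # small file: metadata cost roughly doubles latency

print(f"large file: {large:.1f} ms ({METADATA_RTT / large:.1%} metadata overhead)")
print(f"small file: {small:.1f} ms ({METADATA_RTT / small:.1%} metadata overhead)")
```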
2.4. Security
Airavat is the first system to integrate mandatory access control with differential privacy, enabling many privacy-preserving MapReduce computations without the need to audit untrusted code. Recent efforts recognize the importance of self-protection of such big data management systems, but they mainly focus on data privacy and correctness19. Airavat is a relevant example of preventing unauthorized access to storage20: it integrates decentralized information flow control with differential privacy techniques and provides rigorous security control over individual data in MapReduce computations21. The increasing popularity of storing big data and running analytics over it creates a need for secure and efficient data management mechanisms. One of the most relevant security topics for such big data is preventing users from damaging the stored data or from breaking security policies and data-access protocols.
For the security of large data sets, techniques such as logging, privacy techniques and encryption are necessary. IBM researchers have also explained that query processing using MapReduce should take place in a secured environment using Kerberos. Kerberos is an authentication system developed at MIT. It uses encryption together with a trusted third party, an arbitrator, to perform secure authentication on an open network22. More specifically, Kerberos uses cryptographic tickets to avoid transmitting plain-text passwords over the wire; it is based on the Needham-Schroeder protocol. Since distributed file systems underpin cloud computing systems, these security issues and technologies, including Kerberos, are applicable to cloud computing as well23,24.
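As a toy, heavily simplified illustration of ticket-based authentication in the spirit of Kerberos, the Python sketch below has a trusted third party issue a time-limited ticket that only the target service can verify, so no plain-text password ever crosses the wire. HMAC tags stand in for real encryption here, and the principal names, lifetime and message layout are invented for illustration; this is not the actual Kerberos protocol.

```python
import hmac, hashlib, os, time

# Toy "KDC": shares a long-term secret with each principal (secrets never sent over the wire).
SECRETS = {"alice": os.urandom(16), "fileserver": os.urandom(16)}

def token(key, message):
    """Stand-in for encryption: an HMAC tag binding the message to a shared secret."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def kdc_issue_ticket(client, service):
    """Trusted third party issues a time-limited ticket for client -> service."""
    expiry = str(int(time.time()) + 300).encode()            # 5-minute lifetime (assumed)
    payload = client.encode() + b"|" + expiry
    return (payload, token(SECRETS[service], payload))        # only the service can verify it

def service_verify(service, ticket):
    payload, tag = ticket
    if not hmac.compare_digest(tag, token(SECRETS[service], payload)):
        return False                                          # forged or tampered ticket
    _client, expiry = payload.split(b"|")
    return int(expiry) >= time.time()                         # reject expired tickets

ticket = kdc_issue_ticket("alice", "fileserver")
print(service_verify("fileserver", ticket))                   # True: no password crossed the wire
```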
3. MapReduce and Parallel DBMS

We have chosen two mechanisms for analyzing the said parameters. The two approaches studied here are MapReduce programming and parallel databases, both of which handle large data and perform analytics; they are examined against the parameters of fault tolerance, replication, checkpointing, dealing with small files, and security.
MapReduce is a framework for easily writing applications that process large amounts of data in parallel on clusters of compute nodes. Generally, in a MapReduce environment, the compute and storage nodes are the same: computational tasks run on the same set of nodes that permanently hold the data required for the computations. The MapReduce algorithm breaks input data into a number of limited-size splits that are processed in parallel. It converts the data in each split into a group of intermediate key-value pairs in a set of Map tasks. It then collects each key's values (a key may have multiple values) in what is called the "shuffle" stage, and processes the combined key/values into output data via a set of Reduce tasks25.
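A minimal in-memory sketch of the map, shuffle and reduce flow described above, using the classic word-count example; the function names and the single-process execution are illustrative simplifications of what a real MapReduce framework distributes across a cluster.

```python
from collections import defaultdict

def map_task(split):
    """Map: emit an intermediate (key, value) pair for every word in the split."""
    return [(word, 1) for word in split.split()]

def shuffle(intermediate_pairs):
    """Shuffle: group all values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: combine all values of one key into the final output."""
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog", "the fox"]   # limited-size input splits
intermediate = [pair for split in splits for pair in map_task(split)]
print(dict(reduce_task(k, v) for k, v in shuffle(intermediate).items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```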
A parallel DBMS relies on horizontal partitioning, where the rows of a relational table are distributed across the nodes of the cluster so that expensive tasks can be computed in parallel26.
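As a rough sketch of the horizontal partitioning just mentioned, the snippet below hashes each row's key to decide which node of a hypothetical cluster stores it, so that a query can be executed against all partitions in parallel. The node count and row layout are invented for illustration.

```python
import hashlib

NODES = 4                                     # hypothetical cluster size

def partition_of(key):
    """Hash-partition rows by key so each node holds a horizontal slice of the table."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NODES

rows = [(i, f"customer_{i}") for i in range(10)]          # rows of a relational table
partitions = {n: [] for n in range(NODES)}
for row in rows:
    partitions[partition_of(row[0])].append(row)

# Each node can now scan its own partition in parallel, e.g. counting rows:
print({node: len(part) for node, part in partitions.items()})
```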
• Handling Large Data - Parallel DBMSs were developed to improve the performance of database systems. As processor performance improved, it outstripped disk throughput, and critics have predicted from time to time that the I/O bottleneck would be a major problem. Both MapReduce and parallel DBMSs provide a means to process large volumes of data. As the volume of captured data continues to rise, questions have been asked as to whether the parallel DBMS paradigm can scale to meet demands: "There are no published deployments of parallel databases with nodes numbering into the thousands"27. As more and more nodes are added, the chance of a node failure increases, and parallel DBMSs do not handle node failure.
MapReduce has been designed to run on thousands of nodes and is inherently fault tolerant.
• Analytics - Both MapReduce and parallel DBMSs can be used to produce analytical results from big data. A parallel DBMS uses SQL as the retrieval method, while MapReduce uses general programming languages. In many data mining and data clustering applications the algorithm is complex and requires multiple passes over the data, which MapReduce performs very well28. Algorithms in which the output of one subprocess is the input to the next are difficult to implement in SQL, and performing such tasks in many steps reduces the performance benefits gained from a parallel DBMS.
Several aspects of the MapReduce process are worth mentioning in comparison with parallel DBMSs:
• MapReduce is transparently scalable. The user does not need to manage data placement or the number of nodes used for their job, and there are no dependencies on the underlying hardware.
• The data flow is well defined and goes in one direction, from the Map to the Reduce, with no communication between independent mapper or reducer processes.
• Because processing is independent, failover is trivial: a failed process can be restarted, provided the underlying file system is redundant, like HDFS.
• MapReduce, though powerful, does not fit all problem types29.
The various characteristics of distributed file systems have been studied. Based on the information above, MapReduce is the mechanism that can perform better on various distributed file systems like Ceph, GlusterFS, Lustre and HDFS. The reason for studying and illustrating these four file systems is that all of them can handle large data analytics using the MapReduce framework. This paper considers improving fault tolerance, checkpointing and the handling of small files, as well as better ways to manage volume in big data; deploying Ceph, GlusterFS and Lustre alongside the Hadoop Distributed File System is one such attempt to increase the efficiency of HDFS using MapReduce. We have explained here why MapReduce is more feasible for all of the above parameters than the other mechanism described in this paper, parallel databases: the MapReduce algorithm is useful as it splits a data set for parallel processing over a cluster, while inputs, scheduling, parallelization and machine failures are handled by the framework itself.
4. Study of Distributed File Systems

As discussed at the beginning of this paper, the various DFSs, whether they are network, cluster, cloud or parallel file systems, are studied below.
4.1. Ceph
Ceph is reliable, easy to manage, and free. Ceph makes it possible to manage vast amounts of data and delivers extraordinary scalability: thousands of clients can access petabytes to exabytes of data. A Ceph node can run smoothly on commodity hardware, and a cluster accommodates large numbers of nodes which communicate with each other to replicate and redistribute data dynamically30. Ceph stores a client's data in Object-based Storage Devices (OSDs)31, i.e. as objects within storage pools. Using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm, placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. It may be desirable to ensure that data replicas are placed on devices using different shelves, racks, power supplies, controllers and/or physical locations to address the possibility of concurrent failures32. Ceph OSDs are intelligent enough to detect errors and recover from failures. Since Ceph is a multi-petabyte-scale parallel, network, and distributed file system, its OSDs must also be dynamic and autonomic enough to migrate data when the Ceph cluster expands33.
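The snippet below sketches the idea behind failure-domain-aware replica placement in a very reduced form: replicas of an object are spread across devices in distinct racks so that a single rack failure cannot take out all copies. This is not the actual CRUSH algorithm, only an illustration of the placement constraint; the rack layout and device names are invented.

```python
import hashlib

# Hypothetical cluster layout: failure domain (rack) -> storage devices.
RACKS = {
    "rack1": ["osd.0", "osd.1"],
    "rack2": ["osd.2", "osd.3"],
    "rack3": ["osd.4", "osd.5"],
}

def place_replicas(object_name, replicas=3):
    """Pick one device per rack so no two replicas share a failure domain."""
    digest = int(hashlib.sha1(object_name.encode()).hexdigest(), 16)
    placement = []
    for i, rack in enumerate(sorted(RACKS)[:replicas]):   # deterministic rack order
        devices = RACKS[rack]
        placement.append(devices[(digest + i) % len(devices)])
    return placement

print(place_replicas("pool1/object_42"))                   # one OSD from each rack
```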
4.2. GlusterFS
GlusterFS distributes load using a distributed hash translation (DHT) of filenames onto its subvolumes, which are replicated to provide load handling and fault tolerance in a scale-out distributed file system supporting thousands of clients34.
In GlusterFS, the elementary storage units are called bricks. A server can have one or more bricks, on which data is stored via translators on lower-level file systems. GlusterFS can work with three different types of volumes: distributed volumes, replicated volumes and striped volumes35. The GlusterFS architecture aggregates several disks and memory resources into a single global namespace with one common mount point on a Linux machine. Thousands of applications and clients can connect to the GlusterFS file system via this mount point and interact with the stored data36.
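A minimal sketch of hashing filenames onto bricks, in the spirit of the DHT-based distribution just described; the brick names and the plain modulo hash are simplifying assumptions and do not reproduce GlusterFS's actual translator logic.

```python
import hashlib

BRICKS = ["server1:/brick1", "server2:/brick1", "server3:/brick1"]   # hypothetical bricks

def brick_for(filename):
    """Hash the file name and map it onto one of the bricks."""
    h = int(hashlib.sha1(filename.encode()).hexdigest(), 16)
    return BRICKS[h % len(BRICKS)]

for name in ["report.pdf", "photo.jpg", "notes.txt"]:
    print(name, "->", brick_for(name))
```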
4.3. Lustre
Lustre is a file system with high performance computing (HPC) capability that is able to process big data. Here system performance is of greater importance and the "moving computation" assumption no longer holds true, since data resides on dedicated storage servers rather than on the compute nodes. Lustre is a cluster file system based on a client/server model: data is stored on Object Storage Servers (OSSs) and metadata is stored on Metadata Servers (MDSs). The Lustre file system achieves great scalability and performance through the separation of metadata operations from normal data operations, and its journaling mechanisms provide recoverability from failure conditions. Hadoop's Distributed File System (HDFS) and Lustre are similar in terms of performance and storage capabilities; the basic difference is that HDFS has the costs and benefits of local storage while Lustre has the costs and benefits of centrally located storage37.
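The sketch below illustrates the separation of metadata and data paths described above: a client asks a metadata server for a file's layout and then reads the stripes directly from the object storage servers. The server names, the stripe layout and the dictionaries standing in for servers are all invented for illustration.

```python
# Hypothetical metadata server: file name -> list of (OSS, object id) stripes.
MDS = {"results.dat": [("oss1", "obj_17"), ("oss2", "obj_9")]}

# Hypothetical object storage servers holding the actual stripe data.
OSS = {
    "oss1": {"obj_17": b"first half of the file "},
    "oss2": {"obj_9": b"second half of the file"},
}

def read_file(name):
    """Metadata lookup goes to the MDS; bulk data is read directly from the OSSs."""
    layout = MDS[name]                                        # one small metadata operation
    return b"".join(OSS[oss][obj] for oss, obj in layout)     # parallelizable data reads

print(read_file("results.dat"))
```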
4.4. HDFS
A fundamental assumption the Hadoop system (HDFS) is based on is that "moving computation is cheaper than moving data": performing a computation on the node where the data resides is more efficient than moving the large data to the computation. HDFS gives good performance on "commodity" clusters, which are inexpensive and have relatively slow network fabrics38. An HDFS cluster employs a master-slave architecture consisting of a single NameNode (the master) and multiple DataNodes (the slaves), usually one per node in the cluster, and is designed for high throughput of data access rather than low latency. The NameNode manages the file system namespace and regulates access to files by clients, whereas the DataNodes are responsible for serving read and write requests from the file system's clients. In traditional Map/Reduce environments, input and output data are stored on HDFS as shown in Fig. 1, with intermediate data stored in a local, temporary file system on the Mapper nodes and shuffled as needed (via HTTP) to the nodes running the Reducer tasks39.
Two main methods are used to implement fault tolerance in HDFS: i) data duplication and ii) checkpoint and recovery40. Data duplication consists in duplicating the data on multiple DataNodes as the data is distributed. To write a file to HDFS, the client first contacts the NameNode, and the NameNode nominates a number of DataNodes (three by default) to replicate the data. The number of replicas can be increased, improving both fault tolerance and the bandwidth available when reading the file41. The checkpoint and recovery technique is similar to the concept of rollback: if a failure occurs, the system rolls back to the last saved synchronization point and the transaction starts again. This method is slower than data duplication, but on the other hand it needs fewer additional resources42.
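The following sketch mimics, in a few lines, the write path just described: the client asks a toy NameNode for a pipeline of DataNodes (three by default) and the block is copied to each of them. The class and node names are hypothetical; this is not the HDFS client API.

```python
import random

class ToyNameNode:
    """Keeps the namespace and nominates DataNodes for each block (default replication 3)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}                       # block id -> list of DataNodes holding it

    def allocate(self, block_id):
        pipeline = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = pipeline
        return pipeline

datanodes = {f"dn{i}": {} for i in range(1, 6)}   # five hypothetical DataNodes
namenode = ToyNameNode(list(datanodes), replication=3)

def write_block(block_id, data):
    """Client contacts the NameNode, then the block is replicated along the pipeline."""
    for dn in namenode.allocate(block_id):
        datanodes[dn][block_id] = data

write_block("blk_0001", b"some file contents")
print(namenode.block_map["blk_0001"])             # the three DataNodes holding replicas
```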
5. Use of MapReduce on Distributed File Systems

MapReduce is a computational paradigm in which an application is divided into many small fragments of work, each of which may be executed on any node in the cluster. The cluster can be an HPC cluster, or the client can be a node of a distributed file system or of a centralized system. Ceph is unable to provide coherent file chunk replicas and is thus bandwidth-limited. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant. While Ceph avoids the problem of having a single metadata server, it is still limited in terms of the number of file creates that can be performed per second43. Many researchers have written that MapReduce can be run on GlusterFS and will give better performance than on HDFS, although we have not found the execution of such a comparison published anywhere. A Hadoop plug-in provides compatibility with GlusterFS so that MapReduce and Hadoop-based applications can be executed without code rewrites, and it may provide a fault-tolerant file system. GlusterFS is a robust file system in which fault tolerance is one of the key aspects, and it has proved reliable for resilient data storage, since a metadata server is not part of this file system44,45. It would seem that the overarching advantage that Gluster provides is its flexibility in terms of volume types, access methods and integration with various other tools46.
(Comparison table fragment - Node failure: node failure occurs / chances of node failure / no failure / Secondary NameNode takes charge. Small files: not efficient / not efficient / suitable / small quantity of big files.)
While MapReduce implementations provide a straightforward job submission process that involves the whole cluster, HPC users submit their jobs to a Resource Management System (RMS) and need to specify the number of nodes and the amount of time that should be allocated to complete the job execution. This approach may confuse typical MapReduce users who are not used to doing this, and they may in turn always try to allocate the whole cluster for as long as possible; as a consequence, the turnaround time of the MapReduce job will probably increase47. Reliability is ensured by HDFS using a replication factor, whereas Lustre does not replicate any data (Fig. 2) but ensures reliability by having many Object Storage Servers connecting to multiple Object Storage Targets48. Finally, we would like to state that the MapReduce framework is the right mechanism to apply on distributed file systems that satisfy large data analytics: it maintains the performance of the nodes, whether they are client/server nodes in the case of HDFS or HPC nodes, as mentioned in Fig. 2, in the case of the Lustre parallel distributed file system. The preceding analytical study shows the use of MapReduce on distributed file systems and the performance of MapReduce with respect to the parameters elaborated in this paper.
6. Conclusion
To conclude, MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It is a framework that facilitates writing arbitrary distributed processing frameworks and applications. Whatever the choice of backing file system or HPC cluster, MapReduce supports data availability, placement and data security concerns. The integration of MapReduce on distributed nodes or on any HPC cluster will be part of future study and implementation; comparative testing on a much larger scale will also be undertaken in the future.
References
1. F.D. Sacerdoti,”Performance and Fault Tolerance in the StoreTorrent Parallel Filesystem”,arXiv Publications; 2010,pp. 1-13.
2. LI Jing-min, HE Guo-hui,Research of Distributed Database System Based on Hadoop, In Proc. of Information Science and Engineering
(ICISE);2010,pp.1417 - 1420.
3. M. C. Chan, J. R. Jiang, and S. T. Huang,Fault tolerant and secure networked storage,7th International Conference on Digital Information
Management, ICDIM;2012, pp. 186-191.
4. S.Verkuil, A Comparison of Fault-Tolerant Cloud Storage File Systems,19th Twente Student Conference on IT;2013.
5. A.J. Borr. Transaction Monitoring in ENCOMPASS: Reliable Distributed Transaction Processing, in : Proceedings of the 7th Intl. Conf. On
Very Large Data Bases;1981,pp.155–165.
6. David A. Patterson, Garth A. Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In : Proceedings of the
1985 ACM SIGMOD International Conference on Management of Data;1988,pp. 109–116.
7. T. Kosar, Data Intensive Distributed Computing: Challenges and Solutions for Large Scale Information Management, IGI Publications;2011.
8. A. Silberschatz, P. Baer Galvin, G. Gagne, Operating Systems, John Wiley & Sons.
9. L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM;1978.
10. J. Bent, G. Gibson, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate, PLFS: A Checkpoint Filesystem for Parallel Applications, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis; 2009.
11. S.A. Kiswany, M. Ripeanu, S. S. Vazhkudai, A. Gharaibeh.,stdchk: A Checkpoint Storage System for Desktop Grid Computing.In
Proceedings of the 28th International Conference on Distributed Computing Systems ,ICDCS 2008;June 2008,pp. 613-624.
12. S.A. Kiswany, M. Ripeanu, S. S. Vazhkudai, Aggregate Memory as an Intermediate Checkpoint Storage Device, ORNL Technical Report 013521.
13. J. S. Plank, K. Li, and M. A. Puening. Diskless Checkpointing, IEEE Transactions on Parallel and Distributed Systems;October 1998;
pp.972–986.
14. A. C. Arpaci-Dusseau. Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems. PhD Thesis,
University of California, Berkeley;1998.
15 P. Hu, W. Dai. Enhancing Fault Tolerance based on Hadoop Cluster,International Journal of Database Theory and Application,Vol.7, No.1;
2014, pp.37-48.
16. R.R. Sambasivan, S. Sinnamohideen, G.R. Ganger, J. Hendricks. Improving Small File Performance in Object-Based Storage; CMU-PDL-06-104; May 2006, pp. 1-19.
17. P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, T. Ludwig, Small-File Access in Parallel File Systems, Journal of Network and Computer Applications 35; 2012, pp. 1847-1862.
18. M.Rosenblum,J.K.Ousterhout, The Design and Implementation of a Log Structured File System. ACM Transactions on Computer Systems
;Feb 1992,pp. 26–52.
19. K.D. Bowers, A.Juels,A.Oprea. Hail: A High-Availability and Integrity Layer for Cloud Storage. In: Proceedings of the 16th ACM
conference on Computer and communications security, CCS ’09, New York, NY, USA;2009,pp.187–198.
20. I.Roy, S. T.V. Setty.A.Kilzer,V. Shmatikov,E.Witchel,Airavat: Security and Privacy for MapReduce;2010,pp.1-16.
21. G.Ateniese, R.D.Pietro, L.V.Mancini, G.Tsudik,Scalable and Efficient Provable Data Possession, in:Proceedings of the 4th International
Conference on Security and Privacy in Communication Networks, SecureComm’08;2008. ACM,pp. 1–9:10.
22. Q Tran,A solution for privacy protection in MapReduce,Computer Software and Applications Conference (COMPSAC), IEEE 36th Annual
Conference;Jul 2012,pp.515-520.
23. V.N. Inukollu, S. Arsi, S. Rao Ravuri, Security Issues Associated with Big Data in Cloud Computing, in: International Journal of Network Security & Its Applications (IJNSA), vol. 6, no. 3; May 2014, pp. 45-56.
24.Source-https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
25.J.Dean,S.Ghemawat,MapReduce: Simplified Data Processing on Large Clusters,Communication of The ACM;Jan. 2008,pp. 107-113.
26. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, in: Proceedings of the VLDB Endowment, vol. 2, no. 1; Aug. 2009, pp. 922-933.
27. M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, A. Rasin, MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, vol. 53, no. 1; Jan. 2010, pp. 64-71. [Online]. Available: http://doi.acm.org/10.1145/1629175.1629197
28. A. McClean, R.C. Conceição,M.O’Halloran,A Comparison of MapReduce and Parallel Database Management Systems, ICONS:The Eighth
International Conference on Systems;2013.
29. D. Eadline, Is Hadoop the New HPC?, news article from Admin Magazine; 2015.
30. Source - http://docs.ceph.com/docs/master/architecture/; Ceph Architecture.
31. D. Dalessandro, P. Wyckoff, P. Sadayappan, N. Ali, A. Devulapalli, An OSD-Based Approach to Managing Directory Operations in Parallel File Systems; 2008, pp. 175-184.
32. S.A.Weil,S.A.Brandt,E.L.Miller,C.Maltzahn,CRUSH:Controlled, Scalable, Decentralized Placement of Replicated Data;IEEE Explore;2006.
33. Sage A.Weil. Ceph: Reliable, scalable, and high-performance distributed storage;2007.
34. Joe Julian - Article on GlusterFS Replication Do's and Don'ts;2013.
35. Source : About Gluster, [Online :http://www.gluster.org/about/];2007.
36. http://www.systemfabricworks.com/products/system-fabric-works-storage-solutions/lustre-hadoop
37. N.Rutman,White Paper on MapReduce on Lustre;2011.
38. V.S.Patil, P.D.Soni, Hadoop Skeleton & Fault Tolerance in Hadoop Clusters,International Journal of Application or Innovation in
Engineering & Management, Volume 2, Issue 2;February 2013,pp.247-250.
39. J. Evans, Fault Tolerance in Hadoop for Work Migration, Technical Report CSCI B534 (Survey Paper), Indiana University;November 2011.
40. I.Goiri,F.Julià,J.Guitart,J.Torres,Checkpoint-Based Fault-Tolerant Infrastructure for Virtualized Service Providers. IEEE/IFIP Network
Operations and Management Symposium,IEEE. Osaka, Japan;April 2010,pp. 455-462.
41. M.C. Srivas et al., Map-Reduce Ready Distributed File System, Patent Application No. 13/162,439; Dec 2011.
42. R. Buyya, J.Broberg,A.Goscinski,A book on Cloud Computing Principles and Paradigms,Wiley Publications;2011.
43. A.T.Yimer, Investigating the Performance, Scalability & Reliability of a Distributed FileSystem:Ceph;May 2011.
44. J. Edge, An introduction to GlusterFS;March 2015.
45. Source - http://www.gluster.org/community/documentation/index.php/Hadoop
46. B.Depardon,G.L.Mahec,C.S´eguin,Analysis of Six Distributed File Systems;Feb 2013;pp. 2-37.
47. M.V. Neves, T. Ferreto, C. De Rose, Scheduling MapReduce Jobs in HPC Clusters, in: Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12); 2012, pp. 179-190.
48. Karan Singh, A book on Learning Ceph;Jan 2015.