Docs Template
In Partial Fulfillment
of the Requirements for the Course of
Human Computer Interaction and SIA
Abstract
Large volumes of data are generated over the network every day. MapReduce is a parallel
programming model well suited to processing such large data sets. The paper under review
surveys several distributed storage and distributed file systems, studying parameters such as
fault tolerance, replication, checkpointing, security, and the optimization of small-file access
using MapReduce. The distributed file systems reviewed were Ceph, GlusterFS, HDFS, and
Lustre, all of which are open source. Distributed file systems also play an important role in
protecting application data in cloud computing. The authors state that their approach is
efficient and scalable because of the MapReduce application.
The amount of data generated over the network grows every day, and the inadequacy of
traditional databases for massive data processing and storage is well observed. A distributed
file system stores data on multiple nodes and removes this bottleneck by allowing clients to
access data in parallel from the storage nodes. Furthermore, DFSs are sometimes called
network file systems. Distributed file systems provide persistent storage of unstructured data,
organized in a hierarchical namespace of files that is shared among networked nodes. This
data model and the interface offered to applications distinguish distributed file systems from
other storage types such as databases. To applications, it should be transparent whether data
resides on a distributed file system or on a local file system.
Even though the file system interface is general and fits a broad spectrum of applications,
most distributed file system implementations are optimized for particular applications, which
is why there is often a need to survey and study distributed file systems. These use cases
differ both qualitatively and quantitatively. Every use case poses high requirements in only
some of the dimensions; all of the use cases combined, however, would require a distributed
file system with outstanding performance in every dimension. Some system requirements
contradict each other: a high level of redundancy (e.g., for recorded experiment data)
inevitably reduces the write throughput in cases where redundancy is not needed (e.g., for a
scratch area). The file system interface offers no standard way to specify quality-of-service
properties for particular directories or files.
Related Work
Several published works are closely related to the paper under review. Among them are the
following: (1) Cassandra File System Over Hadoop Distributed File System by Ashish A.
Mutha and Vaishali M. Deshmukh, both from PRMIT&R College, Amravati, India; and (2) A
Survey on Distributed File System Technology by J. Blomer of Switzerland. Both published
studies are well regarded and have been used as references by subsequent researchers.
Fault Tolerance
Data is distributed over multiple machines, where there is always a chance of network
failure. A simple distributed file system with integrated fault tolerance is needed for efficient
handling of small records of data. Fault-tolerant data storage is becoming popular as data is
moved to the cloud. In practice, fault tolerance is achieved by dividing a file into smaller
chunks or fragments, which are processed and managed by a set of servers. Fault tolerance is
an important aspect of cloud storage, where the robustness of data is a major concern.
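To make the idea concrete, the sketch below (not taken from the reviewed paper) divides a file into fixed-size chunks and assigns each chunk to several servers for redundancy; the chunk size, server names, and replication factor are illustrative assumptions.

    import hashlib

    CHUNK_SIZE = 64 * 1024 * 1024                    # assumed 64 MB chunks
    SERVERS = ["node1", "node2", "node3", "node4"]   # hypothetical storage nodes
    REPLICAS = 3                                     # assumed replication factor

    def split_into_chunks(path):
        """Yield (chunk_id, data) pairs for a local file."""
        with open(path, "rb") as f:
            index = 0
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                chunk_id = hashlib.sha1(f"{path}:{index}".encode()).hexdigest()
                yield chunk_id, data
                index += 1

    def place_replicas(chunk_id):
        """Pick REPLICAS distinct servers for a chunk by hashing its id."""
        start = int(chunk_id, 16) % len(SERVERS)
        return [SERVERS[(start + i) % len(SERVERS)] for i in range(REPLICAS)]

If one server fails, the remaining replicas of each chunk keep the file readable, which is the robustness property discussed above.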
Checkpointing
The authors of the log-structured file system claim that such a system exhibits a performance
increase of an order of magnitude for small-file writes, while matching the performance of
non-log-structured file systems for large files. Managing free space is an important issue,
since large extents of free space must be kept available for writing after many overwrites and
deletions of small files. Object-based storage systems such as Ceph, which organize data as
objects for access, struggle with workloads that access large numbers of small files, as
produced by user workspaces and software installations. There are two reasons for this: loss
of namespace locality at the storage devices and the interaction of every file access with the
metadata server.
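As an illustration of the log-structured idea, the following toy sketch (an assumption-laden example, not the design of any surveyed system) appends every small write to the end of a single log file and keeps an in-memory index, so small-file writes become sequential appends; the free-space (segment cleaning) problem mentioned above is deliberately omitted.

    import os

    class TinyLogStore:
        """Toy log-structured store: all writes are appended to one log file."""
        def __init__(self, log_path="store.log"):
            self.log_path = log_path
            self.index = {}                      # filename -> (offset, length)
            open(log_path, "ab").close()         # create the log if missing

        def write(self, name, data: bytes):
            offset = os.path.getsize(self.log_path)
            with open(self.log_path, "ab") as log:
                log.write(data)                  # sequential append, no seeks
            self.index[name] = (offset, len(data))
            # NOTE: overwritten records are never reclaimed here; a real
            # log-structured file system must clean old segments.

        def read(self, name) -> bytes:
            offset, length = self.index[name]
            with open(self.log_path, "rb") as log:
                log.seek(offset)
                return log.read(length)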
Security
Recent efforts recognize the importance of self-protection in such big data management
systems, but they mainly focus on correctness and data privacy. The increasing popularity of
storing analytics and big data creates a need for efficient and secure data management
mechanisms. One of the most relevant security topics for handling such big data is preventing
users from damaging the stored data or from breaking data-access protocols and security
policies. Techniques such as logging, encryption, and privacy protection are necessary for
securing large data sets. Since distributed file systems underpin cloud computing systems,
their security issues and technologies are also applicable to cloud computing. IBM
researchers, for instance, prefer using Kerberos to secure the data environment.
Tool Support
These are the tools that support the study of distributed file systems using the MapReduce
framework.
Ceph
Ceph is free, easy to use, and reliable, and it is powerful enough to manage vast amounts of
data. Ceph delivers extraordinary scalability, allowing thousands of clients to access
petabytes to exabytes of data. A Ceph node runs smoothly on commodity hardware, so the
system accommodates large numbers of nodes that communicate with each other to
dynamically redistribute and replicate data. The placement policies can separate object
replicas across different failure domains while still maintaining the desired distribution, using
the CRUSH (Controlled Replication Under Scalable Hashing) algorithm.
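The actual CRUSH algorithm is considerably more elaborate, but the hypothetical sketch below conveys the idea of deterministic, hash-based placement that keeps replicas in distinct failure domains; the rack and OSD names are invented for illustration.

    import hashlib

    # Hypothetical cluster map: failure domain (rack) -> storage nodes.
    CLUSTER = {
        "rack-a": ["osd.0", "osd.1"],
        "rack-b": ["osd.2", "osd.3"],
        "rack-c": ["osd.4", "osd.5"],
    }

    def place(object_id: str, replicas: int = 3):
        """Deterministically map an object to one node in each of `replicas` racks."""
        digest = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
        racks = sorted(CLUSTER)                        # stable ordering of failure domains
        chosen = []
        for i in range(min(replicas, len(racks))):
            rack = racks[(digest + i) % len(racks)]    # one replica per rack
            nodes = CLUSTER[rack]
            chosen.append(nodes[digest % len(nodes)])
        return chosen

    print(place("object-42"))   # same object id always yields the same replica set

Because placement is computed rather than looked up, every client can locate an object's replicas without consulting a central table.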
GlusterFS
In GlusterFS, the elementary storage units are called bricks. A server can have more than one
brick, and bricks store data through translators on lower-level file systems. GlusterFS
distributes load using a distributed hash translation (DHT) of filenames onto its subvolumes,
which are replicated to provide fault tolerance and load handling in a scale-out distributed file
system supporting thousands of clients.
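The following is a simplified stand-in for GlusterFS's distribute translator, not its real hashing scheme: the filename is hashed and the hash selects one of the subvolumes (bricks), so lookups need no central directory. The brick paths are assumed for illustration.

    import zlib

    BRICKS = ["server1:/brick1", "server2:/brick1", "server3:/brick1"]  # assumed subvolumes

    def brick_for(filename: str) -> str:
        """Map a filename to a brick by hashing the name (illustrative only)."""
        return BRICKS[zlib.crc32(filename.encode()) % len(BRICKS)]

    # Every client computes the same brick for the same filename.
    print(brick_for("results.csv"))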
Lustre
Lustre is a file system built for high performance computing (HPC) and is capable of
processing Big Data. It is a cluster file system based on a client/server model. The Lustre file
system achieves high performance and scalability by separating metadata operations from
normal data operations: data is stored on Object Storage Servers (OSSs) and metadata is
stored on Metadata Servers (MDSs). Lustre and Hadoop's Distributed File System (HDFS)
are similar in terms of performance and storage capabilities.
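To illustrate the separation of metadata and data paths described above, here is a purely hypothetical sketch of a client read: the metadata server is consulted once for the file layout, after which data is fetched directly from the object storage servers. All names and stripe contents are invented.

    # Hypothetical layout returned by a metadata server (MDS):
    METADATA = {
        "/scratch/sim.out": {"osss": ["oss1", "oss2"]},
    }
    OSS_DATA = {  # invented stripe contents held by each OSS
        "oss1": {"/scratch/sim.out/0": b"abcd", "/scratch/sim.out/2": b"ijkl"},
        "oss2": {"/scratch/sim.out/1": b"efgh"},
    }

    def read_file(path: str) -> bytes:
        layout = METADATA[path]                      # one metadata operation on the MDS
        stripes, i = [], 0
        while True:                                  # data operations go straight to the OSSs
            oss = layout["osss"][i % len(layout["osss"])]
            chunk = OSS_DATA[oss].get(f"{path}/{i}")
            if chunk is None:
                break
            stripes.append(chunk)
            i += 1
        return b"".join(stripes)

    print(read_file("/scratch/sim.out"))   # b"abcdefghijkl"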
Application Integrity
Decomposition
Cryptographic hashes of file contents are often used to ensure data integrity. A cryptographic
hash provides a short, constant-length, practically unique identifier for data of any size.
Collisions are virtually impossible to produce, whether by chance or by clever crafting, which
makes cryptographic hashes a means of protecting against data tampering. Content hashing
also results in immutable data, which simplifies cache consistency and eliminates the problem
of detecting stale cache entries. Furthermore, redundant data and duplicated files can be
detected and stored only once.
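A brief sketch of this content-addressing idea follows; the in-memory store and the choice of SHA-256 are illustrative assumptions, not details of the reviewed systems.

    import hashlib

    store = {}   # content hash -> data; duplicated files collapse into one entry

    def put(data: bytes) -> str:
        """Store data under its SHA-256 digest and return the identifier."""
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)       # identical content is stored only once
        return digest

    def get(digest: str) -> bytes:
        data = store[digest]
        # Integrity check: recompute the hash to detect tampering or corruption.
        assert hashlib.sha256(data).hexdigest() == digest
        return data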
Efficiency
Caching and file striping are standard techniques for improving the speed of distributed file
systems. Caches can be located in memory, in flash memory, or on hard disks, and are most
often managed per file system node. Cache sizes need to be manually tuned according to the
working-set size of applications. Co-operative caches between nodes in a local network have
been discussed, but they are not implemented in today's production file systems. Dynamic
workload adaptation is a technique used in the Ceph file system to change the
metadata-to-metadata-server mapping based on server load.
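As a small illustration of per-node caching, the sketch below keeps a fixed-size least-recently-used cache in front of a hypothetical remote read; the capacity parameter corresponds to the manually tuned cache size mentioned above.

    from collections import OrderedDict

    class LRUCache:
        """Per-node cache: evicts the least recently used block when full."""
        def __init__(self, capacity_blocks: int):
            self.capacity = capacity_blocks      # tuned to the application's working set
            self.blocks = OrderedDict()          # block id -> data

        def read(self, block_id, fetch_remote):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)        # cache hit: mark as recently used
                return self.blocks[block_id]
            data = fetch_remote(block_id)                # cache miss: go to the file system
            self.blocks[block_id] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)          # evict least recently used block
            return data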
Highlights
Handling Large Data - Both MapReduce and parallel DBMSs provide a means to process
large volumes of data. As the volume of data captured continues to rise, questions have been
raised about whether the parallel DBMS paradigm can scale to meet demand. Parallel
DBMSs were developed to improve the performance of database systems; however,
improvements in processor performance have outstripped disk throughput, so critics have
predicted from time to time that the I/O bottleneck would become a major problem.
MapReduce was designed to be inherently fault tolerant and to run on thousands of nodes.
Analytics - Algorithms in which the output of one subprocess is the input to the next are
difficult to implement in SQL, and performing such tasks in many steps reduces the
performance benefits gained from a parallel DBMS. Both MapReduce and parallel DBMSs
can be used to produce analytical results from big data.
Impact
Data flow is strictly defined and moves in one direction, from the Map phase to the Reduce
phase, with no communication between independent mapper or reducer processes.
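This one-directional flow can be illustrated with the classic word-count example. The following is a single-process sketch of the programming model, not a distributed implementation.

    from collections import defaultdict

    def map_phase(document: str):
        """Mapper: emit (word, 1) pairs; mappers never talk to each other."""
        for word in document.split():
            yield word.lower(), 1

    def shuffle(pairs):
        """Group values by key before they reach the reducers."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(word, counts):
        """Reducer: combine all values for one key."""
        return word, sum(counts)

    docs = ["the cat sat", "the cat ran"]
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
    print(result)   # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}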
Strengths
Ceph is unable to provide coherent file chunk replicas and is therefore bandwidth limited.
MapReduce is a computational paradigm in which an application is divided into many small
fragments of work, each of which may be executed on any node in the cluster. The cluster
can be an HPC cluster, or the client can be a node of a distributed file system or a centralized
system. The data is replicated, making the system fault tolerant. Many researchers have
written that MapReduce can run on GlusterFS and will give better performance than on
HDFS.
Weaknesses
In the upcoming years, the computing landscape will move towards the exascale, meaning
data sets that routinely sum up to exabytes and supercomputers that provide computing power
in the exaflop range. While the gap between hard drive capacity and bandwidth widened by
one to two orders of magnitude over the last 20 years, the bandwidth of Ethernet networks
scaled at a similar pace to the capacity of hard drives. Raicu et al. predict the collapse of
exaflop supercomputing applications due to the limited storage bandwidth and the
architecture of today's distributed file systems. Experts suggest breaking the segregation
between compute networks and storage networks and building distributed file systems
accordingly. Integrating MapReduce on the distributed nodes or on an HPC cluster will be
part of future study and implementations.
Conclusions
Storyboard
References
8. T. Kosar, Data Intensive Distributed Computing: Challenges and Solutions for Large Scale
Information Management. IGI Publications; 2011.
9. A. Silberschatz, P. Baer Galvin, G. Gagne, Operating System Concepts. John Wiley & Sons.
10. L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System.
Communications of the ACM; 1978.
11. J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte,
M. Wingate, PLFS: A Checkpoint Filesystem for Parallel Applications.