Bigdata

Conclusion:

In Big Data analytics, the high dimensionality and streaming nature of the incoming data pose great computational challenges for data mining. Big Data grows continually, with fresh data being generated at all times; it therefore requires an incremental computation approach that can monitor large-scale data dynamically. Lightweight incremental algorithms should be considered that are capable of achieving robustness, high accuracy, and minimal pre-processing latency. In this paper, we investigated the possibility of using a group of incremental classification algorithms for classifying collected data streams pertaining to Big Data. As a case study, the empirical data streams were represented by five datasets from the UCI archive, drawn from different domains and containing very large numbers of features. We compared traditional classification model induction with its incremental counterparts. In particular, we proposed a novel lightweight feature selection method using Swarm Search and Accelerated PSO (APSO), which is intended to be useful for data stream mining. The evaluation results showed that the incremental method obtained a higher gain in accuracy per second incurred in pre-processing. The contribution of this paper is a spectrum of experimental insights for anyone who wishes to design data stream mining applications for Big Data analytics using a lightweight feature selection approach such as Swarm Search and APSO. In particular, APSO is designed for mining data streams on the fly. The combinatorial explosion is addressed by a swarm search approach applied in an incremental manner. This approach also fits better with real-world applications, where data arrive in streams. In addition, an incremental data mining approach is likely to meet the demands of Big Data problems in service computing.
BIG-DATA:

MapReduce has emerged as a popular and easy-to-use programming model for cloud
computing. It has been used by numerous organizations to process explosive amounts of data,
perform massive computation, and extract critical knowledge for business intelligence. Hadoop
is an open source implementation of MapReduce, currently maintained by the Apache
Foundation, and supported by leading IT companies such as Facebook and Yahoo!. Hadoop
implements the MapReduce framework with two categories of components: a JobTracker and many TaskTrackers.
The JobTracker commands TaskTrackers (a.k.a. slaves) to process data in parallel
through two main functions: map and reduce. In this process, the JobTracker is in charge of
scheduling map tasks (MapTasks) and reduce tasks (ReduceTasks) to TaskTrackers. It also
monitors their progress, collects runtime execution statistics, and handles possible faults and
errors through task reexecution. Between the two phases, a ReduceTask needs to fetch a part of
the intermediate output from all finished MapTasks. Globally, this leads to the shuffling of
intermediate data (in segments) from all MapTasks to all ReduceTasks. For many data-intensive
MapReduce programs, data shuffling can lead to a significant number of disk operations,
contending for the limited I/O bandwidth.
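To make the map and reduce functions concrete, here is a minimal word-count sketch in the standard org.apache.hadoop.mapreduce API; the class names and the exact tokenization are illustrative choices, not taken from the text above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

  // MapTask logic: each TaskTracker runs this over one input split,
  // emitting (word, 1) pairs as intermediate data.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);          // intermediate (key, value) pair
      }
    }
  }

  // ReduceTask logic: runs after the shuffle has fetched and merged all
  // segments for a given key from every finished MapTask.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // final (word, count)
    }
  }
}
```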
This presents a severe problem of disk I/O contention in MapReduce programs, which calls for further research on efficient data shuffling and merging algorithms. Prior work proposed the MapReduce Online architecture to open up direct network channels between MapTasks and ReduceTasks and speed up the delivery of data from MapTasks to ReduceTasks. It remains a critical issue to examine the relationship between Hadoop MapReduce's three data processing phases, i.e., shuffle, merge, and reduce, and their implications for the efficiency of Hadoop. Through an extensive examination of the Hadoop MapReduce framework, particularly its ReduceTasks, we reveal that the original architecture faces a number of challenges in extracting the best performance from the underlying system. To ensure the correctness of MapReduce, no ReduceTask can start reducing data until all intermediate data have been merged together. This results in a serialization barrier that significantly delays the reduce operation of ReduceTasks. More importantly, the current merge algorithm in Hadoop merges intermediate data segments from MapTasks when the number of available segments (including those that are already merged) goes over a threshold.
These segments are spilled to local disk storage when their total size is bigger than the available memory. This algorithm causes data segments to be merged repetitively and, therefore, leads to multiple rounds of disk accesses of the same data. To address these critical issues in the Hadoop MapReduce framework, we have designed Hadoop-A, a portable acceleration framework that can take advantage of plug-in components for performance enhancement and protocol optimization. Several enhancements are introduced: 1) a novel algorithm that enables ReduceTasks to perform data merging without repetitive merges and extra disk accesses; 2) a full pipeline that overlaps the shuffle, merge, and reduce phases for ReduceTasks; and 3) a portable implementation of Hadoop-A that supports both TCP/IP and remote direct memory access (RDMA).
Hadoop File System
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

Fault tolerance:
Failure is the norm rather than the exception. An HDFS instance may consist of thousands of server machines, each storing part of the file system's data. Since there is a huge number of components and each component has a non-trivial probability of failure, some component is always non-functional. Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

Data Characteristics:

• Streaming data access: applications need streaming access to data
• Batch processing rather than interactive user access
• Large data sets and files: gigabytes to terabytes in size
• High aggregate data bandwidth
• Scale to hundreds of nodes in a cluster
• Tens of millions of files in a single instance
• Write-once-read-many: a file, once created, written, and closed, need not be changed – this assumption simplifies coherency (illustrated in the sketch after this list)
• A MapReduce application or a web-crawler application fits perfectly with this model.
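As a small illustration of the write-once-read-many pattern listed above, the following sketch uses the public org.apache.hadoop.fs.FileSystem API to create a file, write it sequentially, close it, and then stream it back; the path and file contents are hypothetical examples.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example/log.txt"); // hypothetical path

    // Write once: create, write sequentially, close. The file is not edited later.
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ false)) {
      out.write("one record per line\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: any number of clients can stream the closed file.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, conf, /* close streams = */ false);
    }
  }
}
```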


MapReduce:

[Figure: MapReduce data flow for a word-count style job over a terabyte-scale collection of words (Cat, Bat, Dog, and other words). The input is divided into splits (split 0, split 1, split 2, ...); each split is processed by a map task, map output is combined locally, partitioned into parts, and shuffled to reduce tasks that produce the final output.]
Key Features of MapReduce Model:


• Designed for clouds
• Large clusters of commodity machines
• Designed for big data
• Support from a local-disk-based distributed file system (GFS / HDFS)
• Disk based intermediate data transfer in Shuffling
• MapReduce programming model
• Computation pattern: Map tasks and Reduce tasks
• Data abstraction: KeyValue pairs (a driver sketch tying these pieces together follows this list)
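The driver below is a sketch of how these features come together in a job: it wires the illustrative TokenizerMapper and SumReducer from the earlier sketch into Map and Reduce tasks over KeyValue pairs; the input and output paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Computation pattern: Map tasks and Reduce tasks
    job.setMapperClass(WordCountFunctions.TokenizerMapper.class);
    job.setCombinerClass(WordCountFunctions.SumReducer.class); // local combine before shuffle
    job.setReducerClass(WordCountFunctions.SumReducer.class);

    // Data abstraction: KeyValue pairs written back to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input splits come from HDFS; intermediate data is shuffled via local disks
    FileInputFormat.addInputPath(job, new Path("/data/words"));       // assumed input path
    FileOutputFormat.setOutputPath(job, new Path("/data/wordcount")); // assumed output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```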
HDFS Architecture:

[Figure: HDFS architecture. The NameNode holds the file system metadata (file names and replica locations, e.g. /home/foo/data). Clients send metadata operations to the NameNode and read/write block operations to the DataNodes. DataNodes on Rack 1 and Rack 2 store the blocks, and blocks are replicated across racks during writes.]
File system Namespace:


• Hierarchical file system with directories and files, supporting operations such as create, remove, move, and rename. The NameNode maintains the file system namespace; any change to the metadata is recorded by the NameNode. An application can specify the number of replicas of a file that it needs – the replication factor of the file – and this information is stored in the NameNode.

Data Replication:
• HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks, and all blocks in a file except the last are of the same size. Blocks are replicated for fault tolerance; block size and replication factor are configurable per file (see the sketch below). The NameNode receives a Heartbeat and a Block Report from each DataNode in the cluster; the Block Report lists all the blocks on a DataNode.
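As a sketch of per-file configuration, the following uses the FileSystem.create overload that accepts a replication factor and block size, plus setReplication to change the factor of an existing file; the path and the chosen values are examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/example/big.dat");   // hypothetical file

    short replication = 3;                    // number of replicas for this file
    long blockSize = 128L * 1024 * 1024;      // 128 MB blocks for this file
    int bufferSize = 4096;

    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.write(new byte[0]); // write the file contents here
    }

    // The replication factor of an existing file can also be changed later.
    fs.setReplication(file, (short) 2);
  }
}
```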
Replica Placement:
• The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems. HDFS uses rack-aware replica placement:
• Goal: improve reliability, availability, and network bandwidth utilization
• Still an active research topic
• A cluster spans many racks, and communication between racks goes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The NameNode determines the rack id of each DataNode. Replicas are typically placed on unique racks
• Simple but non-optimal
• Writes are expensive
• Replication factor is 3
• Another research topic?
• Replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack (a simplified sketch follows this list). One third of the replicas end up on one node, two thirds on one rack, and the remaining third are distributed evenly across the remaining racks.
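The following is a simplified, hypothetical illustration of the placement rule stated above (local node, another node in the local rack, then a node in a different rack). It is not the actual HDFS placement code; the Node type and the cluster layout are invented for the example, and a recent JDK is assumed for the record syntax.

```java
import java.util.ArrayList;
import java.util.List;

public class RackAwarePlacementSketch {

  // Hypothetical description of a DataNode: its host name and the rack id
  // that the NameNode's rack awareness assigns to it.
  record Node(String host, String rack) {}

  // Choose targets for a replication factor of 3, following the rule above.
  static List<Node> chooseTargets(Node writer, List<Node> cluster) {
    List<Node> targets = new ArrayList<>();
    targets.add(writer);                                   // replica 1: local node
    for (Node n : cluster) {                               // replica 2: other node, same rack
      if (n.rack().equals(writer.rack()) && !n.host().equals(writer.host())) {
        targets.add(n);
        break;
      }
    }
    for (Node n : cluster) {                               // replica 3: node on a different rack
      if (!n.rack().equals(writer.rack())) {
        targets.add(n);
        break;
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    List<Node> cluster = List.of(
        new Node("dn1", "rack1"), new Node("dn2", "rack1"),
        new Node("dn3", "rack2"), new Node("dn4", "rack2"));
    System.out.println(chooseTargets(cluster.get(0), cluster));
  }
}
```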

NameNode:
• Keeps an image of the entire file system namespace and file Blockmap in memory. 4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog changes to the FsImage, and then stores a copy of the FsImage back on the file system as a checkpoint. Checkpointing is done periodically so that the system can recover to the last checkpointed state in case of a crash (a conceptual sketch follows).
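The startup sequence described above can be pictured with a purely conceptual sketch: load the FsImage, replay the EditLog, and write a new checkpoint. The types and method names here are invented for illustration and do not reflect the NameNode's real internal API.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointSketch {

  // One EditLog entry: a change to apply to the in-memory namespace.
  interface NamespaceOp { void applyTo(List<String> namespace); }

  static List<String> loadFsImage() {                      // last checkpointed namespace
    return new ArrayList<>(List.of("/home", "/home/foo"));
  }

  static List<NamespaceOp> loadEditLog() {                 // changes made since that checkpoint
    NamespaceOp mkdir = ns -> ns.add("/home/foo/data");    // e.g. a directory was created
    return List.of(mkdir);
  }

  static void saveCheckpoint(List<String> namespace) {
    System.out.println("checkpoint written: " + namespace);
  }

  public static void main(String[] args) {
    List<String> namespace = loadFsImage();                // 1. read the FsImage
    for (NamespaceOp op : loadEditLog()) {                 // 2. replay the EditLog
      op.applyTo(namespace);
    }
    saveCheckpoint(namespace);                             // 3. store the updated FsImage
  }
}
```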

Datanode:

• A DataNode stores data in files in its local file system. The DataNode has no knowledge of the HDFS file system; it stores each block of HDFS data in a separate file. The DataNode does not create all files in the same directory; it uses heuristics to determine the optimal number of files per directory and creates directories accordingly.
• When the file system starts up, the DataNode generates a list of all HDFS blocks and sends this report to the NameNode: the Blockreport.

Reliable Storage: HDFS

Hadoop includes a fault‐tolerant storage system called the Hadoop Distributed File
System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally
and survive the failure of significant parts of the storage infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be
built with inexpensive computers. If one fails, Hadoop continues to operate the cluster
without losing data or interrupting work, by shifting work to the remaining machines in the
cluster.
HDFS manages storage on the cluster by breaking incoming files into pieces, called
“blocks,” and storing each of the blocks redundantly across the pool of servers. HDFS has
several useful features. In a simple deployment with three replicas per block, any two servers holding a given block can fail and the entire file will still be available.
HDFS notices when a block or a node is lost, and creates a new copy of missing data
from the replicas it manages. Because the cluster stores several copies of every block, more
clients can read them at the same time without creating bottlenecks. Of course there are
many other redundancy techniques, including the various strategies employed by RAID
machines.
HDFS offers two key advantages over RAID: It requires no special hardware, since it
can be built from commodity servers, and can survive more kinds of failure – a disk, a node
on the network or a network interface. The one obvious objection to HDFS – its
consumption of three times the necessary storage space for the files it manages – is not so
serious, given the plummeting cost of storage.

Hadoop for Big Data Analysis

Many popular tools for enterprise data management – relational database systems, for example – are designed to make simple queries run quickly. They use techniques like indexing to
examine just a small portion of all the available data in order to answer a question. Hadoop is a
different sort of tool. Hadoop is aimed at problems that require examination of all the available
data. For example, text analysis and image processing generally require that every single record
be read, and often interpreted in the context of similar records. Hadoop uses a technique called
MapReduce to carry out this exhaustive analysis quickly.
In the previous section, we saw that HDFS distributes blocks from a single file among a
large number of servers for reliability. Hadoop takes advantage of this data distribution by
pushing the work involved in an analysis out to many different servers. Each of the servers runs
the analysis on its own block from the file.
Running the analysis on the nodes that actually store the data delivers much better
performance than reading data over the network from a single centralized server. Hadoop
monitors jobs during execution, and will restart work lost due to node failure if necessary. In
fact, if a particular node is running very slowly, Hadoop will restart its work on another server
with a copy of the data.
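The slow-node behavior described above corresponds to Hadoop's speculative execution. Below is a small driver-side sketch of toggling it through the standard Hadoop 2.x+ configuration properties; the resulting Configuration would be passed to Job.getInstance as in the earlier driver sketch.

```java
import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Allow Hadoop to launch backup attempts of unusually slow map and reduce tasks.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);

    // Pass this Configuration to Job.getInstance(conf, ...) when building the job.
    System.out.println("map speculation enabled: "
        + conf.getBoolean("mapreduce.map.speculative", false));
  }
}
```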

SUMMARY

The institutional training has educated me on various aspects of system software. It is a real boon to the student community, paving the way for the enlightenment and enrichment of knowledge and wisdom. The training imparts to students the analytical skills needed to approach practical problems in industry. In addition, it instils a sense of reasoning and responsibility in the minds of learners.

This training serves as a bridge between theoretical knowledge and its practical application in a company, helping to close the gap between understanding concepts and applying them in practice.
