Big Data
In Big Data analytics, the high dimensionality and the streaming nature of the incoming data pose great computational challenges for data mining. Big Data grows continually, with fresh data being generated at all times; it therefore requires an incremental computation approach that can monitor large volumes of data dynamically. Lightweight incremental algorithms should be considered that are capable of achieving robustness, high accuracy and minimum pre-processing latency. In this paper, we investigated the possibility of using a group of incremental classification algorithms for classifying the collected data streams pertaining to Big Data. As a case study, the empirical data streams were represented by five datasets from the UCI archive, drawn from different domains and containing very large numbers of features. We compared traditional classification model induction with its counterpart, incremental induction. In particular, we proposed a novel lightweight feature selection method using Swarm Search and Accelerated PSO (APSO), which is intended to be useful for data stream mining. The evaluation results showed that the incremental method obtained a higher gain in accuracy per second incurred in pre-processing. The contribution of this paper is a spectrum of experimental insights for anybody who wishes to design data stream mining applications for Big Data analytics using a lightweight feature selection approach such as Swarm Search and APSO. In particular, APSO is designed for data mining of data streams on the fly. The combinatorial explosion is addressed by a swarm search approach applied in an incremental manner. This approach also fits better with real-world applications, where data arrive in streams. In addition, an incremental data mining approach is likely to meet the demand of Big Data problems in service computing.
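As a rough illustration of the kind of update an Accelerated PSO performs for feature selection, the Java sketch below applies one APSO step to a swarm of candidate positions and decodes a position into a feature mask. The parameter values and the 0.5 decoding threshold are illustrative assumptions, not the exact design used in the paper.

    import java.util.Random;

    /** Minimal sketch of an Accelerated PSO (APSO) update for wrapper-style
     *  feature selection; the decoding threshold and parameters are assumptions. */
    public class ApsoFeatureSelect {
        static final Random RNG = new Random(42);

        /** One APSO step: x <- (1 - beta) * x + beta * gBest + alpha * noise. */
        static void apsoStep(double[][] swarm, double[] gBest, double alpha, double beta) {
            for (double[] x : swarm) {
                for (int d = 0; d < x.length; d++) {
                    x[d] = (1 - beta) * x[d] + beta * gBest[d] + alpha * RNG.nextGaussian();
                }
            }
        }

        /** Decode a continuous position into a binary feature mask (assumed 0.5 threshold). */
        static boolean[] toMask(double[] x) {
            boolean[] mask = new boolean[x.length];
            for (int d = 0; d < x.length; d++) mask[d] = x[d] > 0.5;
            return mask;
        }
    }

In a wrapper setting, each mask would be scored by training a lightweight classifier on the selected features, and the best-scoring position becomes gBest for the next step.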
BIG-DATA:
MapReduce has emerged as a popular and easy-to-use programming model for cloud
computing. It has been used by numerous organizations to process explosive amounts of data,
perform massive computation, and extract critical knowledge for business intelligence. Hadoop
is an open source implementation of MapReduce, currently maintained by the Apache
Foundation, and supported by leading IT companies such as Facebook and Yahoo!. Hadoop
implements the MapReduce framework with two categories of components: a JobTracker and many
TaskTrackers.
The JobTracker commands TaskTrackers (a.k.a. slaves) to process data in parallel
through two main functions: map and reduce. In this process, the JobTracker is in charge of
scheduling map tasks (MapTasks) and reduce tasks (ReduceTasks) to TaskTrackers. It also
monitors their progress, collects runtime execution statistics, and handles possible faults and
errors through task reexecution. Between the two phases, a ReduceTask needs to fetch a part of
the intermediate output from all finished MapTasks. Globally, this leads to the shuffling of
intermediate data (in segments) from all MapTasks to all ReduceTasks. For many data-intensive
MapReduce programs, data shuffling can lead to a significant number of disk operations,
contending for the limited I/O bandwidth.
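To make the map and reduce functions concrete, the following is a minimal word-count job written against the Hadoop MapReduce API; it is a standard illustrative example, not code from the system described here. Each MapTask emits (word, 1) pairs for its input split, the framework shuffles them by key, and each ReduceTask sums the counts it receives.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Word-count mapper: emits (word, 1) for every token in the input split. */
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // shuffled to the ReduceTask responsible for this word
            }
        }
    }

    /** Word-count reducer: sums the counts shuffled in for each word. */
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }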
This presents a severe problem of disk I/O contention in MapReduce programs, which
entails further research on efficient data shuffling and merging algorithms. Prior work proposed the
MapReduce Online architecture to open up direct network channels between MapTasks and
ReduceTasks and speed up the delivery of data from MapTasks to ReduceTasks. It remains a
critical issue to examine the relationship of Hadoop MapReduce's three data processing phases,
i.e., shuffle, merge, and reduce, and their implications for the efficiency of Hadoop. Through an
extensive examination of the Hadoop MapReduce framework, particularly its ReduceTasks, we
reveal that the original architecture faces a number of challenging issues in exploiting the best
performance from the underlying system. To ensure the correctness of MapReduce, no
ReduceTasks can start reducing data until all intermediate data have been merged together. This
results in a serialization barrier that significantly delays the reduce operation of ReduceTasks.
More importantly, the current merge algorithm in Hadoop merges intermediate data segments
from MapTasks when the number of available segments (including those that are already
merged) goes over a threshold.
These segments are spilled to local disk storage when their total size is bigger than the
available memory. This algorithm causes data segments to be merged repetitively and, therefore,
multiple rounds of disk accesses of the same data. To address these critical issues for Hadoop
MapReduce framework, we have designed Hadoop-A, a portable acceleration framework that
can take advantage of plug-in components for performance enhancement and protocol
optimizations. Several enhancements are introduced: 1) a novel algorithm that enables
ReduceTasks to perform data merging without repetitive merges and extra disk accesses; 2) a full
pipeline is designed to overlap the shuffle, merge, and reduce phases for ReduceTasks; and 3) a
portable implementation of Hadoop-A that can support both TCP/IP and remote direct memory
access (RDMA).
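To illustrate the merge phase discussed above, here is a minimal single-pass k-way merge of already-sorted segments using a priority queue. It is only a sketch of the general idea, not Hadoop's or Hadoop-A's actual merge code, and it ignores spilling to disk when memory runs out.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    /** Illustrative single-pass k-way merge of sorted segments with a priority queue. */
    public class SegmentMerge {

        /** Holds the current head record of one sorted segment. */
        private static final class Head {
            String value;
            final Iterator<String> rest;
            Head(String value, Iterator<String> rest) { this.value = value; this.rest = rest; }
        }

        /** Merge already-sorted segments into one globally sorted list in a single pass. */
        static List<String> merge(List<List<String>> segments) {
            PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.value));
            for (List<String> seg : segments) {
                Iterator<String> it = seg.iterator();
                if (it.hasNext()) heap.add(new Head(it.next(), it));
            }
            List<String> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                merged.add(smallest.value);              // emit the globally smallest key
                if (smallest.rest.hasNext()) {           // refill from the same segment
                    smallest.value = smallest.rest.next();
                    heap.add(smallest);
                }
            }
            return merged;
        }
    }

Because every segment is consumed exactly once, this avoids the repeated re-merging and repeated disk reads of the same data that the text identifies as the bottleneck.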
Hadoop File System:
HDFS is designed to be highly fault-tolerant and to provide high throughput.
Fault tolerance:
Failure is the norm rather than the exception. An HDFS instance may consist of thousands of
server machines, each storing part of the file system's data. Since there are a huge number of
components and each component has a non-trivial probability of failure, some component is
always non-functional. Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
Data Characteristics:
Write-once-read-many: a file, once created, written and closed, need not be changed; this
assumption simplifies data coherency.
[Figure: MapReduce word-count data flow – input splits are processed by map tasks, combined locally, and the results are shuffled to reduce tasks, which write the output parts.]
[Figure: HDFS architecture – the NameNode holds the namespace metadata (name, replicas, e.g. /home/foo/data); a client issues metadata, read and write operations; DataNodes on Rack 1 and Rack 2 store the replicated blocks.]
Data Replication:
HDFS is designed to store very large files across machines in a large cluster. Each file is
a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are
replicated for fault tolerance. Block size and replicas are configurable per file. The
NameNode receives a Heartbeat and a Block Report from each DataNode in the cluster.
A Block Report contains a list of all blocks on a DataNode.
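Since block size and replication are configurable per file, a client can set both when creating a file through the HDFS FileSystem API, as in the sketch below; the NameNode URI and the file path are placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch: create an HDFS file with a per-file replication factor and block size. */
    public class HdfsReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; in practice this comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            Path file = new Path("/data/example.txt");
            short replication = 3;                 // number of replicas for this file
            long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
            int bufferSize = 4096;

            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeBytes("hello HDFS\n");
            }

            // Replication can also be changed after the file has been written.
            fs.setReplication(file, (short) 2);
        }
    }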
Replica Placement:
The placement of the replicas is critical to HDFS reliability and performance. Optimizing
replica placement distinguishes HDFS from other distributed file systems. Rack-aware
replica placement:
The goal is to improve reliability, availability and network bandwidth utilization; replica
placement remains an active research topic. A cluster consists of many racks, and communication
between racks goes through switches. Network bandwidth between machines on the same rack is
greater than between machines on different racks. The NameNode determines the rack id of each
DataNode. Placing every replica on a unique rack is simple but non-optimal, because it makes
writes expensive. With the default replication factor of 3, replicas are instead placed as follows:
one on a node in the local rack, one on a different node in the local rack, and one on a node in a
different rack. As a result, one third of the replicas are on one node, two thirds of the replicas are
on one rack, and the remaining third are distributed evenly across the remaining racks.
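The sketch below illustrates the placement just described: one target on the writer's node, one on another node in the same rack, and one on a different rack. The data structures and method are hypothetical illustrations, not HDFS's actual BlockPlacementPolicy code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of the rack-aware placement described above. */
    public class RackAwarePlacementSketch {

        /** nodesByRack maps rack id -> DataNodes on that rack (assumed input shape). */
        static List<String> chooseTargets(String writerNode, String writerRack,
                                          Map<String, List<String>> nodesByRack) {
            List<String> targets = new ArrayList<>();
            targets.add(writerNode);                                   // replica 1: local node

            for (String node : nodesByRack.get(writerRack)) {          // replica 2: same rack, other node
                if (!node.equals(writerNode)) { targets.add(node); break; }
            }

            for (Map.Entry<String, List<String>> e : nodesByRack.entrySet()) {
                if (!e.getKey().equals(writerRack) && !e.getValue().isEmpty()) {
                    targets.add(e.getValue().get(0));                  // replica 3: a different rack
                    break;
                }
            }
            return targets;                                            // one target per replica
        }
    }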
NameNode:
The NameNode keeps an image of the entire file system namespace and the file Blockmap in
memory. 4 GB of local RAM is sufficient to support these data structures, which represent a huge
number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog
from its local file system, updates the FsImage with the EditLog information, and then stores a
copy of the FsImage on the file system as a checkpoint. Periodic checkpointing is done so that
the system can recover back to the last checkpointed state in case of a crash.
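The checkpoint logic can be pictured roughly as follows; the interface and method names are hypothetical stand-ins for the NameNode's internal FsImage and EditLog handling, not Hadoop's actual classes.

    import java.util.List;

    /** Hypothetical sketch of the checkpoint idea: load the last FsImage, replay the
     *  EditLog on top of it, and persist the result as the new FsImage. */
    public class CheckpointSketch {

        interface NamespaceImage {                 // in-memory namespace (FsImage contents)
            void apply(String editLogRecord);      // replay one logged namespace change
            void saveTo(String path);              // write a new checkpoint to disk
        }

        static void checkpoint(NamespaceImage image, List<String> editLog, String fsImagePath) {
            for (String record : editLog) {
                image.apply(record);               // bring the in-memory image up to date
            }
            image.saveTo(fsImagePath);             // new checkpoint; the EditLog can be truncated
        }
    }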
Datanode:
A DataNode stores data in files in its local file system. The DataNode has no knowledge of the
HDFS file system; it stores each block of HDFS data in a separate file. The DataNode does not
create all files in the same directory: it uses heuristics to determine the optimal number of
files per directory and creates directories appropriately.
When a DataNode starts up, it scans its local file system, generates a list of all HDFS blocks that
correspond to its local files, and sends this report to the NameNode: the Blockreport.
Hadoop includes a fault‐tolerant storage system called the Hadoop Distributed File
System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally
and survive the failure of significant parts of the storage infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be
built with inexpensive computers. If one fails, Hadoop continues to operate the cluster
without losing data or interrupting work, by shifting work to the remaining machines in the
cluster.
HDFS manages storage on the cluster by breaking incoming files into pieces, called
“blocks,” and storing each of the blocks redundantly across the pool of servers. HDFS has
several useful features. In a simple deployment with three replicas per block, any two servers
holding a given block can fail and the entire file will still be available.
HDFS notices when a block or a node is lost, and creates a new copy of missing data
from the replicas it manages. Because the cluster stores several copies of every block, more
clients can read them at the same time without creating bottlenecks. Of course there are
many other redundancy techniques, including the various strategies employed by RAID
machines.
HDFS offers two key advantages over RAID: It requires no special hardware, since it
can be built from commodity servers, and can survive more kinds of failure – a disk, a node
on the network or a network interface. The one obvious objection to HDFS – its
consumption of three times the necessary storage space for the files it manages – is not so
serious, given the plummeting cost of storage.
Many popular tools for enterprise data management, such as relational database systems, are
designed to make simple queries run quickly. They use techniques like indexing to
examine just a small portion of all the available data in order to answer a question. Hadoop is a
different sort of tool. Hadoop is aimed at problems that require examination of all the available
data. For example, text analysis and image processing generally require that every single record
be read, and often interpreted in the context of similar records. Hadoop uses a technique called
MapReduce to carry out this exhaustive analysis quickly.
In the previous section, we saw that HDFS distributes blocks from a single file among a
large number of servers for reliability. Hadoop takes advantage of this data distribution by
pushing the work involved in an analysis out to many different servers. Each of the servers runs
the analysis on its own block from the file.
Running the analysis on the nodes that actually store the data delivers much better
performance than reading data over the network from a single centralized server. Hadoop
monitors jobs during execution, and will restart work lost due to node failure if necessary. In
fact, if a particular node is running very slowly, Hadoop will restart its work on another server
with a copy of the data.
SUMMARY
This training serves as a bridge between theoretical knowledge and its practical application in
the company. It helps close the gap between understanding conceptual knowledge and applying
it in a practical way.