Big Data
In Big Data analytics, the high dimensionality and the streaming nature of the incoming data pose great computational challenges for data mining. Big Data grows continually, with fresh data being generated at all times; it therefore requires an incremental computation approach that can monitor large volumes of data dynamically. Lightweight incremental algorithms should be considered that are capable of achieving robustness, high accuracy and minimum pre-processing latency. In this paper, we investigated the possibility of using a group of incremental classification algorithms for classifying the collected data streams pertaining to Big Data. As a case study, the empirical data streams were represented by five datasets from the UCI archive, drawn from different domains and containing very large numbers of features. We compared traditional classification model induction with its counterpart, incremental induction. In particular, we proposed a novel lightweight feature selection method using Swarm Search and Accelerated PSO (APSO), which is intended to be useful for data stream mining. The evaluation results showed that the incremental method obtained a higher gain in accuracy per second incurred in pre-processing. The contribution of this paper is a spectrum of experimental insights for anybody who wishes to design data stream mining applications for Big Data analytics using a lightweight feature selection approach such as Swarm Search and APSO. In particular, APSO is designed for data mining of data streams on the fly. The combinatorial explosion is addressed by a swarm search approach applied in an incremental manner. This approach also fits better with real-world applications, where data arrive in streams. In addition, an incremental data mining approach is likely to meet the demand of Big Data problems in service computing.
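As a rough illustration of the kind of update an Accelerated PSO performs for feature selection, the Java sketch below applies one APSO step to a swarm of candidate positions and decodes a position into a feature mask. The parameter values and the 0.5 decoding threshold are illustrative assumptions, not the exact design used in the paper.

    import java.util.Random;

    /** Minimal sketch of an Accelerated PSO (APSO) update for wrapper-style
     *  feature selection; the decoding threshold and parameters are assumptions. */
    public class ApsoFeatureSelect {
        static final Random RNG = new Random(42);

        /** One APSO step: x <- (1 - beta) * x + beta * gBest + alpha * noise. */
        static void apsoStep(double[][] swarm, double[] gBest, double alpha, double beta) {
            for (double[] x : swarm) {
                for (int d = 0; d < x.length; d++) {
                    x[d] = (1 - beta) * x[d] + beta * gBest[d] + alpha * RNG.nextGaussian();
                }
            }
        }

        /** Decode a continuous position into a binary feature mask (assumed 0.5 threshold). */
        static boolean[] toMask(double[] x) {
            boolean[] mask = new boolean[x.length];
            for (int d = 0; d < x.length; d++) mask[d] = x[d] > 0.5;
            return mask;
        }
    }

In a wrapper setting, each mask would be scored by training a lightweight classifier on the selected features, and the best-scoring position becomes gBest for the next step.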
BIG-DATA:
MapReduce has emerged as a popular and easy-to-use programming model for cloud
computing. It has been used by numerous organizations to process explosive amounts of data,
perform massive computation, and extract critical knowledge for business intelligence. Hadoop
is an open source implementation of MapReduce, currently maintained by the Apache
Foundation, and supported by leading IT companies such as Facebook and Yahoo!. Hadoop
implements the MapReduce framework with two categories of components: a JobTracker and many
TaskTrackers.
The JobTracker commands TaskTrackers (a.k.a. slaves) to process data in parallel
through two main functions: map and reduce. In this process, the JobTracker is in charge of
scheduling map tasks (MapTasks) and reduce tasks (ReduceTasks) to TaskTrackers. It also
monitors their progress, collects runtime execution statistics, and handles possible faults and
errors through task reexecution. Between the two phases, a ReduceTask needs to fetch a part of
the intermediate output from all finished MapTasks. Globally, this leads to the shuffling of
intermediate data (in segments) from all MapTasks to all ReduceTasks. For many data-intensive
MapReduce programs, data shuffling can lead to a significant number of disk operations,
contending for the limited I/O bandwidth.
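To make the map and reduce functions concrete, the following is a minimal word-count job written against the Hadoop MapReduce API; it is a standard illustrative example, not code from the system described here. Each MapTask emits (word, 1) pairs for its input split, the framework shuffles them by key, and each ReduceTask sums the counts it receives.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Word-count mapper: emits (word, 1) for every token in the input split. */
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // shuffled to the ReduceTask responsible for this word
            }
        }
    }

    /** Word-count reducer: sums the counts shuffled in for each word. */
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }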
This presents a severe problem of disk I/O contention in MapReduce programs, which
entails further research on efficient data shuffling and merging algorithms. Prior work proposed the
MapReduce Online architecture to open up direct network channels between MapTasks and
ReduceTasks and speed up the delivery of data from MapTasks to ReduceTasks. It remains a
critical issue to examine the relationship of Hadoop MapReduce's three data processing phases,
i.e., shuffle, merge, and reduce, and their implications for the efficiency of Hadoop. Through an
extensive examination of the Hadoop MapReduce framework, particularly its ReduceTasks, we
reveal that the original architecture faces a number of challenging issues in exploiting the best
performance from the underlying system. To ensure the correctness of MapReduce, no
ReduceTasks can start reducing data until all intermediate data have been merged together. This
results in a serialization barrier that significantly delays the reduce operation of ReduceTasks.
More importantly, the current merge algorithm in Hadoop merges intermediate data segments
from MapTasks when the number of available segments (including those that are already
merged) goes over a threshold.
These segments are spilled to local disk storage when their total size is bigger than the
available memory. This algorithm causes data segments to be merged repetitively and, therefore,
multiple rounds of disk accesses of the same data. To address these critical issues for Hadoop
MapReduce framework, we have designed Hadoop-A, a portable acceleration framework that
can take advantage of plug-in components for performance enhancement and protocol
optimizations. Several enhancements are introduced: 1) a novel algorithm that enables
ReduceTasks to perform data merging without repetitive merges and extra disk accesses; 2) a full
pipeline is designed to overlap the shuffle, merge, and reduce phases for ReduceTasks; and 3) a
portable implementation of Hadoop-A that can support both TCP/IP and remote direct memory
access (RDMA).
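To illustrate the merge phase discussed above, here is a minimal single-pass k-way merge of already-sorted segments using a priority queue. It is only a sketch of the general idea, not Hadoop's or Hadoop-A's actual merge code, and it ignores spilling to disk when memory runs out.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    /** Illustrative single-pass k-way merge of sorted segments with a priority queue. */
    public class SegmentMerge {

        /** Holds the current head record of one sorted segment. */
        private static final class Head {
            String value;
            final Iterator<String> rest;
            Head(String value, Iterator<String> rest) { this.value = value; this.rest = rest; }
        }

        /** Merge already-sorted segments into one globally sorted list in a single pass. */
        static List<String> merge(List<List<String>> segments) {
            PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.value));
            for (List<String> seg : segments) {
                Iterator<String> it = seg.iterator();
                if (it.hasNext()) heap.add(new Head(it.next(), it));
            }
            List<String> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                merged.add(smallest.value);              // emit the globally smallest key
                if (smallest.rest.hasNext()) {           // refill from the same segment
                    smallest.value = smallest.rest.next();
                    heap.add(smallest);
                }
            }
            return merged;
        }
    }

Because every segment is consumed exactly once, this avoids the repeated re-merging and repeated disk reads of the same data that the text identifies as the bottleneck.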
Hadoop File System:
HDFS is designed to be highly fault-tolerant and to provide high throughput.
Fault tolerance:
Failure is the norm rather than the exception. An HDFS instance may consist of thousands of
server machines, each storing part of the file system's data. Since there are a huge number of
components and each component has a non-trivial probability of failure, some component is
always non-functional. Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
Data Characteristics:
Write-once-read-many: a file, once created, written and closed, need not be changed; this
assumption simplifies data coherency.
[Figure: MapReduce word-count data flow – input splits are processed by map tasks, combined locally, and the results are shuffled to reduce tasks, which write the output parts.]
[Figure: HDFS architecture – the NameNode holds the namespace metadata (name, replicas, e.g. /home/foo/data); a client issues metadata, read and write operations; DataNodes on Rack 1 and Rack 2 store the replicated blocks.]
Data Replication:
HDFS is designed to store very large files across machines in a large cluster. Each file is
a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are
replicated for fault tolerance. Block size and replicas are configurable per file. The
NameNode receives a Heartbeat and a Block Report from each DataNode in the cluster.
A Block Report contains a list of all blocks on a DataNode.
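Since block size and replication are configurable per file, a client can set both when creating a file through the HDFS FileSystem API, as in the sketch below; the NameNode URI and the file path are placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch: create an HDFS file with a per-file replication factor and block size. */
    public class HdfsReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; in practice this comes from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            Path file = new Path("/data/example.txt");
            short replication = 3;                 // number of replicas for this file
            long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
            int bufferSize = 4096;

            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeBytes("hello HDFS\n");
            }

            // Replication can also be changed after the file has been written.
            fs.setReplication(file, (short) 2);
        }
    }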
Replica Placement:
The placement of the replicas is critical to HDFS reliability and performance. Optimizing
replica placement distinguishes HDFS from other distributed file systems. Rack-aware
replica placement:
The goal is to improve reliability, availability and network bandwidth utilization; replica
placement remains an active research topic. A cluster consists of many racks, and communication
between racks goes through switches. Network bandwidth between machines on the same rack is
greater than between machines on different racks. The NameNode determines the rack id of each
DataNode. Placing every replica on a unique rack is simple but non-optimal, because it makes
writes expensive. With the default replication factor of 3, replicas are instead placed as follows:
one on a node in the local rack, one on a different node in the local rack, and one on a node in a
different rack. As a result, one third of the replicas are on one node, two thirds of the replicas are
on one rack, and the remaining third are distributed evenly across the remaining racks.
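The sketch below illustrates the placement just described: one target on the writer's node, one on another node in the same rack, and one on a different rack. The data structures and method are hypothetical illustrations, not HDFS's actual BlockPlacementPolicy code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of the rack-aware placement described above. */
    public class RackAwarePlacementSketch {

        /** nodesByRack maps rack id -> DataNodes on that rack (assumed input shape). */
        static List<String> chooseTargets(String writerNode, String writerRack,
                                          Map<String, List<String>> nodesByRack) {
            List<String> targets = new ArrayList<>();
            targets.add(writerNode);                                   // replica 1: local node

            for (String node : nodesByRack.get(writerRack)) {          // replica 2: same rack, other node
                if (!node.equals(writerNode)) { targets.add(node); break; }
            }

            for (Map.Entry<String, List<String>> e : nodesByRack.entrySet()) {
                if (!e.getKey().equals(writerRack) && !e.getValue().isEmpty()) {
                    targets.add(e.getValue().get(0));                  // replica 3: a different rack
                    break;
                }
            }
            return targets;                                            // one target per replica
        }
    }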
NameNode:
The NameNode keeps an image of the entire file system namespace and the file Blockmap in
memory. 4 GB of local RAM is sufficient to support these data structures, which represent a huge
number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog
from its local file system, updates the FsImage with the EditLog information, and then stores a
copy of the FsImage on the file system as a checkpoint. Periodic checkpointing is done so that
the system can recover back to the last checkpointed state in case of a crash.
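The checkpoint logic can be pictured roughly as follows; the interface and method names are hypothetical stand-ins for the NameNode's internal FsImage and EditLog handling, not Hadoop's actual classes.

    import java.util.List;

    /** Hypothetical sketch of the checkpoint idea: load the last FsImage, replay the
     *  EditLog on top of it, and persist the result as the new FsImage. */
    public class CheckpointSketch {

        interface NamespaceImage {                 // in-memory namespace (FsImage contents)
            void apply(String editLogRecord);      // replay one logged namespace change
            void saveTo(String path);              // write a new checkpoint to disk
        }

        static void checkpoint(NamespaceImage image, List<String> editLog, String fsImagePath) {
            for (String record : editLog) {
                image.apply(record);               // bring the in-memory image up to date
            }
            image.saveTo(fsImagePath);             // new checkpoint; the EditLog can be truncated
        }
    }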
Datanode:
A DataNode stores data in files in its local file system. The DataNode has no knowledge of the
HDFS file system; it stores each block of HDFS data in a separate file. The DataNode does not
create all files in the same directory: it uses heuristics to determine the optimal number of
files per directory and creates directories appropriately.
When a DataNode starts up, it scans its local file system, generates a list of all HDFS blocks that
correspond to its local files, and sends this report to the NameNode: the Blockreport.
Hadoop includes a fault‐tolerant storage system called the Hadoop Distributed File
System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally
and survive the failure of significant parts of the storage infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be
built with inexpensive computers. If one fails, Hadoop continues to operate the cluster
without losing data or interrupting work, by shifting work to the remaining machines in the
cluster.
HDFS manages storage on the cluster by breaking incoming files into pieces, called
“blocks,” and storing each of the blocks redundantly across the pool of servers. HDFS has
several useful features. In a simple deployment with three replicas per block, any two servers
holding a given block can fail and the entire file will still be available.
HDFS notices when a block or a node is lost, and creates a new copy of missing data
from the replicas it manages. Because the cluster stores several copies of every block, more
clients can read them at the same time without creating bottlenecks. Of course there are
many other redundancy techniques, including the various strategies employed by RAID
machines.
HDFS offers two key advantages over RAID: It requires no special hardware, since it
can be built from commodity servers, and can survive more kinds of failure – a disk, a node
on the network or a network interface. The one obvious objection to HDFS – its
consumption of three times the necessary storage space for the files it manages – is not so
serious, given the plummeting cost of storage.
Many popular tools for enterprise data management, such as relational database systems, are
designed to make simple queries run quickly. They use techniques like indexing to
examine just a small portion of all the available data in order to answer a question. Hadoop is a
different sort of tool. Hadoop is aimed at problems that require examination of all the available
data. For example, text analysis and image processing generally require that every single record
be read, and often interpreted in the context of similar records. Hadoop uses a technique called
MapReduce to carry out this exhaustive analysis quickly.
In the previous section, we saw that HDFS distributes blocks from a single file among a
large number of servers for reliability. Hadoop takes advantage of this data distribution by
pushing the work involved in an analysis out to many different servers. Each of the servers runs
the analysis on its own block from the file.
Running the analysis on the nodes that actually store the data delivers much better
performance than reading data over the network from a single centralized server. Hadoop
monitors jobs during execution, and will restart work lost due to node failure if necessary. In
fact, if a particular node is running very slowly, Hadoop will restart its work on another server
with a copy of the data.
SUMMARY
This training serves as a bridge between theoretical knowledge and its practical application in
the company. It helps close the gap between understanding conceptual knowledge and applying
it in a practical way.