Seminar Big Data Hadoop

The document discusses big data and the Hadoop framework. It defines big data as extremely large data sets that cannot be processed by traditional data management tools due to their size and complexity. Big data comes in structured, unstructured, and semi-structured forms. The Hadoop framework allows for the distributed storage and processing of big data across clusters of computers. Hadoop uses HDFS for storage and MapReduce as its programming model to process data in parallel across nodes.


BIG DATA & HADOOP

INTERNAL GUIDE: TASNEEM MAM
2/21/2019
IQRA BCA COLLEGE

BIG DATA
What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.

What is Big Data?

Big Data is data, but of enormous size. Big Data is a term used to describe a
collection of data that is huge in volume and yet growing exponentially with time. In
short, such data is so large and complex that none of the traditional data management
tools can store or process it efficiently.

Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in a fixed format is
termed 'structured' data. Over time, computer science has achieved great success
in developing techniques for working with such data (where the format is well
known in advance) and in deriving value out of it. However, nowadays we are
facing issues when the size of such data grows to a huge extent, with typical
sizes in the range of multiple zettabytes.

Do you know? 10^21 bytes equal one zettabyte; in other words, one billion terabytes
form a zettabyte.

Looking at these figures one can easily understand why the name Big Data is given
and imagine the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one
example of 'structured' data.

Examples Of Structured Data


An 'Employee' table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs

2365 Rajesh Kulkarni Male Finance 650000

3398 Pratibha Joshi Female Admin 650000

7465 Shushil Roy Male Admin 500000

7500 Shubhojit Das Male Finance 500000

7699 Priya Sane Female Finance 550000

Unstructured
Any data with an unknown form or structure is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges in terms of
processing it to derive value. A typical example of unstructured data is a
heterogeneous data source containing a combination of simple text files, images,
videos, etc. Nowadays organizations have a wealth of data available to them, but
unfortunately they do not know how to derive value out of it, since this data is in
its raw, unstructured form.

Examples Of Un-structured Data

The output returned by 'Google Search'


Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured
data as structured in form, but it is not actually defined with, for example, a table
definition in a relational DBMS. An example of semi-structured data is data
represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>

Data Growth over the years


Please note that web application data, which is unstructured, consists of log files,
transaction history files etc. OLTP systems are built to work with structured data
wherein data is stored in relations (tables).

Characteristics of Big Data

(i) Volume – The name Big Data itself is related to a size which is enormous. The size
of data plays a very crucial role in determining the value of the data. Whether
particular data can actually be considered Big Data or not also depends upon its
volume. Hence, 'Volume' is one characteristic which needs to be considered while
dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of
data considered by most applications. Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses certain issues for storing,
mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet demands determines the real
potential of the data.

Big Data velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, social media sites, sensors,
mobile devices, etc. The flow of data is massive and continuous.


(iv) Variability – This refers to the inconsistency which can be shown by the data
at times, thus hampering the process of being able to handle and manage the data
effectively.

The objectives of big data are:


 Understanding and Targeting Customers: Big data helps an
organization understand its customers better and helps it
narrow down its target audience, thus improving its
marketing campaigns.
 Taking Strategic Decisions: With big data, businesses can
take data-backed decisions. No need to fumble in the dark,
when the huge volume of data provides practically every bit
of information about your business, market, customers,
industry, and competition.
 Cost Optimization: Costs can be better optimized, when you
have the data to know which elements are draining costs but
not returning high value. For instance, the healthcare sector
can utilize big data to find the cause of cost hike of healthcare
facilities.
 Improving Customer Experiences: Take the case of retail.
Big data can help retailers & wholesalers show customer
reviews about the product’s quality & delivery time, thus
improving customer experiences in the retail buying process.

Technical objectives
 To roll out improved intelligence through cross-sector and
multilingual tools, turning the Big Data 4Vs (Volume, Variety,
Veracity, Velocity) into Value
 To deliver an open, secure, privacy-respectful, configurable,
scalable, cloud-based Big Data Infrastructure as a Service
benefiting all actors in the value chain


Hadoop Framework
What Is Hadoop?
Hadoop is a framework that allows you to first store Big Data in a distributed
environment, so that you can process it in parallel.
Hadoop is an open-source software programming framework for storing large amounts
of data and performing computation. The framework is based on Java programming.

Hadoop is a collection of open-source software utilities that facilitate using a network of
many computers to solve problems involving massive amounts of data and
computation. It provides a software framework for distributed storage and processing
of big data using the MapReduce programming model. Originally designed for computer
clusters built from commodity hardware (still the common use), it has also found use
on clusters of higher-end hardware. All the modules in Hadoop are designed with a
fundamental assumption that hardware failures are common occurrences and should be
automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed
File System (HDFS), and a processing part, which is the MapReduce programming model.
Hadoop splits files into large blocks and distributes them across nodes in a cluster. It
then transfers packaged code to the nodes to process the data in parallel. This approach
takes advantage of data locality, where nodes manipulate the data they have access to.
This allows the dataset to be processed faster and more efficiently than it would be in a
more conventional supercomputer architecture that relies on a parallel file system where
computation and data are distributed via high-speed networking.

Preliminary Survey
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they both
began to work on the Apache Nutch project. The Apache Nutch project was an effort to
build a search engine system that could index 1 billion pages. After a lot of research on
Nutch, they concluded that such a system would cost around half a million dollars in
hardware, along with a monthly running cost of approximately $30,000, which is
very expensive. They realized that their project architecture would not be capable
of working with billions of pages on the web. So they were looking for a
feasible solution that could reduce the implementation cost as well as solve the problem
of storing and processing large datasets.

In 2003, they came across a paper that described the architecture of Google's
distributed file system, called GFS (Google File System), which was published by
Google for storing large data sets. They realized that this paper could solve their
problem of storing the very large files being generated by the web crawling
and indexing processes. But this paper was only half the solution to their problem.

In 2004, Google published another paper, on the technique MapReduce, which was
the solution for processing those large datasets. This paper was the other half of the
solution for Doug Cutting and Mike Cafarella for their Nutch project. Both
techniques (GFS & MapReduce) existed only as published papers from Google; Google
did not release implementations of them. Doug Cutting knew from his work on Apache
Lucene (a free and open-source information retrieval software library, originally written
in Java by Doug Cutting in 1999) that open source is a great way to spread technology
to more people. So, together with Mike Cafarella, he started implementing Google's
techniques (GFS & MapReduce) as open source in the Apache Nutch project.

In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon
realized two problems:
(a) Nutch would not achieve its potential until it ran reliably on larger clusters,
(b) and that looked impossible with just two people (Doug Cutting & Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized. So he
started to look for a job with a company interested in investing in their efforts, and he
found Yahoo!. Yahoo had a large team of engineers that was eager to work on this
project.

So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide
the world with an open-source, reliable, scalable computing framework, with the help of
Yahoo. At Yahoo he first separated the distributed computing parts from Nutch
and formed a new project, Hadoop (he gave it the name Hadoop because it was the name
of a yellow toy elephant owned by Doug Cutting's son, and it was easy to pronounce and
a unique word). He wanted to make Hadoop work well on thousands of nodes. So, with
GFS and MapReduce as the blueprint, he started to work on Hadoop.

In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.

In January 2008, Hadoop became a top-level open source project at the Apache
Software Foundation (ASF). And in July 2008, the Apache Software Foundation
successfully tested a 4000-node cluster with Hadoop.

In 2009, Hadoop was successfully used to sort a petabyte (PB) of data in less than 17
hours, handling billions of searches and indexing millions of web pages.

By June 2011, Yahoo had 42,000 Hadoop nodes and hundreds of petabytes of storage.

In December 2011, the Apache Software Foundation released Apache Hadoop version
1.0.


Later, in August 2013, version 2.0.6 became available.

Currently, we have Apache Hadoop version 3.0, which was released in December
2017.

Scope And Limitations:

Issue with Small Files

Hadoop is not suited for small data. The Hadoop Distributed File System (HDFS) lacks
the ability to efficiently support random reads of small files because of its high-capacity
design.
Small files are a major problem in HDFS. A small file is one significantly smaller than
the HDFS block size (default 128 MB). If we store a huge number of small files, HDFS
cannot handle them well, because HDFS was designed to work with a small number of
large files for storing large data sets, rather than a large number of small files. If there
are too many small files, the NameNode becomes overloaded, since it stores the
namespace of HDFS.
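
One common way to ease the small-files problem (a workaround, not something covered
above) is to pack many small files into a single container file such as a Hadoop
SequenceFile, so the NameNode tracks one large file instead of thousands of tiny ones.
The sketch below is a minimal illustration of that idea using the standard SequenceFile
API; the local directory and the HDFS output path are made up for this example.

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs every file from a local directory into one SequenceFile on HDFS:
    // the file name becomes the key and the raw bytes become the value.
    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/demo/packed.seq");              // hypothetical output path

            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                for (File f : new File("/tmp/small-files").listFiles()) {   // hypothetical local directory
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }

A MapReduce job can then read the single packed file instead of opening each small file
individually, which keeps the NameNode's namespace small.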

Slow Processing Speed

In Hadoop, MapReduce processes large data sets with a parallel and distributed
algorithm. Two kinds of tasks need to be performed, Map and Reduce, and MapReduce
requires a lot of time to perform these tasks, thereby increasing latency. Data is
distributed and processed over the cluster in MapReduce, which increases the time and
reduces processing speed.

Latency
In Hadoop, the MapReduce framework is comparatively slower, since it is designed to
support different formats, structures and huge volumes of data. In MapReduce, Map
takes a set of data and converts it into another set of data, where individual elements
are broken down into key-value pairs, and Reduce takes the output from the Map as
input and processes it further; MapReduce requires a lot of time to perform these tasks,
thereby increasing latency.

Not Easy to Use


In Hadoop, MapReduce developers need to hand-code each and every operation,
which makes it very difficult to work with. MapReduce has no interactive mode, but
tools such as Hive and Pig make working with MapReduce a little easier for adopters.

Vulnerable by Nature
Hadoop is written almost entirely in Java, one of the most widely used languages. Java
has been heavily exploited by cybercriminals and, as a result, has been implicated in
numerous security breaches.

Uncertainty
Hadoop only ensures that a data job completes; it is unable to guarantee
when the job will complete.

Working And Architecture Of Hadoop.


The base Apache Hadoop framework is composed of the following modules:

 Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed file system that stores
data on commodity machines, providing very high aggregate bandwidth across the
cluster;
 Hadoop YARN – introduced in 2012, a platform responsible for managing
computing resources in clusters and using them for scheduling users' applications;
 Hadoop MapReduce – an implementation of the MapReduce programming model
for large-scale data processing.

Hadoop Framework:


The first is HDFS (Hadoop Distributed File System) for storage, which allows you
to store data of various formats across a cluster. The second is YARN, for
resource management in Hadoop; it allows parallel processing over the data
stored across HDFS.

HDFS:
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to
data across highly scalable Hadoop clusters.

HDFS supports the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with MapReduce, a programmatic framework for data processing.
When HDFS takes in data, it breaks the information down into separate blocks and
distributes them to different nodes in a cluster, thus enabling highly efficient parallel
processing.
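
To illustrate how an application interacts with HDFS, the sketch below uses the standard
org.apache.hadoop.fs.FileSystem API to write a small file and read it back. The NameNode
address and the file path are assumptions made for this example, not values taken from
this document.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal HDFS client: writes a small file, then reads it back.
    public class HdfsHelloWorld {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/hello.txt");       // hypothetical file path

            // Write: the client asks the NameNode where to place the blocks,
            // then streams the bytes to the chosen DataNodes.
            FSDataOutputStream out = fs.create(file, true);
            out.writeBytes("Hello HDFS\n");
            out.close();

            // Read: the client gets the block locations from the NameNode
            // and pulls the bytes directly from the DataNodes.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();

            fs.close();
        }
    }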

HDFS is used for storing the data and MapReduce is used for processing the data.
A Hadoop (1.x) cluster has five services as follows:
1. Name Node
2. Secondary Name Node
3. Job tracker
4. Data Node


5. Task Tracker

The top three are Master Services/Daemons/Nodes and the bottom two are Slave Services.
Master Services can communicate with each other, and in the same way Slave Services
can communicate with each other. The Name Node is a master node, the Data Node is its
corresponding slave node, and they can talk to each other.

Name Node: HDFS has only one Name Node, which we call the Master Node. It tracks
the files, manages the file system and holds the metadata of all the data stored in the
cluster (not the data itself). In particular, the Name Node contains details such as the
number of blocks, the locations of the Data Nodes where the data is stored, where the
replications are kept, and other details. As we have only one Name Node, we call it a
Single Point of Failure. It has a direct connection with the client.

Data Node: A Data Node stores data in the form of blocks. This is also known as the slave
node; it stores the actual data in HDFS and is responsible for serving client reads and
writes. These are slave daemons. Every Data Node sends a heartbeat message to the
Name Node every 3 seconds to convey that it is alive. When the Name Node does not
receive a heartbeat from a Data Node for a configured timeout, it considers that Data
Node dead and starts replicating its blocks on some other Data Node.
Secondary Name Node: This only takes care of the checkpoints of the file system
metadata held by the Name Node. It is also known as the Checkpoint Node. It is a
helper node for the Name Node.

Job Tracker: The Job Tracker is used in processing the data. It receives requests for
MapReduce execution from the client. The Job Tracker talks to the Name Node to learn
the location of the data: it requests the Name Node for the data to be processed, and
the Name Node in response gives the metadata to the Job Tracker.

Task Tracker: This is the slave node for the Job Tracker; it takes tasks and code from
the Job Tracker and applies that code to the file. The process of applying that code to
the file is known as the Mapper.
The Hadoop Distributed File System is specially designed to be highly fault-tolerant. The
file system replicates, or copies, each piece of data multiple times and distributes the
copies to individual nodes, placing at least one copy on a different server rack than the
others. As a result, the data on nodes that crash can be found elsewhere within a
cluster. This ensures that processing can continue while data is recovered.
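
For reference, the number of copies HDFS keeps is controlled by an ordinary configuration
property; a minimal hdfs-site.xml entry is sketched below (the default replication factor
is 3, and the value shown is only an example).

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value> <!-- number of copies kept for each block -->
      </property>
    </configuration>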


MapReduce

What Is MapReduce?
MapReduce is a programming model suitable for processing huge amounts of data.
Hadoop is capable of running MapReduce programs written in various languages: Java,
Ruby, Python, and C++. MapReduce programs are parallel in nature and are thus very
useful for performing large-scale data analysis using multiple machines in a cluster.

MapReduce programs work in two phases:

1. Map phase
2. Reduce phase.

Input Splits:

An input to a MapReduce job is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map task.


Mapping

This is the very first phase in the execution of a MapReduce program. In this
phase, the data in each split is passed to a mapping function to produce output
values. In our example, the job of the mapping phase is to count the number of
occurrences of each word in the input splits (input splits are described above)
and prepare a list in the form of <word, frequency>.

Shuffling

This phase consumes the output of the Mapping phase. Its task is to
consolidate the relevant records from the Mapping phase output. In our
example, the same words are grouped together along with their respective
frequencies.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This
phase combines values from the Shuffling phase and returns a single output
value. In short, this phase summarizes the complete dataset.
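
As a small worked illustration (the input line is invented for this example), suppose one
input split contains the line "deer bear river deer":

    Map output:       (deer, 1) (bear, 1) (river, 1) (deer, 1)
    After shuffling:  (bear, [1]) (deer, [1, 1]) (river, [1])
    Reduce output:    (bear, 1) (deer, 2) (river, 1)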

WordCount Program Of MapReduce.


WordCount.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
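
To compile and run the program, a typical sequence of commands looks like the following;
the jar name and the HDFS input/output paths are placeholders chosen for this example,
not paths used elsewhere in this document.

    javac -classpath $(hadoop classpath) WordCount.java
    jar cf wordcount.jar WordCount*.class
    hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output
    hadoop fs -cat /user/demo/output/part-00000

With the old mapred API used above, the reducer output appears in files named
part-00000, part-00001, and so on under the output directory.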

YARN
Hadoop YARN is the resource management and job scheduling technology
in the open source Hadoop distributed processing framework. One of
Apache Hadoop's core components, YARN is responsible for allocating
system resources to the various applications running in a Hadoop
cluster and scheduling tasks to be executed on different cluster nodes.

YARN stands for Yet Another Resource Negotiator, but it is commonly
referred to by the acronym alone. The technology became an Apache
Hadoop subproject within the Apache Software Foundation (ASF) in 2012
and was one of the key features added in Hadoop 2.0, which was released
for testing that year and became generally available in October 2013.

YARN performs all your processing activities by allocating resources and scheduling
tasks.

Figure: What is Hadoop – YARN

It has two major components: the ResourceManager and the NodeManager.

The ResourceManager is again a master node. It receives the processing requests and
then passes the parts of the requests to the corresponding NodeManagers, where the
actual processing takes place. NodeManagers are installed on every DataNode; each
NodeManager is responsible for the execution of tasks on its DataNode.
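
As a point of reference, directing MapReduce jobs to run on YARN is done through a
standard configuration property; a minimal mapred-site.xml entry is sketched below
(this shows the common setting only, not a complete cluster configuration).

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value> <!-- run MapReduce jobs on YARN rather than the local job runner -->
      </property>
    </configuration>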


Experiments On Hadoop
Experiment Settings

Figure 5 illustrates the experiment platform, deployed on three Dell servers and
one PC. The Hadoop cluster is composed of the three Dell servers, which serve as
one NameNode and two DataNodes. Deploying tens of DataNodes would not be very
meaningful, considering that the most common deployment locations of WSNs are
outdoors, in mobile status, or even in barren areas. Each Hadoop cluster node runs the
Linux OS (Debian 3.2.46 x86_64). Whether NameNode or DataNode, each Hadoop node
is equipped with two Intel Xeon 8-core 2.00 GHz processors, 64 GB of registered ECC
DDR memory and 3 TB of SATA-2 storage. The web application is deployed on the
Tomcat web engine running on a PC equipped with Linux OS (Ubuntu Server 3.5.0
x86_64), an Intel Core 2 Duo CPU at 3.0 GHz with 2 GB of registered DDR memory and
320 GB of SATA-2 storage. Java 1.6.0_37, Apache Tomcat 7.0.39, Hadoop 0.20.2, and
FFMPEG 1.0.6 are the other components used in the platform.

Figure 5 Experimental platform.

Several video data sets are used for the system performance evaluation. The
video data sets are generated by merging replications of the original file using
Format Factory. Each original file is 80 MB in size. Table 1 lists the parameters
for the original and transcoded video files.

Experimental Results and Analysis

This subsection focuses on discussing and analyzing the experiment results. The
transcoding time consumption is employed as the performance metric. In the
experiments, we vary the number of Mappers that do the transcoding work and the
duration and size of the video files. We carry out a series of experiments based on
these factors.


Figure 6 indicates the effect of the number of Mappers. For the files whose durations
are 5 h 18 m 34 s and 10 h 37 m 9 s, as the number of Mappers increases, the time
consumption of transcoding decreases at the beginning and reaches its lowest
point when the number is 9. It then increases noticeably as the number of
Mappers grows beyond 9. For the files whose durations are 21 h 14 m 18 s and
42 h 28 m 37 s, time consumption goes down as the number of Mappers increases.

Figure 6 Effect of number of Mappers.

In the Hadoop system, a Mapper is designated to process a split part of the video file
and is invoked by a DataNode. The more Mappers there are, the more transcoding
tasks are distributed to the DataNodes, which accelerates the transcoding process. On
the other hand, there are also queues in each DataNode keeping the pending
transcoding tasks waiting, which retards the transcoding process. There should be a
"Laffer curve", just as in economics.


The curve depicted in Figure 6 reveals that there exists an optimal value of the
number of Mappers for the Hadoop-based VTS, and that value is closely related to
the size of the files. The optimal value is 9 for the video files whose durations are 5 h
18 m 34 s and 10 h 37 m 9 s in the experiment settings of this paper, while for the
other two cases the optimal value should be around 12.

(2) Effect of Duration and Size. The second set of experiments tries to explore the
relationship of the transcoding time with the size and duration of video files. In
these experiments, we specify the Mappers’ number as 9. By fixing the duration
of two video files and changing their sizes, we investigate the transcoding time
consumption. Tables 4 and 5 indicate the system performance with respect to
various file sizes. The durations of two video files are 5 h 18 m 34 s (Table 4) and
10 h 37 m 09 s (Table 5), respectively.


Conclusion and Future Work

In this paper, we propose a Hadoop-based VTS integrating several key
components, including HDFS, MapReduce, FFMPEG, and Tomcat, with a
discussion of several significant parameters. Three prominent results are
obtained through the experiments: (1) there is clearly an optimal value of the
number of Mappers; (2) the optimal value is closely related to the file size; (3)
the time consumption of video transcoding depends principally on the duration of
the video files rather than on their sizes.

The inherent distribution property of the Hadoop system seems quite harmonious
with the decentralization attribute of next-generation WSNs. However, research
exploring their relationship and integrating them into a turn-key solution for many
practical problems is still at a preliminary stage. As one of the first papers
pioneering in this direction, this paper opens up several directions for future
work.

First of all, more experiments will be carried out not only in the area of standards
transcoding and spatial transcoding, but also in the field of bit-rate transcoding, to
meet the service requirements of the next-generation WSN, such as converting
the video resources for video broadcast or streaming.

Secondly, some efforts have to focus on WSN's network planning and routing
protocol optimization in order to integrate seamlessly with Hadoop system. The
normal Hadoop system generally operates in an indoor and machine-friendly
environment with wired connections. However, the WSN system always works in
outdoor locations with tough surroundings, such as severe interferences or
extremely high or low temperatures, and so forth. Accordingly, the communication
protocols among Hadoop nodes also have to be redesigned with the
consideration of WSN characteristics.

Besides, some theoretical research will be done to find the optimal value of the
number of Mappers. A mathematical model is going to be constructed, taking the
Hadoop cluster size, block size, video file size, and block replication factor into
account.

Advantages And Disadvantages

Advantages of Hadoop:

1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very large
data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional
relational database systems (RDBMS) that cannot scale to process large amounts of data,
Hadoop enables businesses to run applications on thousands of nodes involving many
thousands of terabytes of data.

2. Cost-effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The
problem with traditional relational database management systems is that they are extremely
cost-prohibitive to scale to the degree needed to process such massive volumes of data. In an
effort to reduce costs, many companies in the past would have had to down-sample data and
classify it based on certain assumptions as to which data was the most valuable. The raw data
would be deleted, as it would be too cost-prohibitive to keep. While this approach may have
worked in the short term, this meant that when business priorities changed, the complete raw
data set was not available, as it was too expensive to store.

3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights from data sources such as
social media and email conversations. Hadoop can be used for a wide variety of purposes, such
as log processing, recommendation systems, data warehousing, market campaign analysis
and fraud detection.

4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’
data wherever it is located on a cluster. The tools for data processing are often on the same
servers where the data is located, resulting in much faster data processing. If you’re dealing
with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of
data in just minutes, and petabytes in hours.


5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual
node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.

Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data.

1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example
can be seen in the Hadoop security model, which is disabled by default due to sheer
complexity. If whoever manages the platform lacks the know-how to enable it, your data
could be at huge risk. Hadoop is also missing encryption at the storage and network levels,
which is a major concern for government agencies and others that prefer to keep their
data under wraps.

2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and
as a result, implicated in numerous security breaches.

3. Not Fit for Small Data


While big data is not exclusively made for big businesses, not all big data platforms are
suited for small data needs. Unfortunately, Hadoop happens to be one of them. Due to its
high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently
support the random reading of small files. As a result, it is not recommended for
organizations with small quantities of data.

4. Joining Issues with Large Datasets
Join operations across very large datasets are difficult to express and slow with plain MapReduce.

Future Enhancement
Future of Hadoop:

As you can see, most organizations have already started using Hadoop for storing
and processing their large volumes of data, yet they are not able to fill all the required
positions, because there are not enough people with the required skills compared with
the demand for professionals with those skills.

Hadoop is mainly used by various companies because the amount of data being
generated nowadays is huge and is increasing day by day. The data generated every
second around the globe is already measured in terabytes or petabytes, and it will keep
increasing in the near future.

Big Data will keep increasing in size as more and more people use the digital
platform for most of their day-to-day activities, so the processing of this data needs
to be done more effectively to identify trends and make predictions about future growth
using the existing patterns in the data.

So companies will keep using Hadoop, as it is one of the best and most cost-effective
frameworks: it can support any volume of data and can scale to any number of servers
in a cluster with minimal cost.

The combination of a scalable, distributed filesystem (HDFS) with a flexible
processing engine (MapReduce) formed the main reason for Hadoop to gain interest in the
market.


So the demand for talent with Hadoop skills is very high now, and it will not fall in the
near future, as there is good scope for improvement in the Hadoop framework if more
and more people work on the technology.

Various companies have already invested a lot of effort in learning and implementing
Hadoop in their day-to-day work, so they will continue to use it for the coming years, as
the cost and effort involved in migrating to a new technology and finding people with the
required skill set is a tedious task that involves a lot of time and money. Most companies
will not be ready to invest in new technologies unless they are completely sure of the
acceptability and adaptability of the new technology.

Conclusion
 Hadoop is used by many companies.
 It is easy to install; you just need Linux workstations on a network.
 It takes care of hard problems: failover and an enormous distributed file system.
