Bigdata

MapReduce in Hadoop consists of two phases: the Map phase, which applies complex logic to data, and the Reduce phase, which performs lightweight processing like aggregation. Hadoop is an open-source framework designed for storing and processing large datasets efficiently and fault-tolerantly across commodity servers. The growing demand for Hadoop professionals highlights its importance in managing big data, with a robust ecosystem supporting various industries and applications.


MapReduce

In Hadoop, MapReduce works by breaking the data processing into two phases: the Map
phase and the Reduce phase. Map is the first phase of processing, where we specify
all the complex logic, business rules, and costly code. Reduce is the second phase of
processing, where we specify lightweight processing such as aggregation and summation.
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It's
very short, but it conceals a great deal of processing behind the scenes. At the highest
level, there are four independent entities:
1. The client, which submits the MapReduce job.
2. The jobtracker, which coordinates the job run. The jobtracker is a Java application
whose main class is JobTracker.
3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
are Java applications whose main class is TaskTracker.
4. The distributed file system, which is used for sharing job files between the other
entities.
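As a rough illustration of the JobClient.runJob(conf) call mentioned above, here is a minimal sketch of a job driver using the classic org.apache.hadoop.mapred API. The class name PassThroughJob and the choice of the identity mapper and reducer are assumptions made to keep the example self-contained, not something stated in the text.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver class: the client submits the job, and the
// jobtracker/tasktrackers run it behind the scenes.
public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughJob.class);
        conf.setJobName("pass-through");

        // Identity mapper/reducer simply copy records through, which keeps
        // this sketch self-contained; real jobs plug in their own classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The single line the text refers to: submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
}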
The MapReduce model is to break jobs into tasks and run the tasks in parallel so that
the overall job execution time is smaller than it would be if the tasks ran
sequentially. This makes job execution time sensitive to slow-running tasks, as it takes
only one slow task to make the whole job take significantly longer than it otherwise
would. When a job consists of hundreds or thousands of tasks, the possibility of a few
straggling tasks is very real. Tasks may be slow for various reasons, including
hardware degradation or software misconfiguration, but the causes may be hard to
detect since the tasks still complete successfully, albeit after a longer time than
expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to
detect when a task is running slower than expected and launches another, equivalent
task as a backup. This is termed speculative execution of tasks.

Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running
tasks. The overhead of starting a new JVM for each task is around a second, which is
insignificant for jobs that run for a minute or so. However, jobs that have a large number
of very short-lived tasks (these are usually map tasks), or that have lengthy initialization,
can see performance gains when the JVM is reused for subsequent tasks. With task
JVM reuse enabled, tasks do not run concurrently in a single JVM; the JVM runs
tasks sequentially. Tasktrackers can, however, run more than one task at a time, but
this is always done in separate JVMs.
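Both speculative execution and JVM reuse are controlled per job. The sketch below assumes the classic (pre-YARN) JobConf API described in this section; newer Hadoop releases expose the same ideas under different property names.

import org.apache.hadoop.mapred.JobConf;

public class SpeculationAndJvmReuse {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Turn speculative execution on or off for map and reduce tasks.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(false);

        // Reuse each task JVM for up to 10 tasks of the same job;
        // 1 (the default) disables reuse, -1 means "no limit".
        conf.setNumTasksToExecutePerJvm(10);
    }
}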
How MapReduce Works
• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• All data, whether structured or unstructured, needs to be translated into key-value
pairs before it is passed through the MapReduce model.
• Generally, the MapReduce paradigm is based on sending the computation to where the
data resides. As the name suggests, the MapReduce model has two different
functions: the Map function and the Reduce function. The order of operation is always
Map | Shuffle | Reduce.
• The reduce task takes the output from a map as an input and combines those data
tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
• Under the MapReduce model, the data processing primitives are called mappers
and reducers (a word-count sketch follows below).
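To make the mapper and reducer roles concrete, here is a sketch of the classic word-count example in the old org.apache.hadoop.mapred API: the map function turns each input line into (word, 1) tuples, and the reduce function aggregates them into a count per word. The class and method names are illustrative, not taken from the text.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Map phase: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, ONE);
                }
            }
        }
    }

    // Reduce phase: lightweight aggregation -- sum the 1s emitted for each word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            output.collect(word, new IntWritable(sum));
        }
    }
}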

What is Hadoop?

• Hadoop is an open-source, Java-based programming framework and server software
used to store and analyze data with the help of hundreds or even thousands of
commodity servers in a clustered environment.
• Hadoop is designed to store and process large datasets extremely fast and in a
fault-tolerant way.
• Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of
commodity computers. If any server goes down, Hadoop knows how to replicate the data,
so there is no loss of data even in the case of hardware failure.
• Hadoop is an Apache-sponsored project and consists of many software packages
that run on top of the Apache Hadoop system.
• Many of the top commercial Big Data analytics platforms are built on Hadoop.
• Hadoop provides a set of tools and software that form the backbone of a Big
Data analytics system.
Why Hadoop is Important?
1. Managing Big Data
As we are living in the digital era, there is a data explosion. Data is being generated
at very high speed and in very high volume, so there is an increasing need to manage
this Big Data. To manage this ever-increasing volume of data, we require Big Data
technologies like Hadoop. There is an increasing need for a solution that can handle
this much data, and in this scenario Hadoop comes to the rescue. With its robust
architecture and economical nature, it is the best fit for storing huge amounts of data.
2. Exponential Growth of Big Data Market
Slowly, companies are realizing the advantages Big Data can bring to their business. As
the market for Big Data grows, there will be a rising need for Big Data technologies.
Hadoop forms the base of many Big Data technologies, and newer technologies like
Apache Spark and Flink work well on top of Hadoop. As it is an in-demand Big Data
technology, there is a need to master Hadoop, and the increasing demand for Hadoop
professionals makes it a must-learn technology.
3. Lack of Hadoop Professionals
As we have seen, the Hadoop market is continuously growing and creating more job
opportunities every day. Most of these Hadoop job opportunities remain vacant due to a
lack of the required skills, so this is the right time to show your talent in Big Data by
mastering the technology before it is too late. Become a Hadoop expert and give a boost
to your career. This is where DataFlair plays an important role in making you a
Hadoop expert.
4. Hadoop for all
Professionals from various streams can easily learn Hadoop and master it to get
high-paying jobs. IT professionals can easily learn MapReduce programming in Java or
Python; those who know scripting can work with the Hadoop ecosystem component
named Pig, while Hive and Drill are easy for those comfortable with SQL-style queries.
5. Robust Hadoop Ecosystem
Hadoop has a very robust and rich ecosystem that serves a wide variety of
organizations. Organizations such as web start-ups, telecom companies, and financial
institutions need Hadoop to answer their business questions.
6. Research Tool
Hadoop has emerged as a powerful research tool. It allows an organization to find
answers to its business questions and helps in research and development work.
Companies use it to perform analysis, and they use this analysis to develop a rapport
with their customers.
7. Hadoop is Omnipresent
There is no industry that Big Data has not reached. Big Data covers almost all
domains, such as healthcare, retail, government, banking, media, transportation, and
natural resources. People are increasingly becoming data-aware, meaning they are
realizing the power of data. Hadoop is a framework that can harness this power of
data to improve the business.

8. Higher Salaries
In the current scenario, there is a gap between the demand for and supply of Big Data
professionals, and this gap is increasing every day. In the wake of this scarcity of
Hadoop professionals, organizations are ready to offer big packages for Hadoop skills.
There is always a compelling requirement for skilled people who can think from a
business point of view, understand data, and produce insights from that data. For this
reason, technical people with analytics skills find themselves in great demand.

9. A Maturing Technology
Hadoop is evolving with time. The new version of Hadoop, Hadoop 3.0, is coming
into the market. It has already seen collaboration with Hortonworks, Tableau, MapR, and
even BI experts, to name a few. New actors like Spark and Flink are coming onto the Big
Data stage. These technologies promise lightning-fast processing and also provide a
single platform for various kinds of workloads.

Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work across
the machines, in turn utilizing the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect
and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and Hadoop
continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java-based.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System was developed using a distributed file system
design. It runs on commodity hardware. Unlike other distributed systems, HDFS is
highly fault tolerant and is designed to use low-cost hardware. HDFS holds a very large
amount of data and provides easy access. To store such huge data, the files are stored
across multiple machines. These files are stored in a redundant fashion to rescue the
system from possible data loss in case of failure. HDFS also makes applications
available for parallel processing. The features of HDFS are:

• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS (a client-API sketch
follows this list).
• The built-in servers of the name node and data node help users to easily check the
status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.
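Besides the command-line interface, HDFS can be reached programmatically. The sketch below uses the standard org.apache.hadoop.fs.FileSystem client API to list a directory and print each entry's permissions and owner; the path /user/demo is only an example, not something named in the text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List an example directory and show the permissions HDFS enforces.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPermission() + "  "
                    + status.getOwner() + "  " + status.getPath());
        }
        fs.close();
    }
}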
• Node
. A computer.
. Typically non-enterprise, commodity hardware.
• Rack
. A collection of 30-40 nodes.
. Connected to the same network switch.
. Physically stored close together.
• Hadoop Cluster
. A collection of racks.
• HDFS follows the master-slave architecture and it has the following elements.
1. Name Node
The name node is the commodity hardware that contains the GNU/Linux operating
system and the name node software. The name node software can run on commodity
hardware. The system running the name node acts as the master server, and it does the
following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files
and directories.

2. Data Node
The data node is commodity hardware having the GNU/Linux operating system and the
data node software. For every node (commodity hardware/system) in a cluster, there
will be a data node. These nodes manage the data storage of their system. Data nodes
perform read-write operations on the file system, as per client requests. They also
perform operations such as block creation, deletion, and replication according to the
instructions of the name node.
3. Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments and/or stored in individual data nodes. These file segments
are called blocks. In other words, the minimum amount of data that HDFS can read or
write is called a block. The default block size is 128 MB, but it can be changed as
needed through the HDFS configuration.
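A file's block size and the placement of its blocks can be inspected from the same client API. This is a small sketch assuming a file already exists at the illustrative path /user/demo/input.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Each BlockLocation reports which data nodes hold a given block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + block.getOffset() + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}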
What is a surprise number?
Consider a data stream of elements drawn from a universal set. Let m_i be the number of
occurrences of the i-th element, for any i. Then the k-th moment of the stream is the sum
over all i of (m_i)^k. The 1st moment is the sum of the m_i's, which is the length of the
stream. The 2nd moment is the sum of the squares of the m_i's; it is also called the
"surprise number".
Suppose we have a stream of length 100 in which eleven different elements appear. The
most even distribution of these eleven elements would have one appearing 10 times and
the other ten appearing 9 times each. In this case, the surprise number is
10^2 + 10 × 9^2 = 100 + 810 = 910.
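A direct way to compute the second moment is to keep an exact count per element and sum the squares of the counts (streaming algorithms such as AMS only approximate this when the counts cannot be kept exactly; that is not shown here). A small sketch with illustrative data:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SurpriseNumber {
    // Second moment: sum over all distinct elements of (count of that element) squared.
    static long secondMoment(Iterable<String> stream) {
        Map<String, Long> counts = new HashMap<>();
        for (String element : stream) {
            counts.merge(element, 1L, Long::sum);
        }
        long moment = 0;
        for (long m : counts.values()) {
            moment += m * m;
        }
        return moment;
    }

    public static void main(String[] args) {
        // Tiny illustrative stream: "a" appears 3 times, "b" twice, "c" once.
        List<String> stream = Arrays.asList("a", "b", "a", "c", "b", "a");
        // Surprise number = 3^2 + 2^2 + 1^2 = 14.
        System.out.println(secondMoment(stream));
    }
}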

Define decaying windows.
This approach is used for finding the most popular elements in a stream and can be
considered an extension of the DGIM algorithm. The aim is to weight the recent elements
more heavily: each element's contribution is discounted by a small constant factor every
time a new element arrives, so old elements gradually fade out (a sketch follows below).
Typical applications include:
. Recording the popularity of items sold at Amazon.
. The rate at which different Twitter users tweet.
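One common way to realize a decaying window for the "most popular element" use case is to keep one score per element: when a new element arrives, every existing score is multiplied by (1 - c) for a small constant c, the arriving element's score has 1 added, and scores that drop below a small threshold are discarded. The sketch below assumes this standard exponentially decaying scheme; the class name and constants are illustrative.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class DecayingWindow {
    private final double c;          // decay constant, e.g. a very small value for long streams
    private final double threshold;  // scores below this are dropped
    private final Map<String, Double> scores = new HashMap<>();

    DecayingWindow(double c, double threshold) {
        this.c = c;
        this.threshold = threshold;
    }

    // Process one arriving stream element.
    void observe(String element) {
        // Decay every existing score and drop the ones that became negligible.
        Iterator<Map.Entry<String, Double>> it = scores.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Double> entry = it.next();
            double decayed = entry.getValue() * (1.0 - c);
            if (decayed < threshold) {
                it.remove();
            } else {
                entry.setValue(decayed);
            }
        }
        // The new element gets weight 1 added to its (possibly fresh) score.
        scores.merge(element, 1.0, Double::sum);
    }

    // Current most popular element: the one with the highest decayed score.
    String mostPopular() {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        DecayingWindow window = new DecayingWindow(0.1, 0.01);
        for (String item : new String[] {"book", "phone", "book", "book", "phone", "book"}) {
            window.observe(item);
        }
        System.out.println(window.mostPopular());  // "book" dominates the recent stream
    }
}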
