Big Data
In Hadoop, MapReduce works by breaking the data processing into two phases: the Map
phase and the Reduce phase. Map is the first phase of processing, where we specify
all the complex logic, business rules, and costly code. Reduce is the second phase of
processing, where we specify lightweight processing such as aggregation and summation.
You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It's
very short, but it conceals a great deal of processing behind the scenes. At the highest
level, there are four independent entities:
1. The client, which submits the MapReduce job.
2. The jobtracker, which coordinates the job run. The jobtracker is a Java application
whose main class is JobTracker.
3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
are Java applications whose main class is TaskTracker.
4. The distributed file system, which is used for sharing job files between the other
entities.
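For illustration, a minimal driver for the classic (org.apache.hadoop.mapred) API might look like the sketch below. The class name WordCountDriver and the command-line input/output paths are illustrative assumptions; the mapper and reducer are the old API's built-in TokenCountMapper and LongSumReducer library classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

// Illustrative client: configures a word-count job and submits it with the
// single JobClient.runJob(conf) call mentioned above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word count");

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Library mapper/reducer: tokenize each line, then sum counts per token.
        conf.setMapperClass(TokenCountMapper.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // The single call that submits the job and blocks until it completes.
        JobClient.runJob(conf);
    }
}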
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would otherwise be if the tasks ran
sequentially. This makes job execution time sensitive to slow-running tasks, as it takes
only one slow task to make the whole job take significantly longer than it would have
done otherwise. When a job consists of hundreds or thousands of tasks, the possibility
of a few straggling tasks is very real. Tasks may be slow for various reasons, including
hardware degradation or software mis-configuration, but the causes may be hard to
detect since the tasks still complete successfully, albeit after a longer time than
expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to
detect when a task is running slower than expected and launches another, equivalent
task as a backup. This is termed speculative execution of tasks.

Hadoop runs tasks in
their own Java Virtual Machine to isolate them from other running tasks. The
overhead of starting a new JVM for each task can take around a second, which for jobs
that run for a minute or so is insignificant. However, jobs that have a large number of
very short-lived tasks (these are usually map tasks), or that have lengthy initialization,
can see performance gains when the JVM is reused for subsequent tasks. With task
JVM reuse enabled, tasks do not run concurrently in a single JVM. The JVM runs
tasks sequentially. Tasktrackers can, however, run more than one task at a time, but
this is always done in separate JVMs.
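As a rough sketch, and assuming the classic (pre-YARN) JobConf API, speculative execution and JVM reuse can be controlled per job as shown below; the setter names are from the old API and should be checked against the Hadoop version in use.

import org.apache.hadoop.mapred.JobConf;

public class TaskTuning {
    // Sketch: tune speculative execution and JVM reuse on a classic JobConf.
    public static JobConf tune(JobConf conf) {
        // Turn speculative execution off for map and reduce tasks (it is on by default).
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);

        // Reuse the task JVM for an unlimited number of tasks from the same job
        // (-1 means no limit; the default of 1 means a fresh JVM per task).
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}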
How MapReduce Works
• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• All data, whether structured or unstructured, needs to be translated into key-value
pairs before it is passed through the MapReduce model.
• Generally, the MapReduce paradigm is based on sending the computation to where the
data resides. The MapReduce model, as the name suggests, has two different
functions: the Map function and the Reduce function. The order of operation is always
Map | Shuffle | Reduce.
• The reduce task takes the output from a map as an input and combines those data
tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
• Under the MapReduce model, the data processing primitives are called mappers
and reducers; a minimal mapper/reducer sketch follows this list.
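A minimal mapper and reducer sketch, written against the classic org.apache.hadoop.mapred API with assumed class names WordCountMapper and WordCountReducer, shows how each line of input is broken into (word, 1) tuples and how those tuples are combined into a smaller set of (word, count) tuples.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: turns each input line into (word, 1) key/value tuples.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // emit one tuple per word occurrence
        }
    }
}

// Reducer: combines all tuples for a word into a single (word, count) tuple.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}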
What is Hadoop?
9. Higher Salaries
In the current scenario, there is a gap between the demand for and supply of Big Data
professionals, and this gap is widening every day. In the wake of this scarcity of Hadoop
professionals, organizations are ready to offer big packages for Hadoop skills. There is
always a compelling requirement for skilled people who can think from a business point
of view. They are the people who understand data and can produce insights from that
data. For this reason, technical people with analytics skills find themselves in great demand.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work across
the machines and, in turn, utilizes the underlying parallelism of the CPU
cores.
Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect
and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System was developed using distributed file system design. It runs
on commodity hardware. Unlike other distributed systems, HDFS is highly fault
tolerant and designed using low-cost hardware. HDFS holds a very large amount of
data and provides easy access. To store such huge data, files are stored across
multiple machines. These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes applications
available for parallel processing. The features of HDFS are:
2. Data Node
The data node is commodity hardware with the GNU/Linux operating system
and data node software. For every node (commodity hardware/system) in a
cluster, there will be a data node. These nodes manage the data storage of their
system. Data nodes perform read-write operations on the file system, as per client
request. They also perform operations such as block creation, deletion, and
replication according to the instructions of the name node.
3. Block
Generally, the user data is stored in the files of HDFS. A file in the file system
is divided into one or more segments and/or stored in individual data nodes.
These file segments are called blocks. In other words, the minimum amount of
data that HDFS can read or write is called a block. The default block size is 128
MB, but it can be changed as needed in the HDFS configuration.
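As a small sketch of how a client writes data into HDFS, the example below assumes a Hadoop 2.x setup, the dfs.blocksize property name, and an illustrative path /user/demo/sample.txt; HDFS splits the written file into blocks and replicates them across data nodes behind the scenes.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: writing a small file through the HDFS client API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property; 128 MB is the default block size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);          // connect to the default file system
        Path path = new Path("/user/demo/sample.txt"); // illustrative path
        try (OutputStream out = fs.create(path)) {
            out.write("hello hdfs".getBytes("UTF-8"));
        }
        fs.close();
    }
}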
What is the surprise number?
Consider a data stream of elements drawn from a universal set. Let m_i be the number of
occurrences of the i-th element, for any i. Then the k-th moment of the stream is the
sum over all i of (m_i)^k. The 1st moment is the sum of the m_i's; that is, the length
of the stream. The 2nd moment is the sum of the squares of the m_i's. It is also called
the "surprise number".
Suppose we have a stream of length 100 in which eleven different elements
appear. The most even distribution of these eleven elements would have one
appearing 10 times and the other ten appearing 9 times each. In this case, the
surprise number is 10^2 + 10 × 9^2 = 100 + 810 = 910.
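The same calculation is easy to express in code; the short in-memory "stream" below is purely illustrative.

import java.util.HashMap;
import java.util.Map;

// Sketch: computing the 1st and 2nd moments of a small in-memory "stream".
// The 1st moment is the stream length; the 2nd moment is the surprise number.
public class SurpriseNumber {
    public static void main(String[] args) {
        String[] stream = {"a", "b", "a", "c", "a", "b"};

        // Count the occurrences m_i of each distinct element i.
        Map<String, Long> counts = new HashMap<>();
        for (String x : stream) {
            counts.merge(x, 1L, Long::sum);
        }

        long firstMoment = 0;   // sum of the m_i   = length of the stream
        long secondMoment = 0;  // sum of the m_i^2 = surprise number
        for (long m : counts.values()) {
            firstMoment += m;
            secondMoment += m * m;
        }
        System.out.println("1st moment = " + firstMoment);  // 6
        System.out.println("2nd moment = " + secondMoment); // 3^2 + 2^2 + 1^2 = 14
    }
}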