CH 2

Data teams - composed of members who fit into 3 categories

-data engineers - responsible for finding connections or relations within the data;
they correct and clean the data.
-data modelers - focus on model generation; make use of machine learning.
-subject matter experts - provide feedback on the model so it can be revised.

Hadoop - OS for big data

2 components:
HDFS - Hadoop Distributed File System
-interacts with the storage on all computers in the distributed network

YARN - Yet Another Resource Negotiator
-runs algorithms that dictate which node is the coordinator

MapReduce - technically an application, though sometimes considered another
component; runs batches in the background.

Architecture

-Applications that carry out specific tasks - Hive (SQL), Storm (streaming),
MapReduce (batch), Spark (real-time processing)
-YARN dictates coordinators, distributes tasks between computers
-HDFS runs on all the computers, handling storage across the computers
-cluster of interconnected computers, each with their own memory and disk (Cluster
of Nodes)

Distributed System Requirements


-Fault tolerance - if a component fails, it should not result in failure of entire
system
-Recovery - in the event of failure, no data should be lost
-Consistency - failure of one job should not affect final result
-Scalability - able to add larger load (more data, more computation) without
decline in performance

How Hadoop enforces these requirements

- Data is distributed immediately by YARN when added to the cluster and stored on
multiple nodes. Nodes process data that is stored locally in order to minimize
traffic across the network.
- Data is stored in blocks of fixed size (usually 128 MB) and each block is
duplicated multiple times across the system to provide redundancy and data safety.
- A computation is referred to as a job. Jobs are broken into tasks, where each
individual node performs the task on a single block of data (see the sketch after
this list).
- Jobs are written at a high level (abstract) without concern for network
programming, timing, or low-level infrastructure on each individual node (HDFS and
YARN handle this), allowing devs to focus on data and computation rather than
distributed programming details.
- Reading and writing during computation is kept to a minimum to speed up
processing.
- The amount of network traffic between nodes should be minimized transparently by
the system. Each task should be independent, and nodes should not have to
communicate with each other during processing, to ensure that there are no
interprocess dependencies that could lead to deadlock.
- Jobs are fault tolerant, usually through task redundancy, such that if a single
task fails, the final computation is not incorrect or incomplete.
- Master programs allocate work to worker nodes such that many worker nodes can
operate in parallel, each on their own portion of the larger dataset.
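A minimal Python sketch of the per-block task idea above (toy data on one machine,
not the Hadoop API):

from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the fixed-size blocks of one file spread across nodes.
blocks = [["a", "b"], ["c"], ["d", "e", "f"]]

def task(block):
    # Each task works only on its own local block of data.
    return len(block)

# Run the tasks in parallel; on a real cluster, each task runs on the node
# that stores its block locally, minimizing network traffic.
with ThreadPoolExecutor() as pool:
    per_block_counts = list(pool.map(task, blocks))

print(sum(per_block_counts))   # 6 records across the whole "file"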

HADOOP CLUSTERS

HDFS and YARN expose an application programming interface (API) that abstracts
developers from low-level cluster administration details.
A set of machines that is running HDFS and YARN is known as a cluster, and the
individual machines are called nodes.
A cluster can have a single node, or many thousands of nodes, but all clusters
scale horizontally, meaning as you add more nodes,
the cluster increases in both capacity and performance in a linear fashion.

daemon processes - background processes of HDFS and YARN that make any computation
over big data possible. The master daemons allocate work to worker nodes, allowing
them to work in parallel, each on their own portion of the larger dataset.
MapReduce runs on top of these daemon processes, executing its work as batches in
the background.

YARN and HDFS are implemented by several daemon processes—that is, software that
runs in the background and does not require user input.
Hadoop processes are services, meaning they run all the time on a cluster node and
accept input and deliver output through the network, similar to how an HTTP server
works.
Each of these processes runs inside of its own Java Virtual Machine (JVM) so each
daemon has its own system resource allocation and is managed independently by the
operating system.
Each node in the cluster is identified by the type of process or processes that it
runs:

Master nodes
These nodes run coordinating services for Hadoop workers and are usually the entry
points for user access to the cluster.
Without masters, coordination would fall apart, and distributed storage or
computation would not be possible.

Worker nodes
These nodes are the majority of the computers in the cluster. Worker nodes run
services that accept tasks from master nodes
—either to store or retrieve data or to run a particular application.
A distributed computation is run by parallelizing the analysis across worker nodes.

HDFS and YARN work in concert to minimize the amount of network traffic in the
cluster primarily by ensuring that data is local to the required computation.
Duplication of both data and tasks ensures fault tolerance, recoverability, and
consistency.
Moreover, the cluster is centrally managed to provide scalability and to abstract
low-level clustering programming details.
Together, HDFS and YARN are a platform upon which big data applications are built;
perhaps more than just a platform, they provide an operating system for big data.
HDFS
NameNode (Master)
-Stores the directory tree of the file system, file metadata, and the locations of
each file in the cluster.
-Clients wanting to access HDFS must first locate the appropriate storage nodes by
requesting information from the NameNode.
-The master NameNode keeps track of what blocks make up a file and where those
blocks are located.
-The NameNode communicates with the DataNodes, the processes that actually hold the
blocks in the cluster.
-Metadata associated with each file is stored in the memory of the NameNode master
for quick lookups, and if the NameNode stops or fails,
the entire cluster will become inaccessible!

-When a client application wants access to read a file, it first requests the
metadata from the NameNode to locate the blocks that make up the file,
as well as the locations of the DataNodes that store the blocks. The application
then communicates directly with the DataNodes to read the data.
Therefore, the NameNode simply acts like a journal or a lookup table and is not a
bottleneck to simultaneous reads.
-HDFS manages info about each chunk so that they are retrieved in order via file
metadata.
-NameNode does not process or store the data.

-To upload a file, the client first contacts the NameNode. The NameNode determines
where the file will be stored; the file is divided into chunks, and those chunks
are written to the DataNodes. Info about the chunks of a file is stored through
Hash3.
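A conceptual sketch of that read flow in Python; the objects and method names
(get_block_locations, read_block) are hypothetical, not the real HDFS client API:

def read_file(path, namenode, datanodes):
    # 1. Ask the NameNode for metadata only: which blocks make up the file
    #    and which DataNodes hold each block (hypothetical call).
    block_locations = namenode.get_block_locations(path)

    # 2. Read each block directly from a DataNode that stores it; the
    #    NameNode never touches the actual data (hypothetical call).
    data = b""
    for block_id, nodes in block_locations:
        data += datanodes[nodes[0]].read_block(block_id)
    return data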

Secondary NameNode (Master)

-Performs housekeeping tasks and checkpointing on behalf of the NameNode.
-Despite its name, it is not a backup NameNode. Instead, it performs housekeeping
tasks on behalf of the NameNode, including (and especially) periodically merging a
snapshot of the current data space with the edit log to ensure that the edit log
doesn’t get too large.
-The edit log is used to ensure data consistency and prevent data loss; if the
NameNode fails, this merged record can be used to reconstruct the state of the
DataNodes.

DataNode (Worker)
-Stores and manages HDFS blocks on the local disk.
-Reports health and status of individual data stores back to the NameNode.

-HDFS files are split into blocks, usually of either 64 MB or 128 MB, although this
is configurable at runtime and high-performance systems
typically select block sizes of 256 MB. The block size is the minimum amount of
data that can be read or written to in HDFS.
However, unlike blocks on a single disk, files that are smaller than the block size
do not occupy the full blocks’ worth of space on the actual file system.
Additionally, blocks will be replicated across the DataNodes. By default, the
replication is three-fold, but this is also configurable at runtime.
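Rough arithmetic for these defaults, sketched in Python (assuming a 1 GB file,
128 MB blocks, and three-fold replication):

file_size_mb = 1024          # a 1 GB file
block_size_mb = 128          # block size, configurable at runtime
replication = 3              # default replication factor, also configurable

num_blocks = -(-file_size_mb // block_size_mb)   # ceiling division -> 8 blocks
raw_storage_mb = file_size_mb * replication      # 3072 MB of cluster disk in total

print(num_blocks, raw_storage_mb)

A file that is not an exact multiple of the block size only occupies the actual
size of its final partial block, as noted above.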

HDFS performs best with a moderate to large number of very large files.
Storage pattern - HDFS implements WORM (Write Once, Read Many) - no random writes
or appends to files.
HDFS is optimized for large streaming reads of files, not random reads or
selection.
What kinds of applications work well with HDFS?
It is not a good fit as a data backend for applications that require updates in
real time, interactive data analysis, or record-based transactional support.
It does not work well with transactional applications, as they require real-time
updates with each transaction to maintain consistency and integrity.
Instead, by writing data only once and reading many times, HDFS users tend to
create large stores of heterogeneous data to aid in a variety of different
computations and analytics.
These stores are sometimes called “data lakes” because they simply hold all data
about a known problem in a recoverable and fault-tolerant manner.

The HDFS file system is similar to UNIX/Linux, not Windows.

YARN Nodes
ResourceManager (Master)
-Allocates and monitors available cluster resources (e.g., physical assets like
memory and processor cores) to applications as well as
handling scheduling of jobs on the cluster.

ApplicationMaster (Master)
-Coordinates a particular application being run on the cluster as scheduled by the
ResourceManager.

NodeManager (Worker)
-Runs and manages processing tasks on an individual node as well as reports the
health and status of tasks as they’re running.

-clients that wish to execute a job must first request resources from the
ResourceManager, which assigns an application-specific ApplicationMaster for the
duration of the job.
The ApplicationMaster tracks the execution of the job, while the ResourceManager
tracks the status of the nodes,
and each individual NodeManager creates containers and executes tasks within them.
Note that there may be other processes running on the Hadoop cluster as well—for
example,
JobHistory servers or ZooKeeper coordinators, but these services are the primary
software running in a Hadoop cluster.
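A conceptual sketch of that job flow in Python; the objects and method names here
are hypothetical, not the real YARN client API:

def run_job(client_job, resource_manager, node_managers):
    # 1. The client requests resources; the ResourceManager assigns an
    #    ApplicationMaster for the duration of this job (hypothetical call).
    app_master = resource_manager.allocate_application_master(client_job)

    # 2. The ApplicationMaster asks for containers; each NodeManager creates
    #    containers and executes tasks within them (hypothetical calls).
    for task in client_job.tasks:
        node = app_master.request_container(node_managers)
        node.launch_container(task)

    # 3. The ApplicationMaster tracks the job's execution, while the
    #    ResourceManager keeps tracking the status of the nodes.
    return app_master.wait_for_completion()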

The moment the ApplicationMaster ceases to receive a live signal (heartbeat) from a
worker node, it contacts the ResourceManager. This ensures fault tolerance, as a
backup worker node can now take over.

Master processes are so important that they usually are run on their own node so
they don’t compete for resources and present a bottleneck.
However, in smaller clusters, the master daemons may all run on a single node.

A single-node cluster is possible.

MAPREDUCE
MapReduce is a simple but very powerful computational framework specifically
designed to enable fault-tolerant distributed computation across
a cluster of centrally managed machines. It does this by employing a “functional”
programming style that is inherently parallelizable—by allowing multiple
independent tasks to execute a function on local chunks of data and aggregating the
results after processing.

Functional programming is a style of programming that ensures unit computations are
evaluated in a stateless manner.
This means functions depend only on their inputs, and they are closed and do not
share state.
Data is transferred between functions by sending the output of one function as the
input to another, wholly independent function.
These traits make functional programming a great fit for distributed, big data
computational systems,
because it allows us to move the computation to any node that has the data input
and guarantee that we will still get the same result.
Because functions are stateless and depend solely on their input, many functions on
many machines can work independently on smaller chunks of the dataset.
By strategically chaining the outputs of functions to the inputs of other
functions, we can guarantee that we will reach a final computation across the
entire dataset.

Multiple tasks are carried out across nodes in a cluster with no need to share
state; i.e., the task carried out by one node depends on other nodes only in terms
of the sequence (its input comes from their output).

In-class example: mapping the 32 input numbers across a group/cluster of 16
nodes/participants, then reducing the 32 numbers to their final sum (sketched in
code below).

Map applies a function to many things independently, before sending off its many
results to the reducer as a sequence.
Reduce has two arguments: a sequence, and a function that it uses to operate over
the sequence and reduce it to a single thing.
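The in-class example sketched with Python's built-in map and reduce (plain Python
on one machine, standing in for 16 nodes that each sum one pair of numbers):

from functools import reduce

numbers = list(range(1, 33))                      # the 32 input numbers

# Map: apply a function to many things independently - each of the 16
# "nodes/participants" sums one pair of numbers.
pairs = [numbers[i:i + 2] for i in range(0, len(numbers), 2)]
partial_sums = list(map(sum, pairs))              # 16 independent partial results

# Reduce: takes a sequence and a function and collapses the sequence
# into a single thing - here, the final sum.
total = reduce(lambda a, b: a + b, partial_sums)
print(total)                                      # 528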

MapReduce is stateless: variables are given as input and carried forward via
output. As such, any node that has the data input can complete the computation
and get the same result. By chaining outputs of functions to the inputs of other
functions, we can guarantee that we will reach a final computation across the
entire dataset.

The MapReduce model consists of two main phases: the Map phase and the Reduce
phase; the full flow is sketched in code after the phase descriptions below.
Map Phase:
-In this phase, the input data is divided into smaller chunks, and a function
called the "mapper" is applied to each chunk independently.
-The mapper function takes the input data and transforms it into a set of key-value
pairs, where the key represents
some attribute of the data and the value represents the processed data itself.
-Each mapper operates independently and processes its chunk of data in parallel
across multiple machines in a distributed environment.

Shuffle and Sort:
-After the map phase, the intermediate key-value pairs generated by the mappers are
shuffled and sorted based on their keys.
-This ensures that all values associated with the same key are grouped together.
-The shuffle and sort phase is crucial for preparing the data for the reduce phase
by ensuring that all values with the same key are sent to the same reducer.

Reduce Phase:
-In this phase, the intermediate key-value pairs generated by the mappers are input
to a function called the "reducer."
-The reducer function takes the key and the corresponding list of values (grouped
by the key) and performs
some aggregation or computation on those values to produce the final output.
-Like the mapper phase, the reduce phase operates independently and in parallel
across multiple machines.
-The output of the reducer is typically written to a file or some other storage
system.
-The reducer is intended to aggregate the many values that are output from the map
phase in order
to transform a large volume of data into a smaller, more manageable set of summary
data, but has many other uses as well.
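A word-count sketch of the three phases in plain Python (in real Hadoop the
framework itself performs the shuffle and sort between mappers and reducers):

from collections import defaultdict

def mapper(chunk):
    # Map phase: emit (key, value) pairs - here, (word, 1) for every word.
    for word in chunk.split():
        yield (word, 1)

def shuffle_and_sort(mapped_pairs):
    # Shuffle and sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: aggregate the grouped values into one summary value.
    return (key, sum(values))

chunks = ["big data big cluster", "data node data"]
mapped = [pair for chunk in chunks for pair in mapper(chunk)]
results = [reducer(k, v) for k, v in shuffle_and_sort(mapped)]
print(results)   # [('big', 2), ('cluster', 1), ('data', 3), ('node', 1)]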

However, more complex algorithms and analyses cannot be distilled to a single
MapReduce job.
For example, many machine learning or predictive analysis techniques require
optimization, an iterative process where error is minimized.
MapReduce does not support native iteration through a single map or reduce.

In fact, the use of multiple MapReduce jobs to perform a single computation is how
more complex applications are constructed, through a process called “job chaining.”
By creating data flows through a system of intermediate MapReduce jobs, we can
create a pipeline of analytical steps that lead us to our end result.
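A sketch of job chaining in plain Python: the output of the first "job" becomes the
input of the second (ordinary functions here, not actual Hadoop jobs):

from collections import Counter

def job_one(lines):
    # First job: count words across all input lines.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def job_two(counts, threshold=2):
    # Second job: consume job_one's output and keep only frequent words.
    return {word: n for word, n in counts.items() if n >= threshold}

intermediate = job_one(["a b a", "b c b"])
final = job_two(intermediate)
print(final)   # {'a': 2, 'b': 3}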

The MapReduce API is written in Java, and therefore MapReduce jobs submitted to the
cluster are going to be compiled Java Archive (JAR) files.
Hadoop will transmit the JAR files across the network to each node that will run a
task (either a mapper or reducer) and the individual tasks of the MapReduce job are
executed.
