CH 2
-Data engineers - responsible for finding connections or relations within the data
and for correcting and cleaning the data.
-Data modelers - focus on model generation; make use of machine learning.
-Subject matter experts - provide feedback on the model so it can be revised.
2 components
HDFS - Hadoop Distributed File System
-interacts with the storage components on all computers in the distributed network.
YARN - Yet Another Resource Negotiator
-interacts with the compute resources (memory and processor cores) on all computers
in the distributed network.
Architecture
- Data is distributed immediately when it is added to the cluster and stored on
multiple nodes.
- Nodes process data that is stored locally in order to minimize traffic across the
network.
- Data is stored in blocks of a fixed size (usually 128 MB) and each block is
duplicated multiple times across the system to provide redundancy and data safety.
- A computation is referred to as a job. Jobs are broken into tasks, where each
individual node performs the task on a single block of data (a toy sketch of blocks
and tasks follows right after this list).
- Jobs are written at a high level of abstraction, without concern for network
programming, timing, or the low-level infrastructure of each individual node (HDFS
and YARN handle this), allowing developers to focus on the data and the computation
rather than on distributed programming details.
- Reading from and writing to disk during computation is kept to a minimum to speed
up processing.
- The amount of network traffic between nodes should be minimized transparently by
the system. Each task should be independent, and nodes should not have to
communicate with each other during processing, to ensure that there are no
interprocess dependencies that could lead to deadlock.
- Jobs are fault tolerant usually through task redundancy, such that if a single
task fails, the final computation is not incorrect or incomplete
- Master programs allocate work to worker nodes such that many worker nodes can
operate in parallel, each on their own portion of the larger dataset.
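A minimal, pure-Python sketch of the blocks-and-tasks idea above (all names here are
illustrative; this is not a Hadoop API):

```python
# Illustrative only: simulate splitting a dataset into fixed-size blocks and
# running one independent task per block, then combining the partial results.

BLOCK_SIZE = 8  # stand-in for the "128 MB" block size, measured in items

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide the dataset into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def task(block):
    """An independent task: operates only on its own (local) block."""
    return sum(block)

if __name__ == "__main__":
    dataset = list(range(100))            # the "large" dataset
    blocks = split_into_blocks(dataset)   # blocks would live on different nodes
    partials = [task(b) for b in blocks]  # each node works on its local block
    print("job result:", sum(partials))   # 4950, same as sum(dataset)
```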
HADOOP CLUSTERS
HDFS and YARN expose an application programming interface (API) that abstracts
developers from low-level cluster administration details.
A set of machines that is running HDFS and YARN is known as a cluster, and the
individual machines are called nodes.
A cluster can have a single node, or many thousands of nodes, but all clusters
scale horizontally, meaning as you add more nodes,
the cluster increases in both capacity and performance in a linear fashion.
Daemon processes - background processes of HDFS and YARN that make any computation
on big data successful! They include the master programs that allocate work to
worker nodes, allowing them to work in parallel, each on its own portion of the
larger dataset.
MapReduce computations, by contrast, run in parallel as batch jobs on top of these
daemons.
YARN and HDFS are implemented by several daemon processes—that is, software that
runs in the background and does not require user input.
Hadoop processes are services, meaning they run all the time on a cluster node and
accept input and deliver output through the network, similar to how an HTTP server
works.
Each of these processes runs inside of its own Java Virtual Machine (JVM) so each
daemon has its own system resource allocation and is managed independently by the
operating system.
Each node in the cluster is identified by the type of process or processes that it
runs:
Master nodes
These nodes run coordinating services for Hadoop workers and are usually the entry
points for user access to the cluster.
Without masters, coordination would fall apart, and distributed storage or
computation would not be possible.
Worker nodes
These nodes are the majority of the computers in the cluster. Worker nodes run
services that accept tasks from master nodes
—either to store or retrieve data or to run a particular application.
A distributed computation is run by parallelizing the analysis across worker nodes.
HDFS and YARN work in concert to minimize the amount of network traffic in the
cluster primarily by ensuring that data is local to the required computation.
Duplication of both data and tasks ensures fault tolerance, recoverability, and
consistency.
Moreover, the cluster is centrally managed to provide scalability and to abstract
low-level clustering programming details.
Together, HDFS and YARN are a platform upon which big data applications are built;
perhaps more than just a platform, they provide an operating system for big data.
HDFS
NameNode (Master)
-Stores the directory tree of the file system, file metadata, and the locations of
each file in the cluster.
-Clients wanting to access HDFS must first locate the appropriate storage nodes by
requesting information from the NameNode.
-The master NameNode keeps track of what blocks make up a file and where those
blocks are located.
-The NameNode communicates with the DataNodes, the processes that actually hold the
blocks in the cluster.
-Metadata associated with each file is stored in the memory of the NameNode master
for quick lookups, and if the NameNode stops or fails,
the entire cluster will become inaccessible!
-When a client application wants access to read a file, it first requests the
metadata from the NameNode to locate the blocks that make up the file,
as well as the locations of the DataNodes that store the blocks. The application
then communicates directly with the DataNodes to read the data.
Therefore, the NameNode simply acts like a journal or a lookup table and is not a
bottleneck to simultaneous reads.
-HDFS manages metadata about each block so that the blocks of a file can be
retrieved in order.
-NameNode does not process or store the data itself.
-To upload a file, a client first asks the NameNode where the blocks should go. The
client then divides the file into blocks and writes those blocks directly to the
DataNodes the NameNode assigned (see the toy sketch below).
Info about chunks of a file is stored through Hash3
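The toy sketch below mimics the read path described above in plain Python (none of
these classes are real Hadoop APIs): the NameNode answers only metadata lookups, and
the client pulls the blocks directly from the DataNodes, in order.

```python
# Toy simulation of the HDFS read path: the NameNode holds only metadata,
# while the DataNodes hold the actual block contents.

class NameNode:
    def __init__(self):
        self.metadata = {}  # file name -> ordered list of (block_id, datanode_id)

    def lookup(self, filename):
        """Return block locations for a file (metadata only, no data)."""
        return self.metadata[filename]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

# Build a tiny "cluster" by hand.
namenode = NameNode()
datanodes = {"dn1": DataNode(), "dn2": DataNode()}
datanodes["dn1"].blocks["blk_0"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"world"
namenode.metadata["greeting.txt"] = [("blk_0", "dn1"), ("blk_1", "dn2")]

# Client read: ask the NameNode where the blocks live, then read them
# directly from the DataNodes, in order.
locations = namenode.lookup("greeting.txt")
data = b"".join(datanodes[dn].read_block(blk) for blk, dn in locations)
print(data.decode())  # -> "hello world"
```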
DataNode (Worker)
-Stores and manages HDFS blocks on the local disk.
-Reports health and status of individual data stores back to the NameNode.
-HDFS files are split into blocks, usually of either 64 MB or 128 MB, although this
is configurable at runtime and high-performance systems
typically select block sizes of 256 MB. The block size is the minimum amount of
data that can be read or written to in HDFS.
However, unlike blocks on a single disk, files that are smaller than the block size
do not occupy a full block's worth of space on the actual file system.
Additionally, blocks will be replicated across the DataNodes. By default, the
replication is three-fold, but this is also configurable at runtime.
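As a quick back-of-the-envelope check of those numbers (the 1 GB file is just an
example):

```python
import math

file_size_mb = 1024   # example: a 1 GB file
block_size_mb = 128   # a common HDFS block size
replication = 3       # HDFS default replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)   # 8 blocks
raw_storage_mb = file_size_mb * replication            # 3072 MB across the cluster

print(num_blocks, raw_storage_mb)
```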
HDFS performs best with a moderate number of very large files; millions of small
files put pressure on the NameNode, which must hold all file metadata in memory.
Storage Pattern - HDFS implements WORM (Write Once, Read Many) - No random writes
or appends to files
HDFS is optimized for large, streaming reads of files, not for random reads or
selective lookups.
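As a sketch of what a large streaming read looks like from client code, assuming the
third-party Python package `hdfs` (HdfsCLI) is installed and that WebHDFS is
reachable at the hypothetical address and path below:

```python
# Sketch only: the package, URL, user, and file path are assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="analyst")

# Stream the file in 1 MB chunks instead of loading it all into memory,
# which matches HDFS's strength: large sequential reads.
line_count = 0
with client.read("/data/logs/access.log", chunk_size=1024 * 1024) as reader:
    for chunk in reader:
        line_count += chunk.count(b"\n")

print("lines:", line_count)
```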
What kinds of applications work well with HDFS?
It is not a good fit as a data backend for applications that require updates in
real-time, interactive data analysis, or record-based transactional support.
It does not work well with transactional applications, as they require real-time
updates with each transaction to maintain consistency and integrity.
Instead, by writing data only once and reading many times, HDFS users tend to
create large stores of heterogeneous data to aid in a variety of different
computations and analytics.
These stores are sometimes called “data lakes” because they simply hold all data
about a known problem in a recoverable and fault-tolerant manner.
YARN Nodes
ResourceManager (Master)
-Allocates and monitors available cluster resources (e.g., physical assets like
memory and processor cores) to applications as well as
handling scheduling of jobs on the cluster.
ApplicationMaster (Master)
-Coordinates a particular application being run on the cluster as scheduled by the
ResourceManager.
NodeManager (Worker)
-Runs and manages processing tasks on an individual node as well as reports the
health and status of tasks as they’re running.
-clients that wish to execute a job must first request resources from the
ResourceManager, which assigns an application-specific ApplicationMaster for the
duration of the job.
The ApplicationMaster tracks the execution of the job, while the ResourceManager
tracks the status of the nodes,
and each individual NodeManager creates containers and executes tasks within them.
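A toy walkthrough of that flow in plain Python (the classes below are illustrative
stand-ins, not YARN APIs):

```python
# Toy simulation of the YARN submission flow: client -> ResourceManager ->
# ApplicationMaster -> NodeManagers running tasks in containers.

class NodeManager:
    def run_container(self, task):
        """Execute one task inside a "container" on this worker node."""
        return task()

class ApplicationMaster:
    """Coordinates a single application: hands tasks to NodeManagers."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def run_job(self, tasks):
        return [self.node_managers[i % len(self.node_managers)].run_container(t)
                for i, t in enumerate(tasks)]

class ResourceManager:
    """Allocates cluster resources and starts one ApplicationMaster per job."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, tasks):
        app_master = ApplicationMaster(self.node_managers)
        return app_master.run_job(tasks)

rm = ResourceManager([NodeManager(), NodeManager(), NodeManager()])
tasks = [lambda i=i: i * i for i in range(6)]  # six independent tasks
print(rm.submit(tasks))                        # [0, 1, 4, 9, 16, 25]
```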
Note that there may be other processes running on the Hadoop cluster as well—for
example,
JobHistory servers or ZooKeeper coordinators, but these services are the primary
software running in a Hadoop cluster.
The moment the ApplicationMaster stops receiving a liveness signal from a worker
node, it contacts the ResourceManager. This ensures fault tolerance, as another
worker node can take over the failed work.
Master processes are so important that they usually are run on their own node so
they don’t compete for resources and present a bottleneck.
However, in smaller clusters, the master daemons may all run on a single node.
MAPREDUCE
MapReduce is a simple but very powerful computational framework specifically
designed to enable fault-tolerant distributed computation across
a cluster of centrally managed machines. It does this by employing a “functional”
programming style that is inherently parallelizable—by allowing multiple
independent tasks to execute a function on local chunks of data and aggregating the
results after processing.
Multiple tasks are carried out across the nodes in a cluster with no need to share
state; i.e., no task carried out by one node depends on any other node, other than
through the sequence of outputs passed from map to reduce.
Map applies a function to many items independently, before sending its many results
off to the reducer as a sequence.
Reduce takes two arguments: a sequence, and a function that it uses to operate over
the sequence and reduce it to a single thing.
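In Python terms, the built-in `map` and `functools.reduce` show exactly this split:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to each element independently (parallelizable in principle).
squares = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16, 25]

# Reduce: fold the sequence down to a single value with a combining function.
total = reduce(lambda acc, x: acc + x, squares, 0)  # 55

print(squares, total)
```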
MapReduce is stateless: variables are given as input and carried forward via
output. As such, any node that has the input data can complete the computation and
get the same result. By chaining the outputs of functions to the inputs of other
functions, we can guarantee that we will reach a final computation across the
entire dataset.
The MapReduce model consists of two main phases: the Map phase and the Reduce
phase.
Map Phase:
-In this phase, the input data is divided into smaller chunks, and a function
called the "mapper" is applied to each chunk independently.
-The mapper function takes the input data and transforms it into a set of key-value
pairs, where the key represents
some attribute of the data and the value represents the processed data itself.
-Each mapper operates independently and processes its chunk of data in parallel
across multiple machines in a distributed environment.
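For example, a word-count mapper written for Hadoop Streaming (which lets the mapper
and reducer be ordinary scripts that read records from stdin and write tab-separated
key-value pairs to stdout) could look like this sketch:

```python
#!/usr/bin/env python
# mapper.py - emits one (word, 1) pair per word, tab-separated,
# in the style expected by Hadoop Streaming.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```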
Reduce Phase:
-In this phase, the intermediate key-value pairs generated by the mappers are input
to a function called the "reducer."
-The reducer function takes the key and the corresponding list of values (grouped
by the key) and performs
some aggregation or computation on those values to produce the final output.
-Like the mapper phase, the reduce phase operates independently and in parallel
across multiple machines.
-The output of the reducer is typically written to a file or some other storage
system.
-The reducer is intended to aggregate the many values that are output from the map
phase in order
to transform a large volume of data into a smaller, more manageable set of summary
data, but has many other uses as well.
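The matching word-count reducer then sums the values for each key. Between the two
phases, Hadoop sorts the mapper output by key, so lines with the same word arrive
consecutively:

```python
#!/usr/bin/env python
# reducer.py - sums the counts per word, assuming the input is sorted by key
# (as Hadoop Streaming guarantees between the map and reduce phases).
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested locally with an ordinary shell pipeline, e.g.
`cat input.txt | python mapper.py | sort | python reducer.py`, before submitting it
to a cluster.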
In fact, the use of multiple MapReduce jobs to perform a single computation is how
more complex applications are constructed, through a process called “job chaining.”
By creating data flows through a system of intermediate MapReduce jobs, we can
create a pipeline of analytical steps that lead us to our end result.
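As a toy illustration of job chaining (pure Python, not a Hadoop API), the key-value
output of one map/reduce pass below becomes the input of the next:

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Minimal in-memory MapReduce pass: map, group by key, then reduce."""
    mapped = [kv for record in records for kv in mapper(record)]
    mapped.sort(key=itemgetter(0))
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

lines = ["to be or not to be", "to see or not to see"]

# Job 1: word count.
counts = run_job(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)

# Job 2 (chained): group words by how often they occur, using Job 1's output.
by_count = run_job(
    counts,
    mapper=lambda wc: [(wc[1], wc[0])],
    reducer=lambda count, words: (count, sorted(words)),
)

print(counts)    # [('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)]
print(by_count)  # [(2, ['be', 'not', 'or', 'see']), (4, ['to'])]
```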
The MapReduce API is written in Java, and therefore MapReduce jobs submitted to the
cluster are going to be compiled Java Archive (JAR) files.
Hadoop will transmit the JAR files across the network to each node that will run a
task (either a mapper or reducer) and the individual tasks of the MapReduce job are
executed.