Big Data Hadoop Stack
A new approach is to keep all the data we have, and to analyze it in new and
interesting ways, using a style called schema-on-read: structure is imposed on
the data when it is read, not when it is stored.
This allows new kinds of analysis. We can bring more data into simple
algorithms, and it has been shown that running simple algorithms over large,
fine-grained data sets often achieves better results than running really
complex analytics on a small amount of data.
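As a minimal sketch of schema-on-read in plain Java (the tab-separated events.log file and the Event field layout are hypothetical), the raw lines are stored with no enforced structure, and a schema is applied only at the moment of reading:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SchemaOnRead {
    // The structure we impose at read time; nothing enforced this at write time.
    record Event(String timestamp, String user, String action) {}

    public static void main(String[] args) throws Exception {
        // events.log: raw tab-separated lines, written with no schema checks.
        try (Stream<String> lines = Files.lines(Paths.get("events.log"))) {
            lines.map(line -> line.split("\t", 3))
                 .filter(fields -> fields.length == 3) // rows that don't fit are skipped
                 .map(fields -> new Event(fields[0], fields[1], fields[2]))
                 .forEach(System.out::println);
        }
    }
}
```

If the analysis changes later, only this read-side parsing changes; the stored data stays exactly as it was collected.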
Apache Hadoop Framework
& its Basic Modules
The two major pieces of Hadoop are the Hadoop Distributed File System (HDFS) and
MapReduce, a parallel processing framework that will map and reduce data.
Both are open source and were inspired by technologies developed at Google.
When we talk about this high-level infrastructure, we start talking about things
like TaskTrackers and JobTrackers, NameNodes and DataNodes.
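To make the HDFS side concrete, here is a minimal sketch that reads a file through Hadoop's FileSystem API. It assumes the client's fs.defaultFS points at the cluster's NameNode; the path /data/sample.txt is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The client only asks the NameNode where the file's blocks live; the bytes themselves stream directly from the DataNodes.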
HDFS
Hadoop Distributed File System
The typical MapReduce engine consists of a JobTracker, to which client
applications submit MapReduce jobs. The JobTracker pushes work out to the
available TaskTrackers in the cluster, striving to keep the work as close to
the data as possible and as balanced as possible.
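To see what such a job looks like, here is the classic word-count example written against Hadoop's standard org.apache.hadoop.mapreduce API; this follows the canonical example from the Hadoop documentation rather than anything invented here:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in this task's input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, it would be submitted with something like hadoop jar wordcount.jar WordCount <input> <output>, and the framework distributes the map and reduce tasks across the cluster.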
Hadoop 2.0 provides a more general processing platform that is not constrained
to map-and-reduce kinds of processing.
The fundamental idea behind MapReduce 2.0 is to split the two major
functionalities of the JobTracker, resource management and job
scheduling/monitoring, into two separate units. The idea is to have a global
ResourceManager and a per-application ApplicationMaster.
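As a small sketch of talking to that global ResourceManager, the snippet below uses the YarnClient API to list the applications the ResourceManager currently knows about; the cluster address comes from the client's yarn-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from yarn-site.xml.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // One report per application the global ResourceManager is tracking.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```

Each of these applications has its own ApplicationMaster negotiating resources with the ResourceManager on its behalf.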
What is YARN?
YARN enhances the power of a Hadoop compute cluster, without being limited to
the MapReduce kind of framework.
Its scalability is great. The processing power in data centers continues to
grow quickly, and because the YARN ResourceManager focuses exclusively on
scheduling, it can manage those very large clusters quite quickly and easily.
YARN is completely compatible with MapReduce. Existing MapReduce applications
and end users can run on top of YARN without disrupting any of their existing
processes.
It also delivers improved cluster utilization. The ResourceManager is a pure
scheduler: it optimizes cluster utilization according to criteria such as
capacity guarantees, fairness, and SLAs (service-level agreements).
In short: scalability, MapReduce compatibility, and improved cluster utilization.
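As a sketch of how those scheduling criteria surface to an application, the snippet below routes a job to a named scheduler queue via the standard mapreduce.job.queuename property; the queue name "analytics" is hypothetical and would have to exist in the cluster's scheduler configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "analytics" is a hypothetical queue defined by the scheduler
        // (for example in capacity-scheduler.xml); its capacity and SLA
        // settings then govern when this job's containers are granted.
        conf.set("mapreduce.job.queuename", "analytics");
        Job job = Job.getInstance(conf, "queued job");
        // ... configure mapper, reducer, and paths as in the word-count
        // example, then submit with job.waitForCompletion(true) ...
    }
}
```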
Many teams had the original MapReduce and were storing and processing large
amounts of data with it. They wanted to be able to access that data, and access
it in a SQL-like language. So they built a SQL gateway to ingest data into the
MapReduce cluster and to be able to query some of that data as well.
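Apache Hive is the best-known example of such a SQL gateway over Hadoop. A minimal sketch, assuming a reachable HiveServer2 endpoint and a hypothetical weblogs table, and requiring the hive-jdbc driver on the classpath, queries it over plain JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

Behind that one SQL statement, the gateway compiles the query down into jobs that run on the cluster, so analysts never have to write map and reduce code by hand.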