Big Data Hadoop Stack
A new approach is to keep all the data we have, and to analyze it in new and
interesting ways, using a style called schema-on-read: structure is imposed on
the data when it is read, not when it is stored.
This allows new kinds of analysis. We can bring more data into simple
algorithms, and it has been shown that running simple algorithms over large,
fine-grained data sets often achieves better results than running really
complex analytics on a small amount of data.
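As a minimal sketch of schema-on-read in plain Java (the tab-separated events.log file and the Event field layout are hypothetical), the raw lines are stored with no enforced structure, and a schema is applied only at the moment of reading:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SchemaOnRead {
    // The structure we impose at read time; nothing enforced this at write time.
    record Event(String timestamp, String user, String action) {}

    public static void main(String[] args) throws Exception {
        // events.log: raw tab-separated lines, written with no schema checks.
        try (Stream<String> lines = Files.lines(Paths.get("events.log"))) {
            lines.map(line -> line.split("\t", 3))
                 .filter(fields -> fields.length == 3) // rows that don't fit are skipped
                 .map(fields -> new Event(fields[0], fields[1], fields[2]))
                 .forEach(System.out::println);
        }
    }
}
```

If the analysis changes later, only this read-side parsing changes; the stored data stays exactly as it was collected.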
Apache Hadoop Framework
& its Basic Modules
The two major pieces of Hadoop are the Hadoop Distributed File System (HDFS) and
MapReduce, a parallel processing framework that will map and reduce data.
Both are open source and were inspired by technologies developed at Google.
When we talk about this high-level infrastructure, we start talking about things
like TaskTrackers and JobTrackers, NameNodes and DataNodes.
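To make the HDFS side concrete, here is a minimal sketch that reads a file through Hadoop's FileSystem API. It assumes the client's fs.defaultFS points at the cluster's NameNode; the path /data/sample.txt is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The client only asks the NameNode where the file's blocks live; the bytes themselves stream directly from the DataNodes.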
HDFS
Hadoop Distributed File System
The typical MapReduce engine consists of a JobTracker, to which client
applications submit MapReduce jobs. The JobTracker pushes work out to the
available TaskTrackers in the cluster, striving to keep the work as close to
the data as possible and as balanced as possible.
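To see what such a job looks like, here is the classic word-count example written against Hadoop's standard org.apache.hadoop.mapreduce API; this follows the canonical example from the Hadoop documentation rather than anything invented here:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in this task's input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, it would be submitted with something like hadoop jar wordcount.jar WordCount <input> <output>, and the framework distributes the map and reduce tasks across the cluster.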
Hadoop 2.0 provides a more general processing platform that is not constrained
to map-and-reduce kinds of processing.
The fundamental idea behind MapReduce 2.0 is to split the two major
functionalities of the JobTracker, resource management and job
scheduling/monitoring, into two separate units. The idea is to have a global
ResourceManager and a per-application ApplicationMaster.
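As a small sketch of talking to that global ResourceManager, the snippet below uses the YarnClient API to list the applications the ResourceManager currently knows about; the cluster address comes from the client's yarn-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from yarn-site.xml.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // One report per application the global ResourceManager is tracking.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```

Each of these applications has its own ApplicationMaster negotiating resources with the ResourceManager on its behalf.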
What is YARN?
YARN enhances the power of a Hadoop compute cluster, without being limited to
the MapReduce kind of framework.
Its scalability is great. The processing power in data centers continues to
grow quickly, and because the YARN ResourceManager focuses exclusively on
scheduling, it can manage those very large clusters quite quickly and easily.
YARN is completely compatible with MapReduce. Existing MapReduce applications
and end users can run on top of YARN without disrupting any of their existing
processes.
It also delivers improved cluster utilization. The ResourceManager is a pure
scheduler: it optimizes cluster utilization according to criteria such as
capacity guarantees, fairness, and SLAs (service-level agreements).
In short: scalability, MapReduce compatibility, and improved cluster utilization.
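As a sketch of how those scheduling criteria surface to an application, the snippet below routes a job to a named scheduler queue via the standard mapreduce.job.queuename property; the queue name "analytics" is hypothetical and would have to exist in the cluster's scheduler configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "analytics" is a hypothetical queue defined by the scheduler
        // (for example in capacity-scheduler.xml); its capacity and SLA
        // settings then govern when this job's containers are granted.
        conf.set("mapreduce.job.queuename", "analytics");
        Job job = Job.getInstance(conf, "queued job");
        // ... configure mapper, reducer, and paths as in the word-count
        // example, then submit with job.waitForCompletion(true) ...
    }
}
```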
Many teams had the original MapReduce and were storing and processing large
amounts of data with it. They wanted to be able to access that data, and access
it in a SQL-like language. So they built a SQL gateway to ingest data into the
MapReduce cluster and to be able to query some of that data as well.
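Apache Hive is the best-known example of such a SQL gateway over Hadoop. A minimal sketch, assuming a reachable HiveServer2 endpoint and a hypothetical weblogs table, and requiring the hive-jdbc driver on the classpath, queries it over plain JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

Behind that one SQL statement, the gateway compiles the query down into jobs that run on the cluster, so analysts never have to write map and reduce code by hand.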