
Module -2

Introduction to Hadoop
2.1 Big Data Programming Model
One programming model is centralized computing, in which data is transferred from multiple distributed data sources to a central server. Analyzing, reporting, visualizing and business-intelligence tasks then compute centrally, with the data as inputs to the central server.

Another programming model is distributed computing, which uses databases at multiple computing nodes with data sharing between the nodes during computation. Distributed computing in this model requires cooperation (sharing) between the DBs in a transparent manner. Transparent means that each user within the system may access all the data within all databases as if they were a single database. A second requirement is location independence: analysis results should be independent of geographical locations. However, the access of one computing node to other nodes may fail due to a single link failure.

Code as well as data is distributed at the computing nodes. Transparency between data nodes does not hold for Big Data when distributed computing takes place using data sharing between local and remote nodes, for the following reasons:

• Distributed data storage systems do not use the concept of joins.

• Data need to be fault-tolerant, and data stores should take into account the possibility of network failure. When data are partitioned into data blocks and written at one set of nodes, those blocks need replication at multiple nodes. This takes care of possible network faults: when a network fault occurs, a replica node makes the data available.

Big Data stores follow a theorem known as the CAP theorem. CAP states that of the three properties (consistency, availability and partition tolerance), a distributed data store can guarantee at most two at the same time for applications, services and processes.

i. Big Data Store Model

A model for Big Data store is as follows:

Data are stored in a file system consisting of data blocks (physical divisions of data). The data blocks are distributed across multiple nodes. Data nodes are at the racks of a cluster. Racks are scalable.

A Rack has multiple data nodes (data servers), and each cluster is arranged in a number of racks.

Hadoop uses a data store model in which files are in data nodes, in racks, in clusters; storage is thus organized at clusters, racks, data nodes and data blocks. Data blocks replicate at the DataNodes such that a failure of a link leads to access of the data block from the other nodes on which it is replicated, at the same or other racks.

ii. Big Data Programming Model

The Big Data programming model is one in which application jobs and tasks (or sub-tasks) are scheduled on the same servers which store the data to be processed.

2.2 Hadoop and its ecosystem


Hadoop is a computing environment in which input data is stored, processed and the results are stored. The environment consists of clusters, which are distributed at the cloud or at a set of servers. Each cluster consists of a string of data files constituting data blocks. The name Hadoop comes from a toy, a stuffed elephant; the Hadoop system cluster likewise stuffs files into data blocks. The complete system consists of a scalable, distributed set of clusters.

The infrastructure consists of a cloud for clusters. A cluster consists of sets of computers or PCs. The Hadoop platform provides a low-cost Big Data platform, which is open source and uses cloud services. Processing terabytes of data takes just a few minutes. Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce. The system characteristics are scalability, self-manageability, self-healing and a distributed file system.

Scalable means the system can be scaled up (enhanced) by adding storage and processing units as per the requirements. Self-manageable means creation of storage and processing resources which are used, scheduled and reduced or increased with the help of the system itself. Self-healing means that, in case of faults, they are taken care of by the system itself; self-healing keeps functions and resources available. The software detects and handles failures at the task level, and enables service or task execution even in case of communication or node failure.

Hadoop Core Components

Figure 2.1 Core components of Hadoop

The Hadoop core components of the framework are:

Hadoop Common - The common module contains the libraries and utilities that are required
by the other modules of Hadoop. For example, Hadoop common provides various components
and interfaces for distributed file system and general input/output. This includes serialization,
Java RPC (Remote Procedure Call) and file-based data structures.

Hadoop Distributed File System (HDFS) - A Java-based distributed file system which can
store all kinds of data on the disks at the clusters.

MapReduce v1 - Software programming model in Hadoop 1 using Mapper and Reducer. MapReduce v1 processes large sets of data in parallel and in batches.

YARN - Software for managing resources for computing. The user application tasks or sub-tasks run in parallel at Hadoop; YARN uses scheduling and handles the requests for resources in distributed running of the tasks.

MapReduce v2 - Hadoop 2 YARN-based system for parallel processing of large datasets and
distributed processing of the application tasks.

2.2.2 Features of Hadoop

Hadoop features are as follows:

1. Fault-efficient, scalable, flexible and modular design which uses a simple and modular programming model. The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data. Hadoop proves very helpful in storing, managing, processing and analyzing Big Data.

2. Robust design of HDFS: Execution of Big Data applications continues even when an individual server or cluster fails. This is because of Hadoop provisions for backup (due to replication at least three times for each data block) and a data-recovery mechanism. HDFS thus has high reliability.

3. Store and process Big Data: Processes Big Data of 3V characteristics.

4. Distributed cluster computing model with data locality: Processes Big Data at high speed as the application tasks and sub-tasks submit to the DataNodes. One can achieve more computing power by increasing the number of computing nodes. The processing splits across multiple DataNodes (servers), giving fast processing and aggregated results.

5. Hardware fault-tolerant: A fault does not affect data and application processing. If a node
goes down, the other nodes take care of the residue. This is due to multiple copies of all data
blocks which replicate automatically. Default is three copies of data blocks.

6. Open-source framework: Open source access and cloud services enable large data store.
Hadoop uses a cluster of multiple inexpensive servers or the cloud.

7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux, but it has its own set of shell-command support.

2.2.3 Hadoop Ecosystem Components

The four layers in Figure 2.2 are as follows:

(i) Distributed storage layer

(ii) Resource-manager layer for job or application sub-tasks scheduling and execution

(iii) Processing-framework layer, consisting of Mapper and Reducer for the MapReduce
process-flow.

(iv) APIs at the application-support layer (applications such as Hive and Pig). The codes communicate and run using MapReduce or YARN at the processing-framework layer. Reducer output communicates to the APIs (Figure 2.2).

Figure 2.2 Hadoop main components and ecosystem components

AVRO enables data serialization between the layers. Zookeeper enables coordination among
layer components.

The holistic view of the Hadoop architecture provides an idea of the implementation of the Hadoop ecosystem components. Client hosts run applications using Hadoop ecosystem projects, such as Pig, Hive and Mahout.

2.3 HADOOP DISTRIBUTED FILE SYSTEM


HDFS is a core component of Hadoop. HDFS is designed to run on a cluster of computers and
servers at cloud-based utility services.

HDFS stores Big Data which may range from GBs (1 GB = 2^30 B) to PBs (1 PB = 10^15 B, nearly 2^50 B). HDFS stores the data in a distributed manner in order to compute fast. The distributed data store in HDFS stores data in any format regardless of schema.

2.3.1 HDFS Storage

The Hadoop data store concept implies storing data at a number of clusters. Each cluster has a number of data stores, called racks. Each rack stores a number of DataNodes. Each DataNode has a large number of data blocks. The racks distribute across a cluster. The nodes have processing and storage capabilities, and hold the data in data blocks to run the application tasks. The data blocks replicate by default at least three times at DataNodes on the same or remote racks.

Data at the stores enable running the distributed applications, including analytics, data mining and OLAP, using the clusters. A file containing the data divides into data blocks. The default data-block size is 64 MB.
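
As an illustration (not part of the original text), the block size and replication factor that govern this storage layout are ordinary Hadoop configuration properties. The short Java sketch below reads them through the standard Configuration API; dfs.blocksize and dfs.replication are Hadoop's usual property names, and the fallback values simply mirror the defaults quoted above.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsStorageSettings {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath, if present
        Configuration conf = new Configuration();

        // dfs.replication: number of copies kept for every data block (default 3)
        int replication = conf.getInt("dfs.replication", 3);

        // dfs.blocksize: size of one data block in bytes (64 MB in this chapter's example)
        long blockSize = conf.getLong("dfs.blocksize", 64L * 1024 * 1024);

        System.out.println("Replication factor : " + replication);
        System.out.println("Block size (MB)    : " + blockSize / (1024 * 1024));
    }
}
```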

Hadoop HDFS features are as follows:

i. Create, append, delete, rename and attribute modification functions

ii. Content of an individual file cannot be modified or replaced, but can only be appended with new data at the end of the file

iii. Write once but read many times during usages and processing

iv. Average file size can be more than 500 MB.

Figure 2.3 A Hadoop cluster example

Consider a data storage for University students. Each student's data, stuData, is in a file of size less than 64 MB (1 MB = 2^20 B). A data block stores the full file data for a student stuData_idN, where N = 1 to 500.

How will the files of each student be distributed at a Hadoop cluster? How many students' data can be stored at one cluster? Assume that each rack has two DataNodes for processing, each of 64 GB (1 GB = 2^30 B) memory. Assume that the cluster consists of 120 racks, and thus 240 DataNodes.

i. What is the total memory capacity of the cluster in TB (1 TB = 2^40 B) and of the DataNodes in each rack?

ii. Show the distributed blocks for the students with ID = 96 and ID = 1025. Assume default replication in the DataNodes = 3.

iii. What changes when a stuData file is of size 128 MB?

SOLUTION
i. The data-block default size is 64 MB. Each student's file size is less than 64 MB. Therefore, one data block suffices for each student file. A data block is at a DataNode. Assume, for simplicity, that each rack has two nodes, each of memory capacity 64 GB. Each node can thus store 64 GB/64 MB = 1024 data blocks = 1024 student files. Each rack can thus store 2 x 64 GB/64 MB = 2048 data blocks = 2048 student files. Each data block by default replicates three times in the DataNodes. Therefore, the number of students whose data can be stored in the cluster = (number of racks x number of files per rack)/3 = 120 x 2048/3 = 81920. Therefore, a maximum of 81920 stuData_idN files can be distributed per cluster, with N = 1 to 81920.

ii. Total memory capacity of the cluster = 120 racks x 2 x 64 GB = 15360 GB = 15 TB. Total memory capacity of each DataNode in each rack = 1024 x 64 MB = 64 GB.

iii. Figure 2.3 shows a Hadoop cluster example and the replication of data blocks in racks for the two students with IDs 96 and 1025.

iv. When a stuData file is of size 128 MB, each file needs two data blocks of 64 MB capacity each; each node will therefore hold half the number of student files.
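
The arithmetic of the solution above can be checked with a short program. The sketch below is an illustrative addition (class and variable names are made up, not from the source); it recomputes the blocks per node, blocks per rack, the maximum number of stuData files and the cluster capacity from the same assumptions: 64 MB blocks, 64 GB DataNodes, 2 DataNodes per rack, 120 racks and replication 3.

```java
public class ClusterCapacityCheck {
    public static void main(String[] args) {
        long blockSizeMB = 64;     // HDFS block size used in the example
        long nodeMemoryGB = 64;    // memory per DataNode
        int nodesPerRack = 2;
        int racks = 120;
        int replication = 3;       // default replication factor

        long blocksPerNode = (nodeMemoryGB * 1024) / blockSizeMB;           // 1024 blocks
        long blocksPerRack = nodesPerRack * blocksPerNode;                  // 2048 blocks
        long maxStudentFiles = (long) racks * blocksPerRack / replication;  // 81920 files

        long clusterCapacityGB = (long) racks * nodesPerRack * nodeMemoryGB; // 15360 GB
        double clusterCapacityTB = clusterCapacityGB / 1024.0;               // 15 TB

        System.out.println("Blocks per node      : " + blocksPerNode);
        System.out.println("Blocks per rack      : " + blocksPerRack);
        System.out.println("Max stuData files    : " + maxStudentFiles);
        System.out.println("Cluster capacity (TB): " + clusterCapacityTB);
    }
}
```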

2.3.1.1 Hadoop Physical organization

Figure 2.4 shows the client, the master NameNode, the primary and secondary Master Nodes and the slave nodes in the Hadoop physical architecture. Clients, as the users, run applications with the help of Hadoop ecosystem projects; for example, Hive, Mahout and Pig are ecosystem projects. They are not required to be present at the Hadoop cluster.

Figure 2.4 The client, master Name Node, Master Nodes and slave nodes

A single Master Node provides HDFS, MapReduce and HBase using threads in small to medium sized clusters. When the cluster size is large, multiple servers are used, for example to balance the load. The secondary NameNode provides NameNode management services, and ZooKeeper is used by HBase for metadata storage.

The Master Node fundamentally plays the role of a coordinator. The Master Node receives client connections, maintains the description of the global file-system namespace and the allocation of file blocks. It also monitors the state of the system in order to detect any failure. The Masters consist of three components: NameNode, Secondary NameNode and JobTracker. The NameNode stores all the file-system related information, such as:

• In which part of the cluster each file section is stored

• Last access time for the files

• User permissions, such as which user has access to the file.


The Secondary NameNode is an alternate for the NameNode. The secondary node keeps a copy of the NameNode metadata. Thus, the stored metadata can be rebuilt easily in case of NameNode failure. The JobTracker coordinates the parallel processing of data.

Hadoop 2

• A single NameNode failure in Hadoop 1 is an operational limitation.

• Scaling was restricted: clusters could not scale beyond a few thousand DataNodes and a limited number of clusters.

• Hadoop 2 provides multiple NameNodes, which enables higher resource availability.

2.3.1.2 HDFS Commands
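
The source text lists no commands under this heading. As an illustrative sketch only: typical file operations on HDFS are issued through the hdfs dfs shell (for example hdfs dfs -mkdir, hdfs dfs -put, hdfs dfs -ls, hdfs dfs -cat) or, equivalently, through the Java FileSystem API. The example below is hypothetical (the path /user/stuData and the file name stuData_id1.txt are invented for illustration) and assumes an HDFS cluster reachable through the configuration in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCommandsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the configured HDFS

        // Equivalent of: hdfs dfs -mkdir /user/stuData
        fs.mkdirs(new Path("/user/stuData"));

        // Equivalent of: hdfs dfs -put stuData_id1.txt /user/stuData/
        fs.copyFromLocalFile(new Path("stuData_id1.txt"),
                             new Path("/user/stuData/stuData_id1.txt"));

        // Equivalent of: hdfs dfs -ls /user/stuData
        for (FileStatus status : fs.listStatus(new Path("/user/stuData"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```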

2.4 MAPREDUCE FRAMEWORK AND PROGRAMMING MODEL

Mapper means software for doing the assigned task after organizing the data blocks imported using the keys. A key is specified in the command line of the Mapper. The command maps the key to the data, which an application uses.

Reducer means software for reducing the mapped data by using an aggregation, query or user-specified function. The Reducer provides a concise, cohesive response for the application.

Aggregation function means a function that groups the values of multiple rows together to give a single value of more significant meaning or measurement; for example, count, sum, maximum, minimum, deviation and standard deviation.

Querying function means a function that finds the desired values; for example, a function for finding the best student of a class, who has shown the best performance in the examination.

MapReduce allows writing applications that reliably process huge amounts of data, in parallel, on large clusters of servers. The cluster size does not, as such, limit parallel processing. The parallel programs of MapReduce are useful for performing large-scale data analysis using multiple machines in the cluster.
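
As a concrete illustration of the Mapper, Reducer and count-aggregation ideas described above, the sketch below follows the standard Apache Hadoop word-count example (an illustration, not the textbook's own program): the Mapper emits (word, 1) pairs from the input data blocks, and the Reducer applies the count/sum aggregation function to produce a concise result for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in an input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: aggregates the counts for each word (the "count" aggregation function)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job is packaged as a JAR and submitted to the cluster, so that the map tasks run on the DataNodes that already hold the input data blocks (data locality), and the reduce tasks aggregate the mapped output.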

Features of the MapReduce framework are as follows:

• Provides automatic parallelization and distribution of computation based on several processors

• Processes data stored on distributed clusters of DataNodes and racks

• Allows processing large amounts of data in parallel

• Provides scalability for usages of a large number of servers

• Provides the MapReduce batch-oriented programming model in Hadoop version 1

• Provides additional processing modes in the Hadoop 2 YARN-based system and enables the required parallel processing; for example, for queries, graph databases, streaming data, messages, real-time OLAP and ad hoc analytics with Big Data 3V characteristics.

2.5 HADOOP YARN


• YARN is a resource-management platform. It manages the computer resources.

• YARN manages the schedules for running the sub-tasks. Each sub-task uses the resources in the allotted time interval.

• YARN separates the resource-management and processing components.

• YARN stands for Yet Another Resource Negotiator. It manages and allocates resources for the application sub-tasks and submits the resources for them in the Hadoop system.

Hadoop 2 Execution Model

Figure 2.5 YARN based Execution Model


Figure 2.5 shows the YARN components: Client, Resource Manager (RM), Node Manager (NM), Application Master (AM) and Containers.


List of actions of YARN resource allocation and scheduling functions is as follows:

A Master Node has two components: (i) Job History Server and (ii) Resource Manager(RM).

A Client Node submits the request of an application to the RM. The RM is the master. One RM
exists per cluster. The RM keeps information of all the slave NMs. Information is about the
location (Rack Awareness) and the number of resources (data blocks and servers) they have. The
RM also renders the Resource Scheduler service that decides how to assign the resources. It,
therefore, performs resource management as well as scheduling.

Multiple NMs are at a cluster. An NM creates an AM instance (AMI) and starts it up. The AMI initializes itself and registers with the RM. Multiple AMIs can be created in an AM.

The AMI performs the role of an Application Manager (ApplM), which estimates the resource requirements for running an application program or sub-task. The ApplMs send their requests for the necessary resources to the RM. Each NM includes several containers for use by the sub-tasks of the application.

An NM is a slave of the infrastructure. It signals whenever it initializes. All active NMs send a controlling (heartbeat) signal periodically to the RM, signaling their presence.
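
Because the RM keeps the cluster-wide view of all applications and their resources, a client can query it directly. The sketch below is an illustrative example (not from the source text) that uses the Hadoop 2 YarnClient API to connect to the RM configured in yarn-site.xml and print the state of the submitted applications.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager named in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The RM holds the cluster-wide view, so one call returns every application
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```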

2.6 HADOOP ECOSYSTEM TOOLS


ZooKeeper - A coordination service: Provisions a high-performance coordination service for distributed running of applications and tasks.

Avro - A data serialization and transfer utility: Provisions data serialization during data transfer between the application and processing layers.

Oozie: Provides a way to package and bundle multiple coordinator and workflow jobs and to manage the lifecycle of those jobs.

Sqoop (SQL-to-Hadoop) - A data-transfer software: Provisions data transfer between data stores such as relational DBs and Hadoop.

Flume - A large data transfer utility: Provisions reliable data transfer and provides for recovery in case of failure. Transfers large amounts of data in applications, such as those related to social-media messages.

Ambari - A web-based tool: Provisions, monitors, manages and views the functioning of the cluster, MapReduce, Hive and Pig APIs.

Chukwa - A data collection system: Provisions and manages a data collection system for large and distributed systems.

HBase - A structured data store using a database: Provisions a scalable and structured database for large tables.

Cassandra - A database: Provisions a scalable and fault-tolerant database for multiple masters.

Hive - A data warehouse system: Provisions data aggregation, data summarization, data-warehouse infrastructure, ad hoc (unstructured) querying and an SQL-like scripting language for query processing using HiveQL.

Pig - A high-level dataflow language: Provisions dataflow (DF) functionality and the execution framework for parallel computations.

Mahout: Provisions scalable machine learning and library functions for data mining and analytics.
