Big Data
Kailash S
C-DAC
Chennai
Properties
Scalability
Data I/O Performance
Fault tolerance
Real-time processing
Data size supported
Iterative task support
Scalability
Horizontal scaling (scale out):
  Peer to peer: MPI
  Apache Hadoop: HDFS, YARN, MapReduce
  Spark
Vertical scaling (scale up)
Types of Analytics
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics
Database
Sqoop (SQL + Hadoop): a tool for transferring bulk data between relational databases and Hadoop.
Terminology
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop MapReduce
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper
HADOOP
What is Hadoop?
An open-source Apache framework for the distributed storage and processing of large data sets across clusters of commodity hardware.
Hadoop Architecture
[Diagram: input data is split into DFS blocks that are replicated across the Hadoop cluster; a Map task runs on each block and a Reduce step combines the map outputs into the final results.]
Architecture of Hadoop DB
What is HDFS?
The Hadoop Distributed File System: a fault-tolerant file system that stores large files as replicated blocks across the nodes of a cluster.
HDFS Architecture
Architecture
Hadoop is based on a master-slave architecture.
An HDFS cluster consists of a single NameNode (the master) and a number of DataNodes (the slaves).
[Diagram: an application on the HDFS client sees a local-file-system-style interface with small blocks (e.g. 2K), while HDFS itself stores data in large blocks (128M) that are replicated across DataNodes.]
HDFS Architecture
[Diagram: HDFS architecture. A client issues metadata operations (file name, replica count, e.g. /home/foo/data, 6, ...) to the NameNode, then reads blocks (B) directly from the DataNodes and writes blocks to them; blocks are replicated across DataNodes on different racks (Rack 1, Rack 2).]
MapReduce
MapReduce Architecture
Parallel Execution
Map
The master node takes the input, chops it up into
smaller sub-problems, and distributes those to
worker nodes.
A worker node may do this again in turn, leading
to a multi-level tree structure.
The worker node processes that smaller problem,
and passes the answer back to its master node.
Reduce
The master node then takes the answers to
all the sub-problems and combines them
in a way to get the output - the answer to
the problem it was originally trying to
solve.
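The master/worker flow described above can be sketched as a minimal word-count job in plain Python. The function names here are made up for illustration; real Hadoop jobs implement Mapper and Reducer classes against the Java API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Worker step: turn one input chunk into (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    """Combine all values collected for one key into a final answer."""
    return key, sum(values)

def run_job(chunks):
    # Master: distribute chunks to "workers" and gather intermediate pairs.
    intermediates = defaultdict(list)
    for chunk in chunks:                 # each chunk goes to a worker node
        for key, value in map_phase(chunk):
            intermediates[key].append(value)
    # Master: combine the sub-answers into the final output.
    return dict(reduce_phase(k, v) for k, v in intermediates.items())

result = run_job(["big data big", "data cluster"])
# result == {"big": 2, "data": 2, "cluster": 1}
```

Here the map and reduce steps run sequentially in one process; in Hadoop each chunk would be processed by a separate worker node in parallel.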
[Diagram: the JobTracker on the master node distributes work to TaskTracker daemons on the slave nodes; each TaskTracker runs task instances. In the example, parse-hash and count tasks emit partitioned outputs P-0000 (count1), P-0001 (count2), and P-0002 (count3).]
[Diagram: MapReduce dataflow. Map tasks read input key/value pairs from data stores 1..n and emit (key, values) pairs; the pairs are grouped so that each key's intermediate values are collected together; reduce tasks then turn each key's intermediate values into the final values for that key.]
Creating mapper
[Diagram: creating mappers. The InputFormat divides the input files into InputSplits; a RecordReader turns each split into key/value records, which are fed to a Mapper instance; each Mapper emits intermediate pairs.]
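The split-and-read pipeline can be imitated in a few lines of Python. These are illustrative stand-ins only; Hadoop's real InputFormat, RecordReader, and Mapper classes live in the Java API.

```python
def input_splits(text, split_size):
    """InputFormat: chop the input into fixed-size splits."""
    return [text[i:i + split_size] for i in range(0, len(text), split_size)]

def record_reader(split):
    """RecordReader: present a split as (offset, record) key/value pairs.

    Real record readers realign to record boundaries; here we simply
    drop the empty fragment an arbitrary cut can leave behind.
    """
    return list(enumerate(r for r in split.split(";") if r))

def mapper(key, value):
    """Mapper: emit an intermediate pair, here (record, its length)."""
    return (value, len(value))

splits = input_splits("aa;bb;cc;dd", 6)          # ["aa;bb;", "cc;dd"]
intermediates = [mapper(k, v) for s in splits for k, v in record_reader(s)]
# intermediates == [("aa", 2), ("bb", 2), ("cc", 2), ("dd", 2)]
```

Note how the record separator, not the split boundary, decides where records begin and end; this is why Hadoop can split files at arbitrary byte offsets.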
Reading data
File input format and friends
Filtering file inputs
Record readers
Input split size
Sending data to reducers
Writable comparator
Sending data to client
Shuffling
[Diagram: each Mapper's intermediate pairs pass through a Partitioner, which assigns every pair to one Reducer; the per-reducer intermediates are then merged and handed to the Reducers.]
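The partitioning step can be sketched as follows. This mirrors the hash-partitioning idea behind Hadoop's default partitioner (the key's hash modulo the number of reducers); the function names are invented for the sketch.

```python
def partition(key, num_reducers):
    """Assign an intermediate key to one reducer, deterministically."""
    return hash(key) % num_reducers

def shuffle(intermediates, num_reducers):
    """Group intermediate pairs into per-reducer buckets before the reduce phase."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in intermediates:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

buckets = shuffle([("big", 1), ("data", 1), ("big", 1)], 2)
# Equal keys always land in the same bucket, so one reducer
# sees all of a key's values.
```

The essential property is not which bucket a key lands in, but that every occurrence of a key lands in the same bucket.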
Reduction
Output format
[Diagram: each Reducer writes its results through a RecordWriter supplied by the OutputFormat, producing one output file per reducer.]
HBase
A distributed, column-oriented database. Supports both batch-style computations using MapReduce and point queries (random reads).
Pig
A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
ZooKeeper
A distributed, highly available coordination service for building distributed applications.
Hive
Manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
Hive
A database/data warehouse built on top of Hadoop
Rich data types (structs, lists, and maps)
Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce
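As a rough illustration of how a SQL group-by rides on MapReduce (the table and column names here are hypothetical), a query like SELECT dept, SUM(salary) ... GROUP BY dept boils down to mapping each row to a (dept, salary) pair and summing the values per key:

```python
from collections import defaultdict

rows = [("sales", 100), ("eng", 200), ("sales", 50)]

# Map: emit (group key, aggregated column) for every row.
pairs = [(dept, salary) for dept, salary in rows]

# Reduce: sum the values per key, as the generated MapReduce job would.
totals = defaultdict(int)
for dept, salary in pairs:
    totals[dept] += salary
# totals == {"sales": 150, "eng": 200}
```

Hive's planner generates jobs of essentially this shape, with the group-by column as the intermediate key.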
Hive Architecture
[Diagram: Hive architecture. Clients (the Hive CLI, a web UI, and the Thrift API) submit HiveQL for browsing, queries, and DDL; the Parser and Planner compile queries, the Execution layer runs them as MapReduce jobs over HDFS, the MetaStore holds table metadata, and SerDe libraries (Thrift, Jute, JSON) serialize and deserialize the data.]
Data Warehousing
Stream Computing
Spark
Developed at the University of California, Berkeley, as part of the Berkeley Data Analytics Stack.
Reduces disk I/O limitations through in-memory computation (claimed up to 100x faster than MapReduce for some workloads).
APIs in Java, Scala, and Python.
INSTALLATION OF PACKAGES
Linux packages can be installed two ways:
  Binary installation: manual, or automated via a package manager
  Source installation: manual
Either route covers the main package and its configuration.
Installing Hadoop
http://www.motorlogy.com/apachemirror//hadoop/core/hadoop0.20.2/hadoop-0.20.2.tar.gz
Prerequisites
JDK 1.5 or above
SSH
Configuring Hadoop
Steps to configure
Map the IP address of each node to its host name in the /etc/hosts file.
Add the host names of the master node and the data nodes to the slaves file.
Create an SSH RSA key on the master node and copy it to all the data nodes, so the master can log in to them without a password.
Configuring core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>
Configuring hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Configuring mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:8021</value>
  </property>
</configuration>
Starting Hadoop
Format the NameNode with hadoop namenode -format, then start the daemons with start-all.sh. Basic HDFS shell commands:
hadoop fs -ls
Lists files on HDFS.
hadoop fs -mkdir abcd
Creates a directory on HDFS.