This document discusses data storage and processing using the Hadoop Distributed File System (HDFS). It defines HDFS components such as the NameNode, DataNode, JobTracker, and TaskTracker, explains different data types, big data, and the limitations of RDBMS. HDFS uses blocks and replication for reliability; configuration files define parameters such as block size; HDFS operations include read, write, and delete; and HDFS is optimized for streaming access patterns on large data sets. The notes also cover Linux and HDFS shell commands, Hive (databases, tables, partitions, bucketing, views, indexes, and complex types), and the basics of Pig.


Bismilla 786

Data is the new oil


90% of the data that exists today was generated in the last two years.
Around 2.5 billion GB of data is generated every day.
Reasons:
1. Advances in technology
2. Increased internet usage (Facebook, Twitter, Gmail, etc.)
3. Decrease in storage cost
4. Increase in online shopping
5. Increase in online banking
Data can be categorized into 3 types:
1. Structured data: comes from RDBMS systems.
2. Unstructured data: comes from audio, video, pictures, etc.
3. Semi-structured data: comes from JSON, XML, and text data.
What is Big Data: a huge amount of data (structured, unstructured, or semi-structured) that cannot be handled using an RDBMS is called Big Data.
Different RDBMS tools used for data storage:
1. Oracle
2. MySQL
3. DB2, etc.
Drawbacks of RDBMS:
1. Storage is one of the main drawbacks.
2. Only structured data can be processed.
3. Speed is low when storing and processing TBs of data.
4. If updates are made in an RDBMS, only the last updated value is stored; previous values are not retained.
Data storage: to store data in HDFS we need a NameNode and DataNodes.
Data processing: to process data we need a JobTracker and TaskTrackers.
Metadata: data about data.
File system: a system used to store files.
Daemon: a background service that has no physical (user-facing) appearance.
NameNode and JobTracker: single points of failure.
Streaming access pattern: data is written once and read many times, sequentially; HDFS is optimized for this access pattern on large data sets.
Block: HDFS stores user data in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
NameNode: It is one of the services running in a Hadoop cluster.
It holds the metadata details of the files stored in Hadoop.
The metadata maintained by the NameNode is known as the fsimage.

Secondary NameNode

It is one of the services running in a Hadoop cluster.
It holds a backup of the NameNode's data.
The metadata details in the secondary NameNode are known as the edit log.

JobTracker

It is one of the services running in Hadoop.
It receives the jobs submitted by client applications.
It schedules jobs.
It assigns tasks to the TaskTrackers.

TaskTracker

It is one of the services running in Hadoop.
It receives the tasks it has to perform from the JobTracker.
Each DataNode has one TaskTracker.

DataNode

It is one of the services running in Hadoop.
It holds data in the form of blocks.
By default the block size is 64 MB.

Note: Write once and read any number of times, but do not try to change the contents of a file in HDFS.
Note: The NameNode, secondary NameNode, and JobTracker are called master services, master nodes, or master daemons.
Note: The DataNode and TaskTracker are called slave nodes or slave daemons.

Configuration files in Hadoop

hdfs-site.xml
Settings:
Block size: by default the block size on a DataNode is 64 MB.
Replication factor: 3 by default.
Trash
Heartbeat interval

core-site.xml: It holds the NameNode IP address.

Masters: It holds the secondary NameNode IP address.

Slaves: It holds the IP addresses of all the DataNodes.
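
A minimal sketch of what these files might contain (property names as used in Hadoop 1.x; the host name and values are illustrative, not taken from this cluster):

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>          <!-- 64 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>                 <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>                 <!-- DataNode heartbeat, in seconds -->
  </property>
</configuration>

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>   <!-- NameNode address (hypothetical host) -->
  </property>
</configuration>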

Replica: a backup copy of a block/file. By default the replication factor is 3.

There are three replica states:

Over-replicated: the number of replicas of a block is greater than the replication factor (3).

Under-replicated: the number of replicas of a block is less than the replication factor (3).

Missing replica: no replica of the block is available on any DataNode.
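
For example, the replication factor of an existing file can be changed with -setrep, and fsck reports over-replicated, under-replicated, and missing blocks (the path below is illustrative):

[training@localhost ~]$ hadoop fs -setrep -w 2 /user/training/hdfs/file1
[training@localhost ~]$ hadoop fsck /user/training/hdfs/file1 -blocks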

Scaling: mainly used to improve the performance of the system.

Scaling is classified into two types:

Horizontal scaling: adding a new DataNode to the cluster is known as horizontal scaling.

Vertical scaling: increasing the RAM and hard disk of an existing DataNode is known as vertical scaling.

Different operations in data storage:

HDFS write: writing data into the Hadoop cluster.

HDFS read: reading data from the Hadoop cluster.

HDFS delete: deleting data from the Hadoop cluster.
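
For example (the paths are illustrative; the full command list follows in the HDFS commands section):

[training@localhost ~]$ hadoop fs -put /home/training/Desktop/emp /user/training/hdfs/      (write)
[training@localhost ~]$ hadoop fs -cat /user/training/hdfs/emp                              (read)
[training@localhost ~]$ hadoop fs -rm /user/training/hdfs/emp                               (delete)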

Heartbeat signal: DataNodes send a heartbeat to the NameNode every three seconds by default.

If the NameNode does not receive any heartbeats for a specified time, which is ten minutes by default, it assumes the DataNode is lost.

Block level report: the NameNode creates an internal structure of block-to-file mapping, block-to-DataNode mapping, and DataNode-to-rack mapping.
As part of the block report, DataNodes send which blocks they hold, how many are corrupted, and how many are over/under-replicated.
Based on this block report information, the NameNode builds the above mappings in the fsimage. When a client requests a specific file, the NameNode looks into the fsimage (mapping) and gives the path/DataNodes to the client.
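
The block-to-DataNode mapping of a file can be inspected with fsck (the path is illustrative):

[training@localhost ~]$ hadoop fsck /user/training/hdfs/file1 -files -blocks -locations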

Safemode: a maintenance state of the NameNode in which HDFS is read-only.

hadoop dfsadmin -safemode leave
hadoop dfsadmin -safemode enter
hadoop dfsadmin -safemode get

ifconfig: displays the machine's network interfaces and IP address.

Linux commands (local file system commands):

~ : home directory

1. Present working directory: pwd

2. To list the files in the current directory: ls

3. Long listing of files and directories: ls -l

if an entry starts with - it is a file
if it starts with d it is a directory
if a name starts with . it is a hidden file

4. To display hidden files and directories: ls -a

5. To create a directory: mkdir directoryname

6. To change the directory: cd shaik

7. To create more than one (nested) directory: mkdir -p durga/krish/natrag

8. To change into a nested directory: cd durga/krish/natrag

9. To go to the home directory: cd ~

10. To go to the root directory: cd /

11. To go up to the parent directory: cd ..


Vi editor:
:wq  >> save and exit
:q!  >> exit without saving

Jar file is pending, check once.

Switch user: su <username>
Absolute path and relative path:
Absolute path: giving the entire path (from the root onwards).
Relative path: a path relative to the location where you currently are.
Permissions:

chmod (check once): changes the permissions of a file or directory.
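
For example (the file name is illustrative):

chmod u+x run.sh      (add execute permission for the owner)
chmod 755 run.sh      (rwx for the owner, r-x for group and others)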

HDFS COMMANDS : (Hadoop Distributed File system Commands)


[training@localhost ~]$ start-all.sh
Warning: $HADOOP_HOME is deprecated.

[training@localhost ~]$ stop-all.sh


Warning: $HADOOP_HOME is deprecated.

All HDFS commands start with: hadoop fs


Note: To move or copy from one HDFS location to another use cp and mv (HDFS to HDFS); copyFromLocal, moveFromLocal, and put copy or move files from the local file system to HDFS.
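
For example (the paths are illustrative):

hadoop fs -put /home/training/Desktop/emp /user/training/hdfs/              (local to HDFS)
hadoop fs -copyFromLocal /home/training/Desktop/emp /user/training/hdfs/    (local to HDFS, copy)
hadoop fs -cp /user/training/hdfs/emp /user/training/backup/                (HDFS to HDFS copy)
hadoop fs -mv /user/training/hdfs/emp /user/training/archive/               (HDFS to HDFS move)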

Disk usage:
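
A sketch of checking disk usage in HDFS (the path is illustrative):

hadoop fs -du /user/training/hdfs      (shows the size of each file/directory under the path)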

3. copyToLocal (copy a file or directory from HDFS to the local file system)

hadoop fs -copyToLocal /user/training/hdfs/file1 /home/training/Local

4. moveToLocal (not yet implemented)

5. touchz (creates any number of empty files in HDFS)

hadoop fs -touchz /home/training/hdfs/file1

6. rm (remove a file)
hadoop fs -rm /home/training/hdfs/file1

7. rmr (can be used to remove a file or directory recursively)

hadoop fs -rmr /home/training/hdfs/file

Hive Architecture:

Apache Derby: stores the metadata details about tables.

Problem: one user's tables are not visible to other users.
To avoid this, the metastore database is commonly kept in MySQL.
Create a db: create database <db-name>;
To list dbs: show databases;
Delete a db: drop database <database_name> cascade;
Note: Whenever we create a database, an entry is made under /user/hive/warehouse.

Hive: for processing/analysis of structured data.

Hive uses HQL, which is a subset of SQL.
Hive can store only structured data.
Hive can process huge amounts of data at high speed.
Hive uses the UDF concept to create user-defined functions.

Disadvantages:

Hive occupies space for null values.
Row-level updates are not possible (no delete or update of individual rows).

Hive data types:

Primitive types: int, float, double, and string (as used in the examples below).

Map: key and value pairs.
Array: a collection of elements of the same type.
Struct: a collection of elements of different types.

Tables in Hive are classified into two types.

Managed tables: if we drop a managed table, the table is deleted from Hive as well as from the warehouse directory (/user/hive/warehouse).

Load data: data can be loaded from the local file system or from HDFS.

hive> create database sample;

Creating a table:

hive> create table employee(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ',';

Load data:

hive> load data local inpath '/home/training/Desktop/emp' into table employee;

On the HDFS side:
[training@localhost Desktop]$ hadoop fs -ls /user/hive/warehouse/sample.db
Warning: $HADOOP_HOME is deprecated.

Found 1 items
drwxr-xr-x   - training supergroup          0 2018-01-22 17:47 /user/hive/warehouse/sample.db/employee

[training@localhost Desktop]$ hadoop fs -cat /user/hive/warehouse/sample.db/employee/emp

External tables: if we drop an external table, the table is removed from the Hive metastore only; the underlying data remains in HDFS (under /user/hive/warehouse or the specified location).

External tables are classified into two types:

With location: we create a directory in HDFS, then create the table pointing to that directory, i.e. data loading and table creation are done in a single step.

Without location: the table's data directory defaults to /user/hive/warehouse.

hive> create external table ext_emp(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ','
    > location '/user/training/inpt_hive';

Partitions: used for data segregation and to prevent a full table scan.

Steps (a sketch of these steps follows the list):

1. Create a temporary table.
2. Load data into the temporary table.
3. Create the partitioned table.
4. set hive.exec.dynamic.partition.mode=nonstrict;
5. Insert data into the partitioned table by selecting from the temporary table.
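
A minimal sketch of these steps, using hypothetical table names emp_stage and emp_part and the same emp data file as before:

hive> create table emp_stage(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ',';
hive> load data local inpath '/home/training/Desktop/emp' into table emp_stage;
hive> create table emp_part(eno int, ename string, job string, salary double, comm float)
    > partitioned by (deptno int)
    > row format delimited
    > fields terminated by ',';
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table emp_part partition(deptno)
    > select eno, ename, job, salary, comm, deptno from emp_stage;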

Views:

Creating a view: create view <view_name> as <select statement>;

Sub-view: you can create a view on top of an existing view:
create view <new_view_name> as select <column list> from <existing_view_name>;
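
For example, using the employee table created earlier (emp_view is a hypothetical name):

hive> create view emp_view as select eno, ename, salary from employee;
hive> select * from emp_view;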

Bucketing: an optimization technique.

It is used for join optimization and table sampling.
Here we use "clustered by".
Steps:

1. Create a temporary table.
2. Load data into the temporary table.
3. Create the bucketed table.
4. set hive.enforce.bucketing=true;
5. Insert data into the bucketed table from the temporary table (see the sketch after the create-table statement below).

hive> create table bemp(eno int, ename string, job string, salary double, comm float, deptno int)
    > clustered by (deptno)
    > into 3 buckets
    > row format delimited
    > fields terminated by ',';
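
Continuing the steps above, the bucketed table can be populated from the employee table created earlier (used here as the temporary table):

hive> set hive.enforce.bucketing=true;
hive> insert overwrite table bemp
    > select eno, ename, job, salary, comm, deptno from employee;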

select * from bemp tablesample(70 percent);


select * from bemp tablesample (bucket 1 out of 3);

Index: like a pointer to a particular column (it stores column addresses).

Why it is used: performance is improved.

Indexes are of 2 types:
1. Bitmap
2. Compact

hive> create index emp_index on table t1(eno) as 'COMPACT' with deferred rebuild;
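
Since the index is created with deferred rebuild, it has to be built explicitly before it is used; it can then be listed (t1 is the table name from the statement above):

hive> alter index emp_index on t1 rebuild;
hive> show index on t1;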

Array example:
Ramesh,1,maths$physics$chemistry,a
Suresh,2,bilogy$maths$physices,b

hive> create table array_table(sname string, sid int, sub array<string>, grade string)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$';

select sname,sid,sub[0],sub[1],sub[2],grade from array_table;

Map example: [training@localhost Desktop]$ cat map.csv

ravi,123,salary#50000$comm#3000,10
suresh,456,salary#4000$comm#2000,20

hive> create table map_table(ename string, eid int, paydetails map<string,double>, deptno int)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$'
    > map keys terminated by '#';
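
Individual map values can then be read by key (column names as defined above):

hive> select ename, paydetails['salary'], paydetails['comm'], deptno from map_table;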

Struct example: [training@localhost Desktop]$ cat > struct.csv

ravi,123,456$juhu$mumbai
suresh,345,789$electroniccity$banglore

hive> create table struct_table(ename string, eid int, address struct<hno:int,street:string,city:string>, deptno int)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$';
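
Struct fields are accessed with dot notation, for example:

hive> select ename, address.hno, address.street, address.city from struct_table;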

PIG: it is a second-generation data analysis/processing tool.

It can process structured, unstructured, and semi-structured data.
Pig is session oriented (Pig does not have a warehouse or metastore); like Ghajini, it remembers nothing from previous sessions.

It is built on top of MapReduce.

The Pig engine converts Pig scripts into MapReduce jobs.

Pig can work in two modes:

Local mode (pig -x local): the input and output files are on the Linux (local) file system; here Pig can be installed without Hadoop/MapReduce (install Linux, then install Pig).

MapReduce mode (pig -x mapreduce): the input and output files are on the HDFS file system; we need Linux and Hadoop, and only then Pig.

Pig data types:

int, float, long, double
bytearray, chararray, bag, tuple, and atom

Note: the default data type is bytearray.
A relation can also be defined without a schema, in which case the default data type is taken.
Note: to see the output (the equivalent of a select statement), use DUMP.
Filtering a bag (rows) is done with FILTER; a sketch follows below.

Note: for every command you have to give a bag (relation) name.
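
A minimal sketch of loading, filtering, and dumping a bag in the grunt shell (the path and field names are illustrative, not from this document):

grunt> emp = LOAD '/user/training/emp' USING PigStorage(',')
>>           AS (eno:int, ename:chararray, salary:double, deptno:int);
grunt> dept10 = FILTER emp BY deptno == 10;   -- keep only the rows with deptno 10
grunt> DUMP dept10;                           -- print the filtered bag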
