This document discusses data storage and processing using the Hadoop Distributed File System (HDFS). It defines HDFS components such as the NameNode, DataNode, JobTracker, and TaskTracker, explains different data types, big data, and the limitations of RDBMS. HDFS uses blocks and replication for reliability; configuration files define parameters such as block size; HDFS operations include read, write, and delete; and HDFS is optimized for streaming access patterns on large data sets. The notes also cover Linux and HDFS shell commands, Hive (databases, tables, partitions, bucketing, views, indexes, and complex types), and the basics of Pig.


Bismilla 786

Data is the new oil


90% of the data that exists today was generated in the last two years.
Around 2.5 billion GB of data is generated every day.
Reasons:
1. Advances in technology
2. Increased internet usage (Facebook, Twitter, Gmail, etc.)
3. Decrease in storage cost
4. Increase in online shopping
5. Increase in online banking
Data can be categorized into 3 types:
1. Structured data: comes from RDBMS systems.
2. Unstructured data: comes from audio, video, pictures, etc.
3. Semi-structured data: comes from JSON, XML, and text data.
What is Big Data: a huge amount of data (structured, unstructured, or semi-structured) that cannot be handled using an RDBMS is called Big Data.
Different RDBMS tools used for data storage:
1. Oracle
2. MySQL
3. DB2, etc.
Drawbacks of RDBMS:
1. Storage is one of the main drawbacks.
2. Only structured data can be processed.
3. Speed is low when storing and processing TBs of data.
4. If updates are made in an RDBMS, only the last updated value is stored; previous values are not retained.
Data storage: to store data in HDFS we need a NameNode and DataNodes.
Data processing: to process data we need a JobTracker and TaskTrackers.
Metadata: data about data.
File system: a system used to store files.
Daemon: a background service that has no physical (user-facing) appearance.
NameNode and JobTracker: single points of failure.
Streaming access pattern: data is written once and read many times, sequentially; HDFS is optimized for this access pattern on large data sets.
Block: HDFS stores user data in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
NameNode: It is one of the services running in a Hadoop cluster.
It holds the metadata details of the files stored in Hadoop.
The metadata maintained by the NameNode is known as the fsimage.

Secondary NameNode

It is one of the services running in a Hadoop cluster.
It holds a backup of the NameNode's data.
The metadata details in the secondary NameNode are known as the edit log.

JobTracker

It is one of the services running in Hadoop.
It receives the jobs submitted by client applications.
It schedules jobs.
It assigns tasks to the TaskTrackers.

TaskTracker

It is one of the services running in Hadoop.
It receives the tasks it has to perform from the JobTracker.
Each DataNode has one TaskTracker.

DataNode

It is one of the services running in Hadoop.
It holds data in the form of blocks.
By default the block size is 64 MB.

Note: Write once and read any number of times, but do not try to change the contents of a file in HDFS.
Note: The NameNode, secondary NameNode, and JobTracker are called master services, master nodes, or master daemons.
Note: The DataNode and TaskTracker are called slave nodes or slave daemons.

Configuration files in Hadoop

hdfs-site.xml
Settings:
Block size: by default the block size on a DataNode is 64 MB.
Replication factor: 3 by default.
Trash
Heartbeat interval

core-site.xml: It holds the NameNode IP address.

Masters: It holds the secondary NameNode IP address.

Slaves: It holds the IP addresses of all the DataNodes.
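
A minimal sketch of what these files might contain (property names as used in Hadoop 1.x; the host name and values are illustrative, not taken from this cluster):

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>          <!-- 64 MB block size -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>                 <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>                 <!-- DataNode heartbeat, in seconds -->
  </property>
</configuration>

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>   <!-- NameNode address (hypothetical host) -->
  </property>
</configuration>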

Replica: a backup copy of a block/file. By default the replication factor is 3.

There are three replica states:

Over-replicated: the number of replicas of a block is greater than the replication factor (3).

Under-replicated: the number of replicas of a block is less than the replication factor (3).

Missing replica: no replica of the block is available on any DataNode.
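
For example, the replication factor of an existing file can be changed with -setrep, and fsck reports over-replicated, under-replicated, and missing blocks (the path below is illustrative):

[training@localhost ~]$ hadoop fs -setrep -w 2 /user/training/hdfs/file1
[training@localhost ~]$ hadoop fsck /user/training/hdfs/file1 -blocks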

Scaling: mainly used to improve the performance of the system.

Scaling is classified into two types:

Horizontal scaling: adding a new DataNode to the cluster is known as horizontal scaling.

Vertical scaling: increasing the RAM and hard disk of an existing DataNode is known as vertical scaling.

Different operations in data storage:

HDFS write: writing data into the Hadoop cluster.

HDFS read: reading data from the Hadoop cluster.

HDFS delete: deleting data from the Hadoop cluster.
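
For example (the paths are illustrative; the full command list follows in the HDFS commands section):

[training@localhost ~]$ hadoop fs -put /home/training/Desktop/emp /user/training/hdfs/      (write)
[training@localhost ~]$ hadoop fs -cat /user/training/hdfs/emp                              (read)
[training@localhost ~]$ hadoop fs -rm /user/training/hdfs/emp                               (delete)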

Heartbeat signal: DataNodes send a heartbeat to the NameNode every three seconds by default.

If the NameNode does not receive any heartbeats for a specified time, which is ten minutes by default, it assumes the DataNode is lost.

Block level report: the NameNode creates an internal structure of block-to-file mapping, block-to-DataNode mapping, and DataNode-to-rack mapping.
As part of the block report, DataNodes send which blocks they hold, how many are corrupted, and how many are over/under-replicated.
Based on this block report information, the NameNode builds the above mappings in the fsimage. When a client requests a specific file, the NameNode looks into the fsimage (mapping) and gives the path/DataNodes to the client.
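
The block-to-DataNode mapping of a file can be inspected with fsck (the path is illustrative):

[training@localhost ~]$ hadoop fsck /user/training/hdfs/file1 -files -blocks -locations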

Safemode: a maintenance state of the NameNode in which HDFS is read-only.

hadoop dfsadmin -safemode leave
hadoop dfsadmin -safemode enter
hadoop dfsadmin -safemode get

ifconfig: displays the machine's network interfaces and IP address.

Linux commands (local file system commands):

~ : home directory

1. Present working directory: pwd

2. To list the files in the current directory: ls

3. Long listing of files and directories: ls -l

if an entry starts with - it is a file
if it starts with d it is a directory
if a name starts with . it is a hidden file

4. To display hidden files and directories: ls -a

5. To create a directory: mkdir directoryname

6. To change the directory: cd shaik

7. To create more than one (nested) directory: mkdir -p durga/krish/natrag

8. To change into a nested directory: cd durga/krish/natrag

9. To go to the home directory: cd ~

10. To go to the root directory: cd /

11. To go up to the parent directory: cd ..


Vi editor:
:wq  >> save and exit
:q!  >> exit without saving

Jar file is pending, check once.

Switch user: su <username>
Absolute path and relative path:
Absolute path: giving the entire path (from the root onwards).
Relative path: a path relative to the location where you currently are.
Permissions:

chmod (check once): changes the permissions of a file or directory.
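
For example (the file name is illustrative):

chmod u+x run.sh      (add execute permission for the owner)
chmod 755 run.sh      (rwx for the owner, r-x for group and others)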

HDFS COMMANDS : (Hadoop Distributed File system Commands)


[training@localhost ~]$ start-all.sh
Warning: $HADOOP_HOME is deprecated.

[training@localhost ~]$ stop-all.sh


Warning: $HADOOP_HOME is deprecated.

All HDFS commands start with: hadoop fs


Note: To move or copy from one HDFS location to another use cp and mv (HDFS to HDFS); copyFromLocal, moveFromLocal, and put copy or move files from the local file system to HDFS.
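
For example (the paths are illustrative):

hadoop fs -put /home/training/Desktop/emp /user/training/hdfs/              (local to HDFS)
hadoop fs -copyFromLocal /home/training/Desktop/emp /user/training/hdfs/    (local to HDFS, copy)
hadoop fs -cp /user/training/hdfs/emp /user/training/backup/                (HDFS to HDFS copy)
hadoop fs -mv /user/training/hdfs/emp /user/training/archive/               (HDFS to HDFS move)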

Disk usage:
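
A sketch of checking disk usage in HDFS (the path is illustrative):

hadoop fs -du /user/training/hdfs      (shows the size of each file/directory under the path)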

3. copyToLocal (copy a file or directory from HDFS to the local file system)

hadoop fs -copyToLocal /user/training/hdfs/file1 /home/training/Local

4. moveToLocal (not yet implemented)

5. touchz (creates any number of empty files in HDFS)

hadoop fs -touchz /home/training/hdfs/file1

6. rm (remove a file)
hadoop fs -rm /home/training/hdfs/file1

7. rmr (can be used to remove a file or directory recursively)

hadoop fs -rmr /home/training/hdfs/file

Hive Architecture:

Apache Derby: stores the metadata details about tables.

Problem: one user's tables are not visible to other users.
To avoid this, the metastore database is commonly kept in MySQL.
Create a db: create database <db-name>;
To list dbs: show databases;
Delete a db: drop database <database_name> cascade;
Note: Whenever we create a database, an entry is made under /user/hive/warehouse.

Hive: for processing/analysis of structured data.

Hive uses HQL, which is a subset of SQL.
Hive can store only structured data.
Hive can process huge amounts of data at high speed.
Hive uses the UDF concept to create user-defined functions.

Disadvantages:

Hive occupies space for null values.
Row-level updates are not possible (no delete or update of individual rows).

Hive data types:

Primitive types: int, float, double, and string (as used in the examples below).

Map: key and value pairs.
Array: a collection of elements of the same type.
Struct: a collection of elements of different types.

Tables in Hive are classified into two types.

Managed tables: if we drop a managed table, the table is deleted from Hive as well as from the warehouse directory (/user/hive/warehouse).

Load data: data can be loaded from the local file system or from HDFS.

hive> create database sample;

Creating a table:

hive> create table employee(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ',';

Load data:

hive> load data local inpath '/home/training/Desktop/emp' into table employee;

On the HDFS side:
[training@localhost Desktop]$ hadoop fs -ls /user/hive/warehouse/sample.db
Warning: $HADOOP_HOME is deprecated.

Found 1 items
drwxr-xr-x   - training supergroup          0 2018-01-22 17:47 /user/hive/warehouse/sample.db/employee

[training@localhost Desktop]$ hadoop fs -cat /user/hive/warehouse/sample.db/employee/emp

External tables: if we drop an external table, the table is removed from the Hive metastore only; the underlying data remains in HDFS (under /user/hive/warehouse or the specified location).

External tables are classified into two types:

With location: we create a directory in HDFS, then create the table pointing to that directory, i.e. data loading and table creation are done in a single step.

Without location: the table's data directory defaults to /user/hive/warehouse.

hive> create external table ext_emp(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ','
    > location '/user/training/inpt_hive';

Partitions: used for data segregation and to prevent a full table scan.

Steps (a sketch of these steps follows the list):

1. Create a temporary table.
2. Load data into the temporary table.
3. Create the partitioned table.
4. set hive.exec.dynamic.partition.mode=nonstrict;
5. Insert data into the partitioned table by selecting from the temporary table.
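
A minimal sketch of these steps, using hypothetical table names emp_stage and emp_part and the same emp data file as before:

hive> create table emp_stage(eno int, ename string, job string, salary double, comm float, deptno int)
    > row format delimited
    > fields terminated by ',';
hive> load data local inpath '/home/training/Desktop/emp' into table emp_stage;
hive> create table emp_part(eno int, ename string, job string, salary double, comm float)
    > partitioned by (deptno int)
    > row format delimited
    > fields terminated by ',';
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table emp_part partition(deptno)
    > select eno, ename, job, salary, comm, deptno from emp_stage;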

Views:

Creating a view: create view <view_name> as <select statement>;

Sub-view: you can create a view on top of an existing view:
create view <new_view_name> as select <column list> from <existing_view_name>;
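
For example, using the employee table created earlier (emp_view is a hypothetical name):

hive> create view emp_view as select eno, ename, salary from employee;
hive> select * from emp_view;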

Bucketing: an optimization technique.

It is used for join optimization and table sampling.
Here we use "clustered by".
Steps:

1. Create a temporary table.
2. Load data into the temporary table.
3. Create the bucketed table.
4. set hive.enforce.bucketing=true;
5. Insert data into the bucketed table from the temporary table (see the sketch after the create-table statement below).

hive> create table bemp(eno int, ename string, job string, salary double, comm float, deptno int)
    > clustered by (deptno)
    > into 3 buckets
    > row format delimited
    > fields terminated by ',';
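
Continuing the steps above, the bucketed table can be populated from the employee table created earlier (used here as the temporary table):

hive> set hive.enforce.bucketing=true;
hive> insert overwrite table bemp
    > select eno, ename, job, salary, comm, deptno from employee;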

select * from bemp tablesample(70 percent);


select * from bemp tablesample (bucket 1 out of 3);

Index: like a pointer to a particular column (it stores column addresses).

Why it is used: performance is improved.

Indexes are of 2 types:
1. Bitmap
2. Compact

hive> create index emp_index on table t1(eno) as 'COMPACT' with deferred rebuild;
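
Since the index is created with deferred rebuild, it has to be built explicitly before it is used; it can then be listed (t1 is the table name from the statement above):

hive> alter index emp_index on t1 rebuild;
hive> show index on t1;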

Array example:
Ramesh,1,maths$physics$chemistry,a
Suresh,2,bilogy$maths$physices,b

hive> create table array_table(sname string, sid int, sub array<string>, grade string)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$';

select sname,sid,sub[0],sub[1],sub[2],grade from array_table;

Map example: [training@localhost Desktop]$ cat map.csv

ravi,123,salary#50000$comm#3000,10
suresh,456,salary#4000$comm#2000,20

hive> create table map_table(ename string, eid int, paydetails map<string,double>, deptno int)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$'
    > map keys terminated by '#';
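
Individual map values can then be read by key (column names as defined above):

hive> select ename, paydetails['salary'], paydetails['comm'], deptno from map_table;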

Struct example: [training@localhost Desktop]$ cat > struct.csv

ravi,123,456$juhu$mumbai
suresh,345,789$electroniccity$banglore

hive> create table struct_table(ename string, eid int, address struct<hno:int,street:string,city:string>, deptno int)
    > row format delimited
    > fields terminated by ','
    > collection items terminated by '$';
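
Struct fields are accessed with dot notation, for example:

hive> select ename, address.hno, address.street, address.city from struct_table;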

PIG: it is a second-generation data analysis/processing tool.

It can process structured, unstructured, and semi-structured data.
Pig is session oriented (Pig does not have a warehouse or metastore); like Ghajini, it remembers nothing from previous sessions.

It is built on top of MapReduce.

The Pig engine converts Pig scripts into MapReduce jobs.

Pig can work in two modes:

Local mode (pig -x local): the input and output files are on the Linux (local) file system; here Pig can be installed without Hadoop/MapReduce (install Linux, then install Pig).

MapReduce mode (pig -x mapreduce): the input and output files are on the HDFS file system; we need Linux and Hadoop, and only then Pig.

Pig data types:

int, float, long, double
bytearray, chararray, bag, tuple, and atom

Note: the default data type is bytearray.
A relation can also be defined without a schema, in which case the default data type is taken.
Note: to see the output (the equivalent of a select statement), use DUMP.
Filtering a bag (rows) is done with FILTER; a sketch follows below.

Note: for every command you have to give a bag (relation) name.
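
A minimal sketch of loading, filtering, and dumping a bag in the grunt shell (the path and field names are illustrative, not from this document):

grunt> emp = LOAD '/user/training/emp' USING PigStorage(',')
>>           AS (eno:int, ename:chararray, salary:double, deptno:int);
grunt> dept10 = FILTER emp BY deptno == 10;   -- keep only the rows with deptno 10
grunt> DUMP dept10;                           -- print the filtered bag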
