Hbase PDF
Gkavresis Giorgos 1470
Agenda
What is HBase
Installation
About RDBMS
Overview of HBase
Why HBase instead of RDBMS
Architecture of HBase
HBase interface
Summary
What is HBase
HBase is an open-source, distributed, sorted map modeled after Google's Bigtable
Open source
Apache 2.0 License
Committers and contributors from diverse organizations such as Facebook, Trend Micro, etc.
Installation
Download link
http://www.apache.org/dyn/closer.cgi/hbase/
Before starting it, you might want to edit conf/hbase-site.xml and set hbase.rootdir, the directory you want HBase to write to
Can run in standalone, pseudo-distributed, or fully distributed mode
Start HBase via $ ./bin/start-hbase.sh
About Relational Database Management Systems
Have a lot of limitations
High read and write throughput at the same time is not possible (transactional databases)
Specialized hardware is quite expensive
Background
Google releases paper on Bigtable, 2006
First usable HBase, 2007
HBase becomes an Apache top-level project, 2010
HBase 0.26.5 released.
Overview of HBase
HBase is a part of Hadoop
Apache Hadoop is an open-source system to reliably store and process data across many commodity computers
HBase and Hadoop are written in Java
Hadoop provides:
Fault tolerance
Scalability
Hadoop advantages
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100 GB of image data
Simple SQL queries on >100 TB of clickstream data
Hadoop's components
MapReduce (process)
Fault-tolerant distributed processing
HDFS (store)
Self-healing, high-bandwidth clustered storage
Difference between Hadoop/HDFS and HBase
HDFS is a distributed file system that is well suited for the storage of large files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
HDFS is based on the GFS file system.
HBase is
Distributed: uses HDFS for storage
Column-oriented
Multi-dimensional (versions)
Storage system
HBase is NOT
A SQL database: no joins, no query engine, no data types, no (damn) SQL
No schema
No DBA needed
Storage model
Column-oriented database (column families)
A table consists of rows, each of which has a primary key (row key)
Each row may have any number of columns
The table schema only defines column families (a column family can have any number of columns)
Each cell value has a timestamp
Static columns
int    varchar    int    varchar    int
int    varchar    int    varchar    int
int    varchar    int    varchar    int
Something different
Row1    ColA = Value
        ColB = Value
        ColC = Value
Row2    ColX = Value
        ColY = Value
        ColZ = Value
A big map
Row key + column key + timestamp => value
Row Key    Column Key    Timestamp        Value
           Info:name     1273516197868    Sakis
           Info:age      1273871824184    21
           Info:sex      1273746281432    Male
           Info:name     1273863723227    Themis
           Info:name     1273973134238    Andreas
One more example
Row Key    Data
cutting    Info:  {'height': '9ft', 'state': 'CA'}
           Roles: {'ASF': 'Director', 'Hadoop': 'Founder'}
tlipcon    Info:  {'height': '5ft7', 'state': 'CA'}
           Roles: {'Hadoop': 'Committer' @ts=2010,
                   'Hadoop': 'PMC' @ts=2011,
                   'Hive': 'Contributor'}
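The "big map" idea (row key + column key + timestamp => value) can be sketched with nested sorted maps from the Java standard library. This is only an in-memory illustration, not the HBase client API: the class name BigMapSketch and its methods are hypothetical, and the data comes from the tlipcon row above.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the "big map"; not part of HBase.
class BigMapSketch {
    // row key -> column key -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Newest version of a cell, or null if the cell does not exist.
    String getLatest(String row, String column) {
        NavigableMap<Long, String> versions = versionsOf(row, column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Newest version written at or before the given timestamp, or null.
    String getAsOf(String row, String column, long timestamp) {
        NavigableMap<Long, String> versions = versionsOf(row, column);
        if (versions == null) {
            return null;
        }
        // The map is sorted newest first, so ceilingEntry walks towards older versions.
        Map.Entry<Long, String> entry = versions.ceilingEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    private NavigableMap<Long, String> versionsOf(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    public static void main(String[] args) {
        BigMapSketch map = new BigMapSketch();
        map.put("tlipcon", "Roles:Hadoop", 2010L, "Committer");
        map.put("tlipcon", "Roles:Hadoop", 2011L, "PMC");
        map.put("tlipcon", "Roles:Hive", 2011L, "Contributor");
        System.out.println(map.getLatest("tlipcon", "Roles:Hadoop"));      // PMC
        System.out.println(map.getAsOf("tlipcon", "Roles:Hadoop", 2010L)); // Committer
    }
}
```

Keeping every level sorted is what makes both range scans over row keys and "latest version" lookups cheap in this model.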
Column families
Different sets of columns may have different priorities
CFs are stored separately on disk: access one without wasting I/O on the other.
Configurable by column family
Compression (none, gzip, LZO)
Version retention policies
Cache priority
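A version retention policy can be pictured as a cap on how many timestamped versions a cell keeps. The sketch below is a simplified, hypothetical model (the class VersionedCellSketch and the maxVersions parameter are assumptions for illustration); it trims old versions on every write, which is a simplification and not a description of how HBase itself schedules that cleanup.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of one cell with a "keep at most N versions" policy.
class VersionedCellSketch {
    private final int maxVersions; // assumed per-column-family setting
    // timestamp (newest first) -> value
    private final NavigableMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.reverseOrder());

    VersionedCellSketch(int maxVersions) {
        if (maxVersions < 1) {
            throw new IllegalArgumentException("maxVersions must be at least 1");
        }
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        // Drop the oldest versions once the cap is exceeded.
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = oldest timestamp
        }
    }

    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(2);
        cell.put(2009L, "Contributor");
        cell.put(2010L, "Committer");
        cell.put(2011L, "PMC");
        System.out.println(cell.versionCount()); // 2
        System.out.println(cell.latest());       // PMC
    }
}
```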
HBase vs RDBMS
                  RDBMS                   HBase
Data layout       Row-oriented            Column-oriented (column families)
Query language    SQL                     get/put/scan/etc. *
Security
Data size         TBs                     Hundreds of PBs
Throughput        1000s queries/second
Terms and daemons
Region
A subset of a table's rows
RegionServer (slave)
Serves data for reads and writes
Master
Responsible for coordinating the slaves
Assigns regions, detects failures of RegionServers
Controls some admin functions
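Because a region is a contiguous subset of a table's sorted rows, finding the server for a row key amounts to finding the region whose start key is the closest one at or before that key. The sketch below illustrates this with a TreeMap; the class RegionLookupSketch, the start keys, and the server names rs1, rs2, rs3 are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a region assignment table; not the HBase implementation.
class RegionLookupSketch {
    // region start key -> name of the RegionServer serving that region
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    // Records that the region starting at startKey is served by the given server.
    void assign(String startKey, String server) {
        serverByStartKey.put(startKey, server);
    }

    // Finds the server whose region covers the row key.
    String serverFor(String rowKey) {
        Map.Entry<String, String> region = serverByStartKey.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row key " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch regions = new RegionLookupSketch();
        regions.assign("", "rs1");  // first region: rows before "m"
        regions.assign("m", "rs2"); // rows from "m" up to "t"
        regions.assign("t", "rs3"); // rows from "t" onwards
        System.out.println(regions.serverFor("cutting")); // rs1
        System.out.println(regions.serverFor("row1"));    // rs2
        System.out.println(regions.serverFor("tlipcon")); // rs3
    }
}
```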
Distributed coordination
To manage master election and server availability we use ZooKeeper
Set up as a cluster, provides distributed coordination primitives
An excellent tool for building cluster management systems
HBase architecture
HBase interface
Java
Thrift (Ruby, PHP, Python, Perl, C++, ..)
HBase shell
HBase API
get(row)
put(row, Map<column, value>)
scan(key range, filter)
increment(row, columns)
checkAndPut, delete, etc.
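The operations listed above can be illustrated with a small in-memory table. This is a sketch under stated assumptions, not the real client API: the class TableSketch and its method signatures are hypothetical, values are plain strings, versions are ignored, and the scan takes only a key range (no filter).

```java
import java.util.Collections;
import java.util.Map;
import java.util.Objects;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory table mirroring the listed operations; not HBase code.
class TableSketch {
    // row key -> column key -> value, both levels sorted
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    synchronized void put(String row, Map<String, String> columns) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).putAll(columns);
    }

    // Returns a copy of the row's columns (empty if the row does not exist).
    synchronized Map<String, String> get(String row) {
        TreeMap<String, String> columns = rows.get(row);
        if (columns == null) {
            return new TreeMap<String, String>();
        }
        return new TreeMap<String, String>(columns);
    }

    // Returns the rows with startRow <= key < stopRow, in row-key order.
    synchronized SortedMap<String, TreeMap<String, String>> scan(String startRow, String stopRow) {
        return new TreeMap<String, TreeMap<String, String>>(rows.subMap(startRow, stopRow));
    }

    // Adds amount to a counter column (missing counters start at 0) and returns the new value.
    synchronized long increment(String row, String column, long amount) {
        TreeMap<String, String> columns = rows.computeIfAbsent(row, r -> new TreeMap<>());
        long next = Long.parseLong(columns.getOrDefault(column, "0")) + amount;
        columns.put(column, Long.toString(next));
        return next;
    }

    // Writes newValue only if the current value equals expected (null means "cell absent").
    synchronized boolean checkAndPut(String row, String column, String expected, String newValue) {
        TreeMap<String, String> columns = rows.get(row);
        String current = columns == null ? null : columns.get(column);
        if (!Objects.equals(current, expected)) {
            return false;
        }
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, newValue);
        return true;
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        table.put("row1", Collections.singletonMap("cf:a", "value1"));
        table.put("row2", Collections.singletonMap("cf:b", "value2"));
        table.put("row3", Collections.singletonMap("cf:c", "value3"));
        System.out.println(table.get("row1"));                  // {cf:a=value1}
        System.out.println(table.scan("row1", "row3").keySet()); // [row1, row2]
        System.out.println(table.increment("row1", "cf:hits", 5));
        System.out.println(table.checkAndPut("row1", "cf:a", "value1", "value9"));
    }
}
```

The synchronized methods stand in for the per-row atomicity that makes increment and checkAndPut useful; a real store would not lock the whole table.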
HBase shell
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
HBase shell cont.
hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
HBase in Java
// Load the cluster settings from the HBase 0.19.3 installation
HBaseConfiguration conf = new HBaseConfiguration();
conf.addResource(new Path("/opt/hbase-0.19.3/conf/hbase-site.xml"));
HTable table = new HTable(conf, "test_table");
// Collect the changes for one row, then commit them together
BatchUpdate batchUpdate = new BatchUpdate("test_row1");
batchUpdate.put("columnfamily:column1", Bytes.toBytes("somevalue"));
batchUpdate.delete("column1");
table.commit(batchUpdate);
Get data
Read one column value from a row
Cell cell = table.get("test_row1", "columnfamily1:column1");
To read one row with given columns, use the HTable#getRow() method.
RowResult singleRow = table.getRow(Bytes.toBytes("test_row1"));
A tough Facebook application
Real-time counters of URLs shared, links liked, impressions generated
20 billion events/day (200K events/sec)
~30 sec latency from click to count
Heavy use of the incrementColumnValue API
Tried MySQL, Cassandra, settled on HBase
Use HBase if
You need random write, random read, or both (but not neither)
You need to do many thousands of operations per second on multiple TB of data
Your access patterns are simple
Thank you \../