
Big Data

NoSQL database and HBase tutorial


Trong-Hop Do

S3Lab
Smart Software System Laboratory

1
“Without big data, you are blind and deaf
and in the middle of a freeway.”
– Geoffrey Moore

Big Data 2
NoSQL

3
Background
● Relational databases have been the mainstay of business
● Web-based applications caused spikes in load
● Explosion of social media sites (Facebook, Twitter) with large data needs
● Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
● Hooking an RDBMS to a web-based application becomes troublesome

4
Big Data
What is NoSQL?
● This name stands for Not Only SQL
● The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-
based database
● It was again re-introduced by Eric Evans when an event was organized to
discuss open source distributed databases
○ Eric states that “… but the whole point of seeking alternatives is that you need to solve a
problem that relational databases are a bad fit for. …”

5
Big Data
What is NoSQL?
Key features (Advantages)

● non-relational
● don’t require a schema
● data are replicated to multiple nodes (so, identical & fault-tolerant)
and can be partitioned:
○ down nodes easily replaced
○ no single point of failure

● horizontally scalable

6
Big Data
What is NoSQL?
Key features (Advantages)

● cheap, easy to implement (open-source)


● massive write performance
● fast key-value access

7
Big Data
What is NoSQL?
Disadvantages

● Don’t fully support relational features


○ no join, group by, order by operations (except within partitions)
○ no referential integrity constraints across partitions

● No declarative query language (e.g., SQL) → more programming


● Relaxed ACID (see CAP theorem) → fewer guarantees
● No easy integration with other applications that support SQL

8
Big Data
Who is using them?

9
Big Data
3 major papers for NoSQL

● Three major papers were the “seeds” of the NOSQL movement:


○ BigTable (Google)
○ Dynamo (Amazon)
■ Ring partition and replication
■ Gossip protocol (discovery and error detection)
■ Distributed key-value data stores
■ Eventual consistency
○ CAP Theorem

10
Big Data
The perfect storm

● Large datasets, acceptance of alternatives, and dynamically-typed data
have come together in a “perfect storm”
● Not a backlash against RDBMS
● SQL is a rich query language that cannot be rivaled by the current list of
NOSQL offerings

11
Big Data
CAP Theorem
● Suppose three properties of a distributed system (sharing data)
○ Consistency:
■ Reads and writes are always executed atomically and are strictly consistent
(linearizable). Put differently, all clients have the same view on the data at all times.
○ Availability:
■ Every non-failing node in the system can always accept read and write requests by
clients and will eventually return with a meaningful response, i.e. not with an error
message.
○ Partition-tolerance:
■ system properties (consistency and/or availability) hold even when network failures
prevent some machines from communicating with others, i.e. the system can continue
to operate in the presence of network partitions

12
Big Data


CAP Theorem

● Brewer’s CAP Theorem:


○ For any system sharing data, it is “impossible” to guarantee simultaneously all of these
three properties
○ You can have at most two of these three properties for any shared-data system

● Very large systems will “partition” at some point:


○ That leaves either C or A to choose from (traditional DBMSs prefer C over A and P)
○ In almost all cases, you would choose A over C (except in specific applications such as
order processing)

13
Big Data
CAP Theorem
Consistency

14
Big Data
CAP Theorem
Consistency

● Have 2 types of consistency:


○ Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
○ Weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)

15
Big Data
CAP Theorem
Consistency
● A consistency model determines rules for visibility and apparent order of
updates
● Example:
○ Row X is replicated on nodes M and N
○ Client A writes row X to node N
○ Some period of time t elapses
○ Client B reads row X from node M
○ Does client B see the write from client A?
○ Consistency is a continuum with tradeoffs
○ For NOSQL, the answer would be: “maybe”
○ CAP theorem states: “strong consistency can't be achieved at the same time as
availability and partition-tolerance”
16
Big Data
NoSQL

● “No-schema” is a common characteristic of most NOSQL storage systems


● Provide “flexible” data types
● Query languages other than, or in addition to, SQL
● Distributed – horizontal scaling

● Less structured data


● Supports big data

17
Big Data
NoSQL Categories

18
Big Data
NoSQL Categories
Key-value
● Focus on scaling to huge amounts of data
● Designed to handle massive load
● Based on Amazon’s Dynamo paper
● Data model: (global) collection of Key-value pairs
● Dynamo ring partitioning and replication
● Example: (DynamoDB)
○ Items have one or more attributes (name, value)
○ An attribute can be single-valued or multi-valued (like a set)
○ Items are combined into a table

19
Big Data
NoSQL Categories
Key-value
● Basic API access:
○ get(key): extract the value given a key
○ put(key, value): create or update the value given its key
○ delete(key): remove the key and its associated value
○ execute(key, operation, parameters): invoke an operation on the value (given its key),
which is a special data structure (e.g. List, Set, Map, etc.)
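As an illustration, these four operations can be captured in a small Java interface (a hypothetical sketch, not the API of any particular store):

import java.util.List;

// Hypothetical generic key-value store interface covering the four basic operations.
public interface KeyValueStore<K, V> {
    V get(K key);              // extract the value given a key
    void put(K key, V value);  // create or update the value given its key
    void delete(K key);        // remove the key and its associated value
    // invoke an operation on the value, e.g. append to a value that is a List
    Object execute(K key, String operation, List<Object> parameters);
}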

20
Big Data
NoSQL Categories
Key-value
● Pros:
○ very fast
○ very scalable (horizontally distributed to nodes based on key)
○ simple data model
○ eventual consistency
○ fault-tolerance
● Cons:
○ Can’t model more complex data structures such as objects

21
Big Data
NoSQL Categories
Key-value
Name | Producer | Data model | Querying
SimpleDB | Amazon | set of couples (key, {attribute}), where attribute is a couple (name, value) | restricted SQL; Select, Delete, GetAttributes, and PutAttributes operations
Redis | Salvatore Sanfilippo | set of couples (key, value), where value is a simple typed value, a list, an ordered (according to ranking) or unordered set, or a hash value | primitive operations for each value type
Dynamo | Amazon | like SimpleDB | simple get operation and put in a context
Voldemort | LinkedIn | like SimpleDB | similar to Dynamo

22
Big Data
NoSQL Categories
Key-value

23
Big Data
NoSQL Categories
Column-based

● Based on Google’s BigTable paper


● Like column oriented relational databases (store data in column order) but
with a twist
● Tables are similar to RDBMS tables, but handle semi-structured data
● Data model:
○ Collection of Column Families
○ Column family = (key, value) where value = set of related columns (standard, super)
○ indexed by row key, column key and timestamp

24
Big Data
NoSQL Categories
Column-based

25
Big Data
NoSQL Categories
Column-based: Keyspace ~ Schema, Column Family ~ Table

26
Big Data
NoSQL Categories
Column-based: Row structure

27
Big Data
NoSQL Categories
Column-based

● One column family can have variable numbers of columns


● Cells within a column family are sorted “physically”
● Very sparse, most cells have null values
● Comparison: RDBMS vs column-based NOSQL
○ Query on multiple tables
■ RDBMS: must fetch data from several places on disk and glue together
■ Column-based NOSQL: only fetch the column families of those columns that are required
by a query (all columns in a column family are stored together on disk, so multiple
rows can be retrieved in one read operation → data locality)
28
Big Data
NoSQL Categories
Column-based

29
Big Data
NoSQL Categories
Column-based
● Example: (Cassandra column family--timestamps removed for simplicity)

UserProfile = {
  Cassandra = {
    emailAddress: "[email protected]",
    age: "20"
  },
  TerryCho = {
    emailAddress: "[email protected]",
    gender: "male"
  },
  Cath = {
    emailAddress: "[email protected]",
    age: "20", gender: "female", address: "Seoul"
  }
}

30
Big Data
NoSQL Categories
Column-based
Name | Producer | Data model | Querying
BigTable | Google | set of couples (key, {value}) | selection (by combination of row, column, and timestamp ranges)
HBase | Apache | groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | like BigTable | HQL (Hypertable Query Language)
CASSANDRA | Apache (originally Facebook) | columns, groups of columns corresponding to a key (supercolumns) | simple selections on key, range queries, column or column ranges
PNUTS | Yahoo | (hashed or ordered) tables, typed arrays, flexible schema | selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)
31
Big Data
NoSQL Categories
Document-based

● Can model more complex objects


● Inspired by Lotus Notes
● Data model: collection of documents
● Document: JSON (JavaScript Object Notation: a data model of key-value pairs that
supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with
nesting), XML, and other semi-structured formats.

32
Big Data
NoSQL Categories
Document-based

● Example: (MongoDB) document


{
  Name: "Jaroslav",
  Address: "Malostranske nám. 25, 118 00 Praha 1",
  Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1"},
  Phones: ["123-456-7890", "234-567-8963"]
}

33
Big Data
NoSQL Categories
Document-based

34
Big Data
NoSQL Categories
Document-based

35
Big Data
NoSQL Categories
Document-based
Name | Producer | Data model | Querying
MongoDB | 10gen | object-structured documents stored in collections; each object has a primary key called ObjectId | manipulation of objects in collections (find an object or objects via simple selections and logical expressions, delete, update)
Couchbase | Couchbase | document as a list of named (structured) items (JSON document) | by key and key range, views via JavaScript and MapReduce

36
Big Data
NoSQL Categories
Graph-based

● Focus on modeling the structure of data (interconnectivity)


● A graph is composed of two elements: a node and a relationship.
● Scales to the complexity of data
● Inspired by mathematical graph theory (G = (V, E))
● Data model:
○ (Property Graph) nodes and edges
■ Nodes may have properties (including ID)
■ Edges may have labels or roles
○ Key-value pairs on both (see the sketch below)

37
Big Data
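A property graph can be sketched as two small Java classes (hypothetical, for illustration only):

import java.util.HashMap;
import java.util.Map;

// Minimal property-graph sketch: nodes and edges both carry key-value properties.
class Node {
    final long id;                                           // nodes may have an ID
    final Map<String, Object> properties = new HashMap<>();  // key-value pairs
    Node(long id) { this.id = id; }
}

class Edge {
    final Node from, to;
    final String label;                                      // edges may have labels or roles
    final Map<String, Object> properties = new HashMap<>();
    Edge(Node from, Node to, String label) {
        this.from = from; this.to = to; this.label = label;
    }
}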
NoSQL Categories
Graph-based
● Interfaces and query languages vary
● Single-step vs path expressions vs full recursion
● Example:
○ Neo4j, FlockDB, Pregel, InfoGrid …

38
Big Data
NoSQL Categories
Graph-based

39
Big Data
NoSQL Categories
Graph-based

40
Big Data
NoSQL Categories
Comparison

41
Big Data
Conclusion
● NOSQL databases cover only a part of data-intensive cloud applications
(mainly Web applications)
● Problems with cloud computing:
○ SaaS (Software as a Service, or on-demand software) applications require enterprise-
level functionality, including ACID transactions, security, and other features associated
with commercial RDBMS technology, i.e. NOSQL should not be the only option in the
cloud
○ Hybrid solutions:
■ Voldemort with MySQL as one of the storage backends
■ deal with NOSQL data as semi-structured data → integrating RDBMS and NOSQL
via SQL/XML

42
Big Data
Conclusion
● next generation of highly scalable and elastic RDBMS: NewSQL
databases (from April 2011)
○ they are designed to scale out horizontally on shared nothing machines,
○ still provide ACID guarantees,
○ applications interact with the database primarily using SQL,
○ the system employs a lock-free concurrency control scheme so that reads do not block writes,
○ the system provides higher performance than available from the traditional systems.

● Examples: MySQL Cluster (most mature solution), VoltDB, Clustrix, ScalArc, etc.

43
Big Data
Hadoop Ecosystem

44
Big Data
HBase tutorial

45
Hbase tutorial
What is HBase?
• HBase is a distributed column-oriented database built on top of the Hadoop file system.
• HBase has a data model similar to Google’s BigTable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
• One can store data in HDFS either directly or through HBase. Data consumers
read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.

46
Big Data
Hbase tutorial
What is HBase?

47
Big Data
Hbase tutorial
HDFS vs HBase

48
Big Data
Hbase tutorial
What is HBase?
• HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key value pairs.

• Table is a collection of rows.


• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
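This layered model is often pictured as nested sorted maps. A sketch of that mental model in Java (illustrative only, not an HBase API):

import java.util.NavigableMap;

// HBase's logical data model viewed as nested sorted maps:
// row key -> column family -> column qualifier -> timestamp -> cell value
public class HBaseLogicalModel {
    NavigableMap<String,                              // row key
        NavigableMap<String,                          // column family
            NavigableMap<String,                      // column (qualifier)
                NavigableMap<Long, byte[]>>>> table;  // timestamp -> value
}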

49
Big Data
Hbase tutorial
Column Oriented and Row Oriented

50
Big Data
Hbase tutorial
HBase and RDBMS

51
Big Data
Hbase tutorial
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.

52
Big Data
Hbase tutorial
Where to Use HBase
• Apache HBase is used when you need random, real-time read/write access to Big Data.

• It hosts very large tables on top of clusters of commodity hardware.

• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts
upon the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.

53
Big Data
Hbase tutorial

54
Hbase tutorial

● Accessing HBase by using the HBase Shell


● Command: hbase shell

55
Big Data
Hbase tutorial

• Check that the shell is functioning before proceeding further. Use the list
command for this purpose. list is the command used to get the list of all
the tables in HBase.

56
Big Data
Hbase tutorial

• Command: status
This command returns the status of the system including the details of the servers
running on the system. Its syntax is as follows:

• Command: table_help
This command provides guidance on what table-referenced commands exist and how to
use them. Given below is the syntax to use this command.
57
Big Data
Hbase tutorial
Creating a Table using HBase Shell
Command: create '<table name>', '<column family>'

Verify that the table was created using the list command

58
Big Data
Hbase tutorial
Creating a Table using HBase Shell
Check the table

59
Big Data
Hbase tutorial
Creating a Table Using java API

● Create Java Project

61
Big Data
Hbase tutorial
Creating a Table Using java API

● Add External JARs

62
Big Data
Hbase tutorial
Creating a Table Using java API

● Add all .jar files in /usr/lib/hbase

63
Big Data
Hbase tutorial
Creating a Table Using java API

● Add all .jar files in /usr/lib/hbase/lib

64
Big Data
Hbase tutorial
Creating a Table Using java API

● Add all .jar files in /usr/lib/hadoop

65
Big Data
Hbase tutorial

● Add all .jar files in /usr/lib/hadoop/client

66
Big Data
Hbase tutorial
Creating a Table Using java API

● Create new .java file and run it
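The code itself appears only as a screenshot in the slides. A minimal version of such a file, written against the classic HBaseAdmin API of that era, might look like the sketch below; the table name employee matches the later slides, while the column family name personal is an assumption:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTable {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath (hence the JARs added above)
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("employee"));
        tableDescriptor.addFamily(new HColumnDescriptor("personal")); // column family (assumed name)
        admin.createTable(tableDescriptor);
        System.out.println("Table created");
        admin.close();
    }
}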

67
Big Data
Hbase tutorial

● The console output should be like this

68
Big Data
Hbase tutorial
Creating a Table Using java API

● Check if the table employee has been created

69
Big Data
Hbase tutorial
Creating a Table Using java API

70
Hbase tutorial
Listing Tables Using Java API

● Create new Java file in the same project

71
Big Data
Hbase tutorial
Listing Tables Using Java API

● Paste the code and execute it
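Again the code is shown only as a screenshot; a minimal listing program in the same style might look like this (a sketch, using the classic HBaseAdmin API):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListTables {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // listTables() returns the descriptors of all user tables
        for (HTableDescriptor td : admin.listTables()) {
            System.out.println(td.getNameAsString());
        }
        admin.close();
    }
}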

72
Big Data
Hbase tutorial
Listing Tables Using Java API

● The console output should be like this

73
Big Data
Hbase tutorial
Writing Data to HBase

name | city | education
Le | hanoi | MBA
Phạm | tp hcm | bachelor
Tran | vung tau | bachelor

74
Big Data
Hbase tutorial
Writing Data to HBase

● Let us insert the first row values into the emp table as shown below

75
Big Data
Hbase tutorial
Writing Data to HBase

● Check the content of the table

76
Big Data
Hbase tutorial
Writing Data to HBase

● Copy these lines and paste them into the HBase shell (you can type them if you want)

77
Big Data
Hbase tutorial
Writing Data to HBase

● Check the content of the table

78
Big Data
Hbase tutorial
Writing Data to HBase Using Java API
Step 1: Instantiate the Configuration class
Configuration conf = HBaseConfiguration.create();
Step 2: Instantiate the HTable class
HTable hTable = new HTable(conf, tableName);
Step 3: Instantiate the Put class
// requires the row id you want to insert the data into, in string format
Put p = new Put(Bytes.toBytes("row id"));
Step 4: Insert data
p.add(Bytes.toBytes("column family"), Bytes.toBytes("column name"), Bytes.toBytes("value"));
Step 5: Save the data in the table
hTable.put(p);
Step 6: Close the HTable instance
hTable.close();
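Put together, the six steps form a small program like the following sketch (the table name emp comes from the shell example earlier; the column family personal and the column/value are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertData {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();   // step 1
        HTable hTable = new HTable(conf, "emp");            // step 2
        Put p = new Put(Bytes.toBytes("row4"));             // step 3: row id
        p.add(Bytes.toBytes("personal"),                    // step 4: family (assumed)
              Bytes.toBytes("name"), Bytes.toBytes("Nguyen"));
        hTable.put(p);                                      // step 5
        hTable.close();                                     // step 6
    }
}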

79
Big Data
Hbase tutorial
Writing Data to HBase Using Java API

● Create new .java file

80
Big Data
Hbase tutorial
Writing Data to HBase Using Java API

● Paste the code, save, and run it

81
Big Data
Hbase tutorial

82
Big Data
Hbase tutorial

● Check the result

83
Big Data
Hbase tutorial

● Edit the code

Before: Put p = new Put(Bytes.toBytes("row4"));
After:  Put p = new Put(Bytes.toBytes("4"));

● Then run the code

84
Big Data
Hbase tutorial
Writing Data to HBase Using Java API

● Scan the table to see the result.


● In the table, "4" and "row4" are different rows

85
Big Data
Hbase tutorial
Reading Data using HBase Shell

● Command: get '<table name>', '<row id>'

86
Big Data
Hbase tutorial
Reading a Specific Column using HBase Shell
● Command: get '<table name>', '<row id>', {COLUMN => '<column family:column name>'}

87
Big Data
Hbase tutorial
Updating Data using HBase Shell

● Command: put '<table name>', '<row id>', '<column family:column name>', '<new value>'

88
Big Data
Hbase tutorial
Deleting a Specific Cell in a Table

● Command: delete '<table name>', '<row id>', '<column name>', '<time stamp>'

89
Big Data
Hbase tutorial
Deleting all the cells in a row

● Command: deleteall '<table name>', '<row id>'

90
Big Data
Hbase tutorial
Deleting a Column Family

● Command: alter '<table name>', 'delete' => '<column family>'
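The same reads and deletes are available through the Java client. A sketch in the style of the earlier examples (the table emp, family personal, column name, and row id are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadAndDelete {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable hTable = new HTable(conf, "emp");
        // Read one cell: equivalent of get 'emp', '1', {COLUMN => 'personal:name'}
        Get g = new Get(Bytes.toBytes("1"));
        g.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        Result result = hTable.get(g);
        byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        System.out.println("name = " + Bytes.toString(value));
        // Delete a whole row: equivalent of deleteall 'emp', '1'
        Delete d = new Delete(Bytes.toBytes("1"));
        hTable.delete(d);
        hTable.close();
    }
}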

91
Big Data
Hbase tutorial
VERSION
● When you put data into HBase, a timestamp is required.

● The timestamp can be generated automatically by the RegionServer or can be supplied by you.
● The timestamp must be unique per version of a given cell, because the timestamp identifies
the version.
● To modify a previous version of a cell, for instance, you would issue a Put with a different
value for the data itself, but the same timestamp.

Command: put '<table name>', '<row id>', '<column family:column name>', '<new value>', timestamp

92
Big Data
Hbase tutorial
VERSION
● Doing a put always creates a new version of a cell, at a certain timestamp.

● Default update

93
Big Data
Hbase tutorial
Change the maximum number of versions
● Get 2 versions of that cell

● We receive only the latest version of that cell (which is ‘CEO’)


● The reason is that the maximum number of versions defaults to 1
● Use the alter command to change the maximum number of versions of that column family
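After raising the limit (e.g. something like alter 'emp', NAME => 'personal', VERSIONS => 5 in the shell), multiple versions can also be fetched from the Java client. A sketch in the style of the earlier examples; the table, family, and column names are assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetVersions {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable hTable = new HTable(conf, "emp");
        Get g = new Get(Bytes.toBytes("1"));
        g.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("designation"));
        g.setMaxVersions(2);                    // ask for up to 2 versions of the cell
        Result result = hTable.get(g);
        for (Cell cell : result.listCells()) {  // newest version first
            System.out.println(cell.getTimestamp() + " -> "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
        }
        hTable.close();
    }
}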

94
Big Data
Hbase tutorial
Get all versions of a cell

● Let’s try again to get 2 versions of that cell

95
Big Data
Hbase tutorial
Update a specific version
Let’s get all versions of the cell

Command: put '<table name>', '<row id>', '<column family:column name>', '<new value>', timestamp

96
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase

● Create a csv file (e.g. using gedit)

● Put the file to HDFS

97
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase
● Navigate to the HBase directory

● Command: hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns= …

Important: no space before or after ',' when listing the columns; with a space, the command won't run
98
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase

● Scan the table in hbase shell

99
Big Data
Hbase tutorial
Load data from Hive to HBase

● Check the Hive table student

100
Big Data
Hbase tutorial
Create HBase-Hive Mapping table

• Create another Hive table which actually points to an HBase table

• hbase_student is the name of the Hive table
• studen_hbase is the name of the HBase table linked to the Hive table above

• Command: create table hbase_student (id int,name string,course string,age int) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,personal:name,personal:course,additional:age") TBLPROPERTIES
("hbase.table.name" = "studen_hbase");

101
Big Data
Hbase tutorial
Load data from Hive to HBase

● Check if the new table has been created in HBase, then check its schema

102
Big Data
Hbase tutorial
Load data from Hive to HBase

● Migrate Hive table data to HBase
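The migration itself is typically a plain HiveQL insert from the source table into the mapping table; assuming the student table from the earlier slide, something like: insert overwrite table hbase_student select id, name, course, age from student;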

103
Big Data
Hbase tutorial
Load data from Hive to HBase

● Check the table in Hive

104
Big Data
Hbase tutorial
Load data from Hive to HBase

● Check the Hbase table

105
Big Data
Hbase tutorial
Dropping a Table using HBase Shell
● Using the drop command, you can delete a table. Before dropping a table, you have to
disable it.

106
Big Data
Hbase tutorial
Hfile stored in HDFS

● Open HUE and use the File Browser

107
Big Data
Hbase tutorial
Hfile stored in HDFS

● Navigate to /hbase/data/default

108
Big Data
Hbase tutorial
Hfile stored in HDFS

109
Big Data
Hbase tutorial
Hfile stored in HDFS

Important note: different column families are stored separately. When you query a row, the region server will
have to grab data from multiple places (which will slow down your system).
110
Big Data
Hbase tutorial
Hfile stored in HDFS

● Check the content of the Hfile (shown in binary and text format)

111
Big Data
