Slide 6 NoSQL Database and HBase Tutorial
S3Lab
Smart Software System Laboratory
1
“Without big data, you are blind and deaf
and in the middle of a freeway.”
– Geoffrey Moore
Big Data 2
NoSQL
3
Background
● Relational databases have long been the mainstay of business
● Web-based applications caused spikes in load that relational systems handle poorly
● Explosion of social media sites (Facebook, Twitter) with large data needs
● Rise of cloud-based storage such as Amazon S3 (Simple Storage Service)
● Hooking an RDBMS to a web-based application becomes troublesome
4
Big Data
What is NoSQL?
● The name stands for Not Only SQL
● The term NoSQL was introduced by Carlo Strozzi in 1998 to name his file-
based database
● It was later re-introduced by Eric Evans when an event was organized to
discuss open-source distributed databases
○ Eric states that “… but the whole point of seeking alternatives is that you need to solve a
problem that relational databases are a bad fit for. …”
5
Big Data
What is NoSQL?
Key features (Advantages)
● non-relational
● schema-free (no fixed schema required)
● data are replicated to multiple nodes (so identical & fault-tolerant)
and can be partitioned:
○ down nodes are easily replaced
○ no single point of failure
● horizontally scalable
6
Big Data
What is NoSQL?
Key features (Advantages)
7
Big Data
What is NoSQL?
Disadvantages
8
Big Data
Who is using them?
9
Big Data
3 major papers for NoSQL
10
Big Data
The perfect storm
11
Big Data
CAP Theorem
● Consider three properties of a distributed system (sharing data)
○ Consistency:
■ Reads and writes are always executed atomically and are strictly consistent
(linearizable). Put differently, all clients have the same view of the data at all times.
○ Availability:
■ Every non-failing node in the system can always accept read and write requests from
clients and will eventually return a meaningful response, i.e. not an error
message.
○ Partition-tolerance:
■ System properties (consistency and/or availability) hold even when network failures
prevent some machines from communicating with others; the system can continue to
operate in spite of network partitions.
13
Big Data
CAP Theorem
Consistency
14
Big Data
CAP Theorem
Consistency
● A consistency model determines rules for visibility and apparent order of
updates
● Example:
○ Row X is replicated on nodes M and N
○ Client A writes row X to node N
○ Some period of time t elapses
○ Client B reads row X from node M
○ Does client B see the write from client A?
○ Consistency is a continuum with tradeoffs
○ For NoSQL, the answer would be: “maybe”
○ The CAP theorem states: “strong consistency can't be achieved at the same time as
availability and partition-tolerance”
16
Big Data
NoSQL
17
Big Data
NoSQL Categories
18
Big Data
NoSQL Categories
Key-value
● Focus on scaling to huge amounts of data
● Designed to handle massive load
● Based on Amazon’s dynamo paper
● Data model: (global) collection of Key-value pairs
● Dynamo ring partitioning and replication
● Example: (DynamoDB)
○ items have one or more attributes (name, value)
○ an attribute can be single-valued or multi-valued (e.g. a set)
○ items are combined into a table
19
Big Data
NoSQL Categories
Key-value
● Basic API access:
○ get(key): extract the value given a key
○ put(key, value): create or update the value given its key
○ delete(key): remove the key and its associated value
○ execute(key, operation, parameters): invoke an operation to the value (given its key)
which is a special data structure (e.g. List, Set, Map .... etc)
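As a concrete illustration, these four operations map almost one-to-one onto Redis commands (Redis is covered a few slides below; the key names in this session are invented for the example):

```shell
127.0.0.1:6379> SET user:100 "Alice"      # put(key, value) - create or update
OK
127.0.0.1:6379> GET user:100              # get(key)
"Alice"
127.0.0.1:6379> DEL user:100              # delete(key)
(integer) 1
127.0.0.1:6379> LPUSH queue:jobs "job-1"  # execute(key, op, params): operate on a List value
(integer) 1
127.0.0.1:6379> SADD tags:post1 "nosql"   # ... or on a Set value
(integer) 1
```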
20
Big Data
NoSQL Categories
Key-value
● Pros:
○ very fast
○ very scalable (horizontally distributed to nodes based on key)
○ simple data model
○ eventual consistency
○ fault-tolerance
● Cons:
○ Can’t model more complex data structure such as objects
21
Big Data
NoSQL Categories
Key-value
● SimpleDB (Amazon)
○ Data model: set of pairs (key, {attribute}), where an attribute is a pair (name, value)
○ Querying: restricted SQL; Select, Delete, GetAttributes, and PutAttributes operations
● Redis (Salvatore Sanfilippo)
○ Data model: set of pairs (key, value), where the value is a simple typed value, a list, an ordered (ranked) or unordered set, or a hash value
○ Querying: primitive operations for each value type
22
Big Data
NoSQL Categories
Key-value
23
Big Data
NoSQL Categories
Column-based
24
Big Data
NoSQL Categories
Column-based
25
Big Data
NoSQL Categories
Column-based: Keyspace ~ Schema, Column Family ~ Table
26
Big Data
NoSQL Categories
Column-based: Row structure
27
Big Data
NoSQL Categories
Column-based
29
Big Data
NoSQL Categories
Column-based
● Example: a Cassandra column family (timestamps removed for simplicity)
UserProfile = {
  Cassandra = {
    emailAddress: ”[email protected]”,
    age: ”20”
  },
  TerryCho = {
    emailAddress: ”[email protected]”,
    gender: ”male”
  },
  Cath = {
    emailAddress: ”[email protected]”,
    age: ”20”, gender: ”female”, address: ”Seoul”
  }
}
30
Big Data
NoSQL Categories
Column-based
● BigTable (Google)
○ Data model: set of pairs (key, {value})
○ Querying: selection (by combination of row, column, and timestamp ranges)
● HBase (Apache)
○ Data model: groups of columns (a BigTable clone)
○ Querying: JRuby IRB-based shell (similar to SQL)
● Cassandra (Apache, originally Facebook)
○ Data model: columns, groups of columns corresponding to a key (supercolumns)
○ Querying: simple selections on key, range queries, column or column-range queries
● PNUTS (Yahoo)
○ Data model: (hashed or ordered) tables, typed arrays, flexible schema
○ Querying: selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)
31
Big Data
NoSQL Categories
Document-based
32
Big Data
NoSQL Categories
Document-based
33
Big Data
NoSQL Categories
Document-based
34
Big Data
NoSQL Categories
Document-based
35
Big Data
NoSQL Categories
Document-based
● Couchbase (Couchbase)
○ Data model: document as a list of named (structured) items (JSON document)
○ Querying: by key and key range, views via JavaScript and MapReduce
36
Big Data
NoSQL Categories
Graph-based
38
Big Data
NoSQL Categories
Graph-based
39
Big Data
NoSQL Categories
Graph-based
40
Big Data
NoSQL Categories
Comparison
41
Big Data
Conclusion
● NoSQL databases cover only a part of data-intensive cloud applications
(mainly Web applications)
● Problems with cloud computing:
○ SaaS (Software as a Service, or on-demand software) applications require enterprise-
level functionality, including ACID transactions, security, and other features associated
with commercial RDBMS technology, i.e. NoSQL should not be the only option in the
cloud
○ Hybrid solutions:
■ Voldemort with MySQL as one of its storage backends
■ deal with NoSQL data as semi-structured data -> integrating RDBMS and NoSQL via SQL/XML
42
Big Data
Conclusion
● Next generation of highly scalable and elastic RDBMSs: NewSQL
databases (term coined in April 2011)
○ they are designed to scale out horizontally on shared-nothing machines,
○ still provide ACID guarantees,
○ applications interact with the database primarily using SQL,
○ the system employs a lock-free concurrency control scheme so that real-time reads do not conflict with writes,
○ the system provides higher performance than available from the traditional systems.
43
Big Data
Hadoop Ecosystem
44
Big Data
HBase tutorial
45
Hbase tutorial
What is HBase?
• HBase is a distributed, column-oriented database built on top of the Hadoop file system.
• HBase has a data model similar to Google’s Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
• It is part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop file system.
• One can store data in HDFS either directly or through HBase. Data consumers
read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
file system and provides read and write access.
46
Big Data
Hbase tutorial
What is HBase?
47
Big Data
Hbase tutorial
HDFS vs HBase
48
Big Data
Hbase tutorial
What is HBase?
• HBase is a column-oriented database and its tables are sorted by row key. The table
schema defines only column families, which contain key-value pairs.
49
Big Data
Hbase tutorial
Column Oriented and Row Oriented
50
Big Data
Hbase tutorial
HBase and RDBMS
51
Big Data
Hbase tutorial
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
52
Big Data
Hbase tutorial
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as
Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
53
Big Data
Hbase tutorial
54
Hbase tutorial
55
Big Data
Hbase tutorial
• Check that the shell is working before proceeding further. Use the list
command for this purpose: list returns all the
tables in HBase.
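A typical session looks like the following (the emp table name is just an example; output format varies slightly between HBase versions):

```shell
$ hbase shell
hbase(main):001:0> list
TABLE
emp
1 row(s) in 0.0350 seconds
```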
56
Big Data
Hbase tutorial
• Command: status
This command returns the status of the system, including the details of the servers
running on the system. Its syntax is as follows:
• Command: table_help
This command explains what the table-referenced commands are and how to use them.
Given below is the syntax to use this command.
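An illustrative session (the exact output shape depends on the HBase version and cluster size):

```shell
hbase(main):002:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

hbase(main):003:0> table_help
Help for table-reference commands.
...
```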
57
Big Data
Hbase tutorial
Creating a Table using HBase Shell
Command: create '<table name>', '<column family>'
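For example, an emp table with two column families could be created like this (the table and family names are illustrative, and the confirmation output varies by version):

```shell
hbase(main):004:0> create 'emp', 'personal data', 'professional data'
0 row(s) in 1.2340 seconds
=> Hbase::Table - emp
```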
58
Big Data
Hbase tutorial
Creating a Table using HBase Shell
Check the table
59
Big Data
Hbase tutorial
Creating a Table Using java API
61
Big Data
Hbase tutorial
Creating a Table Using java API
62
Big Data
Hbase tutorial
Creating a Table Using java API
63
Big Data
Hbase tutorial
Creating a Table Using java API
64
Big Data
Hbase tutorial
Creating a Table Using java API
65
Big Data
Hbase tutorial
66
Big Data
Hbase tutorial
Creating a Table Using java API
67
Big Data
Hbase tutorial
68
Big Data
Hbase tutorial
Creating a Table Using java API
69
Big Data
Hbase tutorial
Creating a Table Using java API
70
Hbase tutorial
Listing Tables Using Java API
71
Big Data
Hbase tutorial
Listing Tables Using Java API
72
Big Data
Hbase tutorial
Listing Tables Using Java API
73
Big Data
Hbase tutorial
Writing Data to HBase
Sample rows for the emp table (columns: city, education):
● Le: hanoi, MBA
● Phạm: tp hcm, bachelor
74
Big Data
Hbase tutorial
Writing Data to HBase
● Let us insert the first row values into the emp table as shown below
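Assuming the emp table has a 'personal data' column family (as created earlier in this tutorial) and using the sample values from the previous slide, the shell commands would look like this (illustrative):

```shell
hbase(main):006:0> put 'emp', '1', 'personal data:name', 'Le'
hbase(main):007:0> put 'emp', '1', 'personal data:city', 'hanoi'
hbase(main):008:0> put 'emp', '1', 'personal data:education', 'MBA'
```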
75
Big Data
Hbase tutorial
Writing Data to HBase
76
Big Data
Hbase tutorial
Writing Data to HBase
● Copy these lines and paste in the Hbase shell (you can type them if you want)
77
Big Data
Hbase tutorial
Writing Data to HBase
78
Big Data
Hbase tutorial
Writing Data to HBase Using Java API
Step 1: Instantiate the Configuration class
Configuration conf = HBaseConfiguration.create();
Step 2: Instantiate the HTable class
HTable hTable = new HTable(conf, tableName);
Step 3: Instantiate the Put class (this class requires the row key you want to insert the data into, in string format)
Put p = new Put(Bytes.toBytes("row id"));
Step 4: Insert data
p.add(Bytes.toBytes("column family"), Bytes.toBytes("column name"), Bytes.toBytes("value"));
Step 5: Save the data in the table
hTable.put(p);
Step 6: Close the HTable instance
hTable.close();
79
Big Data
Hbase tutorial
Writing Data to HBase Using Java API
80
Big Data
Hbase tutorial
Writing Data to HBase Using Java API
81
Big Data
Hbase tutorial
82
Big Data
Hbase tutorial
83
Big Data
Hbase tutorial
84
Big Data
Hbase tutorial
Writing Data to HBase Using Java API
85
Big Data
Hbase tutorial
Reading Data using HBase Shell
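The two basic read commands are get (one row) and scan (the whole table); the emp table and row id follow the earlier examples:

```shell
hbase(main):009:0> get 'emp', '1'     # read one row by row id
hbase(main):010:0> scan 'emp'         # read all rows in the table
```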
86
Big Data
Hbase tutorial
Reading a Specific Column using HBase Shell
● Command: get '<table name>', '<row id>', {COLUMN => '<column family:column name>'}
87
Big Data
Hbase tutorial
Updating Data using HBase Shell
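HBase has no separate update command: updating is just another put to the same cell, which overwrites the value that a plain get returns (session is illustrative, reusing the emp table from earlier):

```shell
hbase(main):011:0> put 'emp', '1', 'personal data:city', 'da nang'
hbase(main):012:0> get 'emp', '1', {COLUMN => 'personal data:city'}
```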
88
Big Data
Hbase tutorial
Deleting a Specific Cell in a Table
● Command: delete '<table name>', '<row id>', '<column family:column name>', '<time stamp>'
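For example, deleting one cell of the emp table (the timestamp value here is invented for illustration; if the timestamp is omitted, the cell's versions up to the current time are deleted):

```shell
hbase(main):013:0> delete 'emp', '1', 'personal data:city', 1417521848375
```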
89
Big Data
Hbase tutorial
Deleting all the cells in a row
90
Big Data
Hbase tutorial
Deleting a Column Family
91
Big Data
Hbase tutorial
VERSIONS
● Every cell value in HBase is stored with a timestamp; if you do not supply one when putting data, HBase assigns the current server time automatically.
92
Big Data
Hbase tutorial
VERSION
● Doing a put always creates a new version of a cell, at a certain timestamp.
● Default update
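For example (illustrative session, timestamps elided): two puts to the same cell create two versions, and a plain get returns only the newest one:

```shell
hbase(main):014:0> put 'emp', '1', 'personal data:city', 'hanoi'
hbase(main):015:0> put 'emp', '1', 'personal data:city', 'hue'
hbase(main):016:0> get 'emp', '1', {COLUMN => 'personal data:city'}
COLUMN                 CELL
 personal data:city    timestamp=..., value=hue
```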
93
Big Data
Hbase tutorial
Change the maximum number of versions
● Get 2 versions of that cell
94
Big Data
Hbase tutorial
Get all versions of a cell
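By default a column family keeps only a limited number of versions (1 in recent HBase releases, 3 in older ones); to keep and retrieve more, raise the family's VERSIONS attribute (table and family names follow the earlier examples):

```shell
hbase(main):017:0> alter 'emp', NAME => 'personal data', VERSIONS => 5
hbase(main):018:0> get 'emp', '1', {COLUMN => 'personal data:city', VERSIONS => 5}
```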
95
Big Data
Hbase tutorial
Update a specific version
Let’s get all versions of the cell
96
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase
97
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase
● Navigate to the HBase directory
98
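One standard way to load a CSV from HDFS is the ImportTsv MapReduce tool shipped with HBase; the HDFS path, table name, and column mapping below are placeholders and must match your own file and table:

```shell
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=',' \
    -Dimporttsv.columns='HBASE_ROW_KEY,personal:name,personal:city' \
    emp /user/hadoop/input/employees.csv
```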
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase
99
Big Data
Hbase tutorial
Load data from Hive to HBase
100
Big Data
Hbase tutorial
Create HBase-Hive Mapping table
• Command: create table hbase_student (id int,name string,course string,age int) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,personal:name,personal:course,additional:age") TBLPROPERTIES
("hbase.table.name" = "studen_hbase");
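Once the mapping table exists, rows can be copied into it from an ordinary Hive table, and they land in the underlying HBase table; the source table name student and its column list are assumed here for illustration:

```sql
INSERT OVERWRITE TABLE hbase_student
SELECT id, name, course, age FROM student;
```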
101
Big Data
Hbase tutorial
Load data from Hive to HBase
● Check if the new table has been created in Hbase, then check its schema
102
Big Data
Hbase tutorial
Load data from Hive to HBase
103
Big Data
Hbase tutorial
Load data from Hive to HBase
104
Big Data
Hbase tutorial
Load CSV file from HDFS to HBase
105
Big Data
Hbase tutorial
Dropping a Table using HBase Shell
● Using the drop command, you can delete a table. Before dropping a table, you have to
disable it.
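A typical session (the table name is illustrative):

```shell
hbase(main):019:0> disable 'emp'
hbase(main):020:0> drop 'emp'
hbase(main):021:0> exists 'emp'
Table emp does not exist
```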
106
Big Data
Hbase tutorial
HFile stored in HDFS
107
Big Data
Hbase tutorial
HFile stored in HDFS
● Navigate to /hbase/data/default
108
Big Data
Hbase tutorial
HFile stored in HDFS
109
Big Data
Hbase tutorial
HFile stored in HDFS
Important note: different column families are stored separately. When you query a row, the region server will
have to gather data from multiple places (which will slow down your system).
110
Big Data
Hbase tutorial
HFile stored in HDFS
● Check the content of the HFile (shown in binary and text format)