BDA Unit-5
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use
flexible data models that can adapt to changes in data structures and are capable of scaling
horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
However, NoSQL databases may not be suitable for all applications, as they may not provide
the same level of data consistency and transactional guarantees as traditional relational
databases. It is important to carefully evaluate the specific needs of an application when
choosing a database management system.
NoSQL, originally meaning “non-SQL” or “non-relational,” refers to databases that provide a
mechanism for storing and retrieving data that is modeled by means other than the
tabular relations used in relational databases. Such databases have existed since the late
1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early
twenty-first century. NoSQL databases are used in real-time web applications and big data,
and their use is increasing over time.
NoSQL systems are also sometimes called “Not only SQL” to emphasize the fact that they
may support SQL-like query languages. The motivations for a NoSQL design include simplicity
of design, simpler horizontal scaling to clusters of machines, and finer control over
availability. The data structures used by NoSQL databases are different from those used by
default in relational databases, which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it should solve.
NoSQL databases, also known as “not only SQL” databases, are a new type of database
management system that have gained popularity in recent years. Unlike traditional
relational databases, NoSQL databases are designed to handle large amounts of
unstructured or semi-structured data, and they can accommodate dynamic changes to the
data model. This makes NoSQL databases a good fit for modern web applications, real-
time analytics, and big data processing.
Data structures used by NoSQL databases are sometimes also viewed as more flexible
than relational database tables. Many NoSQL stores compromise consistency in favor of
availability, speed and partition tolerance. Barriers to the greater adoption of NoSQL
stores include the use of low-level query languages, lack of standardized interfaces, and
huge previous investments in existing relational databases.
Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB
have made them central to their designs.
Most NoSQL databases offer eventual consistency: changes are propagated to all nodes
over time, so a query issued shortly after a write might not return the updated data.
Reading data that is out of date in this way is a problem known as a stale read. Some
NoSQL systems may also exhibit lost writes and other forms of data loss, and some provide
mechanisms such as write-ahead logging to avoid data loss.
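To make the idea of a stale read concrete, here is a minimal sketch of one primary node and one replica in which changes are propagated lazily. The `Primary` and `Replica` classes are hypothetical and do not model any particular NoSQL product.

```python
class Replica:
    """A read-only copy that only sees changes after replication runs."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Accepts writes and queues them for later delivery to the replica."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.pending = []  # changes not yet propagated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Propagate queued changes; until this runs, the replica is stale.
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


replica = Replica()
primary = Primary(replica)
primary.write("stock", 10)
primary.replicate()
primary.write("stock", 9)

stale = replica.read("stock")  # replica has not seen the second write yet
primary.replicate()
fresh = replica.read("stock")  # replica has now converged
print(stale, fresh)
```

The read between the second write and the second `replicate()` call returns the old value, which is the stale read described above. Once replication catches up, both nodes agree again.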
One simple example of a NoSQL database is a document database. In a document
database, data is stored in documents rather than tables. Each document can contain a
different set of fields, making it easy to accommodate changing data requirements.
Take, for instance, a database that holds data regarding employees. In a
relational database, this information might be stored in tables, with one table for employee
information and another table for department information. In a document database, each
employee would be stored as a separate document, with all of their information contained
within the document.
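The employee example can be sketched with plain Python structures. This is only an illustration of the two data shapes, not the API of any database. All names and values (`Asha`, `Ravi`, `Engineering`, the `skills` field) are made-up sample data.

```python
import json

# Relational style: two "tables" linked by a department id.
departments = [{"dept_id": 1, "name": "Engineering"}]
employee_rows = [
    {"emp_id": 101, "name": "Asha", "dept_id": 1},
    {"emp_id": 102, "name": "Ravi", "dept_id": 1},
]

# Document style: each employee is one self-contained document.
# The second document carries an extra field the first one lacks.
employee_docs = [
    {"_id": 101, "name": "Asha", "department": {"name": "Engineering"}},
    {"_id": 102, "name": "Ravi", "department": {"name": "Engineering"},
     "skills": ["HBase", "Java"]},
]


def department_of(emp_id):
    """Relational lookup: follow the foreign key (a join done by hand)."""
    row = next(e for e in employee_rows if e["emp_id"] == emp_id)
    dept = next(d for d in departments if d["dept_id"] == row["dept_id"])
    return dept["name"]


doc = next(d for d in employee_docs if d["_id"] == 102)
print(department_of(102))             # needs both tables
print(json.dumps(doc["department"]))  # already inside the document
```

The relational lookup has to consult two tables, while the document already contains the department. The cost is that the department name is repeated in every employee document.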
NoSQL databases are a relatively new type of database management system that have
gained popularity in recent years due to their scalability and flexibility. They are designed
to handle large amounts of unstructured or semi-structured data and can handle dynamic
changes to the data model. This makes NoSQL databases a good fit for modern web
applications, real-time analytics, and big data processing.
Advantages of NoSQL: There are many advantages of working with NoSQL databases such
as MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means
partitioning the data and placing the partitions on multiple machines in such a way that
the order of the data is preserved. Vertical scaling means adding more resources to the
existing machine, whereas horizontal scaling means adding more machines to handle the data.
Vertical scaling is limited by the capacity of a single machine, whereas horizontal scaling
can keep growing by adding machines. Examples of horizontally scaling databases are MongoDB,
Cassandra, etc. NoSQL can handle a huge amount of data because of this scalability: as the
data grows, a NoSQL database scales out to handle that data in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured
data, which means that they can accommodate dynamic changes to the data model. This
makes NoSQL databases a good fit for applications that need to handle changing data
requirements.
3. High availability: The automatic replication feature in NoSQL databases makes them highly
available, because in case of a failure the data can be restored from a replica to the
previous consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle
large amounts of data and traffic with ease. This makes them a good fit for applications
that need to handle large amounts of data or traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic,
which means that they can offer improved performance compared to traditional relational
databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional
relational databases, as they are typically less complex and do not require expensive
hardware or software.
7. Agility: Ideal for agile development.
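The sharding idea from point 1 can be sketched as follows. Note that this sketch uses hash-based placement, which does not preserve the order of the data. Range-based placement would be needed for that. The function names and the `user-N` keys are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and so is not stable across runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard (machine) they would be stored on."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user-{n}" for n in range(1000)]
three = distribute(keys, 3)
four = distribute(keys, 4)  # horizontal scaling: one more machine

print({i: len(v) for i, v in three.items()})
print({i: len(v) for i, v in four.items()})
```

With simple modulo placement, adding a machine changes the shard of many existing keys. Real systems often use consistent hashing or range partitioning to limit that movement.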
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization : There are many different types of NoSQL databases, each
with its own unique strengths and weaknesses. This lack of standardization can make it
difficult to choose the right database for a specific application.
2. Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant, which
means that they do not guarantee the consistency, integrity, and durability of data. This
can be a drawback for applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a narrow focus. They are mainly designed for
storage and provide very little other functionality. Relational databases are a better
choice than NoSQL in the field of transaction management.
4. Open-source: Many NoSQL databases are open-source, and there is no reliable standard for
NoSQL yet. In other words, two database systems are unlikely to be equivalent.
5. Lack of support for complex queries : NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that require
complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure than
traditional databases.
7. Management challenge: The purpose of big data tools is to make the management of a
large amount of data as simple as possible, but this is not so easy in practice. Data
management in NoSQL is much more complex than in a relational database. NoSQL, in
particular, has a reputation for being challenging to install and even harder to manage on
a daily basis.
8. GUI is not available: GUI tools for accessing the database are not widely available in
the market.
9. Backup: Backup is a weak point for some NoSQL databases. MongoDB, for example, is often
cited as lacking a straightforward approach for backing up data in a consistent manner.
10. Large document size: Some database systems like MongoDB and CouchDB store data
in JSON or a JSON-like format. This means that documents can be quite large, which costs
storage, network bandwidth, and speed. Descriptive key names actually hurt here, since they
are repeated in every document and increase the document size.
Types of NoSQL database: The types of NoSQL databases, and examples of the database systems
that fall into each category, are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Tabular (wide-column): Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
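As a rough illustration of how the four categories differ, the same kind of customer data can be written in four shapes. These are plain Python structures with made-up sample values, not the storage format of any of the products listed.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"customer:alice": '{"name": "Alice", "state": "CA"}'}

# Document store: a structured, nested record per key.
document = {"_id": "alice", "name": "Alice", "address": {"state": "CA"}}

# Tabular (wide-column) store: row key -> column family -> column -> value.
wide_column = {
    "alice": {"account": {"name": "Alice"}, "address": {"state": "CA"}},
}

# Graph store: nodes plus explicit edges between them.
graph = {
    "nodes": {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}},
    "edges": [("alice", "KNOWS", "bob")],
}


def neighbours(node):
    """Follow outgoing edges from a node (a graph-style query)."""
    return [dst for src, _label, dst in graph["edges"] if src == node]


print(wide_column["alice"]["address"]["state"])
print(neighbours("alice"))
```

The wide-column shape is the one HBase uses later in these notes, where a cell is addressed by row key, column family, and column.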
In conclusion, NoSQL databases offer several benefits over traditional relational databases,
such as scalability, flexibility, and cost-effectiveness. However, they also have several
drawbacks, such as a lack of standardization, lack of ACID compliance, and lack of support
for complex queries. When choosing a database for a specific application, it is important to
weigh the benefits and drawbacks carefully to determine the best fit.
SQL vs NoSQL
What is SQL?
Structured Query Language, or SQL, is the standard language of table-based relational
databases. Using SQL, users can search, insert, modify, and delete records in the database.
Its use is not limited to these operations: SQL also supports tasks such as the optimization
and administration of the database.
What is NoSQL?
NoSQL is a non-relational database or DBMS without a fixed schema that is easy to scale.
Distributed data stores with very large storage needs call for NoSQL.
Big Data and real-time web apps make use of NoSQL.
The databases in SQL are table-based, while the databases in NoSQL are document, key-value,
graph, or wide-column stores. SQL databases suit multi-row transactions, while NoSQL is better
for unstructured data like documents or JSON. The table below summarizes the differences
between SQL and NoSQL.
| SQL | NoSQL |
| SQL is also pronounced as “S-Q-L” or as “See-Quel” and is primarily known to be a Relational Database | NoSQL is a distributed or Non-relational Database |
| Use of SQL queries and syntax to analyse and get further data insights. Used for OLAP systems | Apply different types of database technologies |
| Database, here is in table format | NoSQL databases are document based with key-value pairs and graph databases. |
| Total focus on ACID (Atomicity, Consistency, Isolation and Durability) properties | Makes use of the Brewer’s CAP theorem (Consistency, Availability and Partition Tolerance) |
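The claim that SQL databases suit multi-row transactions can be illustrated with the `sqlite3` module from the Python standard library. This is a minimal sketch with a made-up `accounts` table. It shows atomicity only, not the full set of ACID properties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount):
    """Move money from alice to bob; both rows change or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # succeeds
failed = transfer(500)  # violates the CHECK on alice, so bob's update is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits bob first and then fails on alice's row, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.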
Installing HBase
We can install HBase in any of the three modes: Standalone mode, Pseudo Distributed mode,
and Fully Distributed mode.
Shift to super user mode and move the HBase folder to /usr/local as shown below.
$su
$password: enter your password here
mv hbase-0.99.1/* HBase/
Before proceeding with HBase, you have to edit the following files and configure HBase.
hbase-env.sh
Set the Java home for HBase: open the hbase-env.sh file from the conf folder, edit the
JAVA_HOME environment variable, and change the existing path to your current JAVA_HOME
value as shown below.
cd /usr/local/HBase/conf
gedit hbase-env.sh
This will open the hbase-env.sh file of HBase. Now replace the existing JAVA_HOME value with
your current value as shown below.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an appropriate location by
opening the HBase home folder in /usr/local/HBase. Inside the conf folder, you will find several
files, open the hbase-site.xml file as shown below.
#cd /usr/local/HBase/
#cd conf
# gedit hbase-site.xml
Inside the hbase-site.xml file, you will find the <configuration> and </configuration> tags.
Within them, set the HBase directory under the property key with the name “hbase.rootdir” as
shown below.
<configuration>
  <!-- Set the path where you want HBase to store its files. -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:/home/hadoop/HBase/HFiles</value>
  </property>
  <!-- Set the path where you want HBase to store its built-in ZooKeeper files. -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper</value>
  </property>
</configuration>
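One way to double-check an edited hbase-site.xml before starting HBase is to parse it and list the properties. The sketch below uses Python's standard XML parser on an inline copy of the two properties from these notes. The helper name `read_properties` is hypothetical and is not part of HBase.

```python
import xml.etree.ElementTree as ET

# Inline copy of the standalone configuration from the notes.
HBASE_SITE = """
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:/home/hadoop/HBase/HFiles</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}


props = read_properties(HBASE_SITE)
for name, value in props.items():
    print(f"{name} = {value}")
```

To check a real file instead of the inline string, `ET.parse(path).getroot()` could replace `ET.fromstring(xml_text)`.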
With this, the HBase installation and configuration part is successfully complete. We can start
HBase by using the start-hbase.sh script provided in the bin folder of HBase. For that, open
the HBase home folder and run the HBase start script as shown below.
$cd /usr/local/HBase/bin
$./start-hbase.sh
If everything goes well, the HBase start script prints a message saying that HBase has
started.
Configuring HBase
Before proceeding with HBase, configure Hadoop and HDFS on your local system or on a
remote system and make sure they are running. Stop HBase if it is running.
hbase-site.xml
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
This property specifies the mode in which HBase should run. In the same file on the local
file system, change hbase.rootdir to point to your HDFS instance address, using the hdfs://
URI syntax. We are running HDFS on the localhost at port 8030.
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8030/hbase</value>
</property>
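The difference between the standalone and the distributed value of hbase.rootdir is only in the URI. A small sketch with Python's `urllib.parse` shows how the two values from these notes break down. The helper name `describe_rootdir` is hypothetical.

```python
from urllib.parse import urlparse


def describe_rootdir(value):
    """Split an hbase.rootdir value into scheme, host, port and path."""
    parts = urlparse(value)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


standalone = describe_rootdir("file:/home/hadoop/HBase/HFiles")
distributed = describe_rootdir("hdfs://localhost:8030/hbase")
print(standalone)
print(distributed)
```

The `file:` value has no host or port because it points at the local file system, while the `hdfs://` value names the host and port of the HDFS instance.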
Starting HBase
After configuration is over, browse to HBase home folder and start HBase using the following
command.
$cd /usr/local/HBase
$bin/start-hbase.sh
Note: Before starting HBase, make sure Hadoop is running.
HBase creates its directory in HDFS. To see the created directory, browse to the Hadoop bin
folder and list the /hbase directory in HDFS. The listing looks like this:
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs
Using “local-master-backup.sh” you can start up to 10 servers. Open the HBase home folder
and execute the following command to start the backup masters.
$ ./bin/local-master-backup.sh start 2 4
To kill a backup master, you need its process id, which is stored in a file
named “/tmp/hbase-USER-X-master.pid”. You can kill the backup master using the following
command.
$ cat /tmp/hbase-user-1-master.pid |xargs kill -9
You can run multiple region servers from a single system using the following command.
$ ./bin/local-regionservers.sh start 2 3
To stop a region server, use the following command.
$ ./bin/local-regionservers.sh stop 3
Starting HBase Shell
After installing HBase successfully, you can start the HBase shell. The steps below are to
be followed in sequence to start it. Open the terminal and log in as super user.
Browse to the Hadoop home sbin folder and start the Hadoop file system as shown below.
$cd $HADOOP_HOME/sbin
$start-all.sh
Start HBase
Browse through the HBase root directory bin folder and start HBase.
$cd /usr/local/HBase
$./bin/start-hbase.sh
Start Region
$./bin/local-regionservers.sh start 3
$cd bin
$./hbase shell
This will give you the HBase Shell Prompt as shown below.
2014-12-09 14:24:27,526 INFO [main] Configuration.deprecation:
hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri
Nov 14 18:26:29 PST 2014
hbase(main):001:0>
To access the web interface of HBase, type the following url in the browser.
http://localhost:60010
This interface lists your currently running Region servers, backup masters and HBase tables.
HBase Tables
Setting Java Environment
We can also communicate with HBase using Java libraries, but before accessing HBase using
Java API you need to set classpath for those libraries.
Before proceeding with programming, set the classpath to HBase libraries in .bashrc file.
Open .bashrc in any of the editors as shown below.
$ gedit ~/.bashrc
Set classpath for HBase libraries (lib folder in HBase) in it as shown below.
This is to prevent the “class not found” exception while accessing the HBase using java API.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is a
part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System. One can store the data in HDFS either directly or through HBase. The
following steps are used to query HBase data in Apache Drill.
Step 1: Prerequisites
Before moving on to querying HBase data, you need to install the following:
Java version 1.7 or greater
Hadoop
HBase
Step 2: Enable Storage Plugin
After successful installation, navigate to the Apache Drill web console and select the
Storage menu option.
Then choose the Enable option for HBase, go to the Update option, and you will see the
configuration shown below.
{
"type": "hbase",
"config": {
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "2181"
},
"size.calculator.enabled": false,
"enabled": true
}
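The two `hbase.zookeeper.*` entries in this plugin configuration tell Drill where to find the ZooKeeper instance that HBase uses. As a small sketch, the configuration can be read with Python's `json` module to derive the `host:port` address. The variable names are hypothetical and nothing here calls Drill itself.

```python
import json

# Inline copy of the storage plugin configuration shown in the notes.
PLUGIN = """
{
  "type": "hbase",
  "config": {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.zookeeper.property.clientPort": "2181"
  },
  "size.calculator.enabled": false,
  "enabled": true
}
"""

plugin = json.loads(PLUGIN)
config = plugin["config"]

# Combine the quorum host and the client port into one address.
zk_address = "{}:{}".format(
    config["hbase.zookeeper.quorum"],
    config["hbase.zookeeper.property.clientPort"],
)
print(plugin["type"], plugin["enabled"], zk_address)
```

Note that the client port is stored as a string in this configuration, not as a number.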
After enabling the plugin, first start your Hadoop server, then start HBase.
After Hadoop and HBase have been started, you can start the HBase interactive shell using the
“hbase shell” command as shown below.
Query
./bin/hbase shell
Then you will see the following prompt.
Result
hbase(main):001:0>
Create a Table
Pipe the following commands to the HBase shell to create a “customers” table.
Query
Create a simple text file named “hbase-customers.txt” with the following contents.
Example
put 'customers','Alice','account:name','Alice'
put 'customers','Alice','address:street','123 Ballmer Av'
put 'customers','Alice','address:zipcode','12345'
put 'customers','Alice','address:state','CA'
put 'customers','Bob','account:name','Bob'
put 'customers','Bob','address:street','1 Infinite Loop'
put 'customers','Bob','address:zipcode','12345'
put 'customers','Bob','address:state','CA'
put 'customers','Frank','account:name','Frank'
put 'customers','Frank','address:street','435 Walker Ct'
put 'customers','Frank','address:zipcode','12345'
put 'customers','Frank','address:state','CA'
put 'customers','Mary','account:name','Mary'
put 'customers','Mary','address:street','56 Southern Pkwy'
put 'customers','Mary','address:zipcode','12345'
put 'customers','Mary','address:state','CA'
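Each `put` line above names a table, a row key, a `family:column` pair, and a value. For larger data sets such a file is usually generated rather than typed. The sketch below builds the same kind of lines for two of the customers. The function `put_commands` is hypothetical, and it assumes that values contain no single quotes, which would need escaping.

```python
# Row key -> {"family:column": value}, taken from the rows in the notes.
customers = {
    "Alice": {
        "account:name": "Alice",
        "address:street": "123 Ballmer Av",
        "address:zipcode": "12345",
        "address:state": "CA",
    },
    "Bob": {
        "account:name": "Bob",
        "address:street": "1 Infinite Loop",
        "address:zipcode": "12345",
        "address:state": "CA",
    },
}


def put_commands(table, rows):
    """Build one HBase shell put line per cell."""
    lines = []
    for row_key, cells in rows.items():
        for column, value in cells.items():
            lines.append(f"put '{table}','{row_key}','{column}','{value}'")
    return lines


lines = put_commands("customers", customers)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a text file would give a file in the same format as hbase-customers.txt.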
Now, issue the following command in hbase shell to load the data into a table.
Query
Query
Now switch to Apache Drill shell and issue the following command.
Result
+------------+---------------------+---------------------------------------------------------------------------+
| row_key    | account             | address                                                                   |
+------------+---------------------+---------------------------------------------------------------------------+
| 416C696365 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY=","zipcode":"MTIzNDU="}     |
| 426F62     | {"name":"Qm9i"}     | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w","zipcode":"MTIzNDU="}     |
| 4672616E6B | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA==","zipcode":"MTIzNDU="}     |
| 4D617279   | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ==","zipcode":"MTIzNDU="} |
+------------+---------------------+---------------------------------------------------------------------------+
Apache Drill fetches the HBase data in a binary format, which we can convert into readable data
using the CONVERT_FROM function available in Drill. Use the following query to get
readable data from Drill.
Query
Result
+--------------+----------------+-----------------+------------------+--------------------+
| customer_id | customers_name | customers_state | customers_street | customers_zipcode |
+--------------+----------------+-----------------+------------------+--------------------+
| Alice | Alice | CA | 123 Ballmer Av | 12345 |
| Bob | Bob | CA | 1 Infinite Loop | 12345 |
| Frank | Frank | CA | 435 Walker Ct | 12345 |
| Mary | Mary | CA | 56 Southern Pkwy | 12345 |
+--------------+----------------+-----------------+------------------+--------------------+
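The relationship between the two result tables can be checked by hand. In the first table the row keys look like hexadecimal text and the cell values like Base64 text, and decoding them as UTF-8 gives the readable values of the second table. The sketch below does this in Python for two of the rows. It is not Drill's CONVERT_FROM function, only an independent check, and the helper names are hypothetical.

```python
import base64

# (row_key, account, address) copied from the first Drill result.
rows = [
    ("416C696365", {"name": "QWxpY2U="},
     {"state": "Q0E=", "zipcode": "MTIzNDU="}),
    ("426F62", {"name": "Qm9i"},
     {"state": "Q0E=", "zipcode": "MTIzNDU="}),
]


def decode_hex(text):
    """Decode a hexadecimal string into UTF-8 text."""
    return bytes.fromhex(text).decode("utf-8")


def decode_b64(text):
    """Decode a Base64 string into UTF-8 text."""
    return base64.b64decode(text).decode("utf-8")


decoded = [
    {
        "customer_id": decode_hex(key),
        "customers_name": decode_b64(account["name"]),
        "customers_state": decode_b64(address["state"]),
        "customers_zipcode": decode_b64(address["zipcode"]),
    }
    for key, account, address in rows
]
for row in decoded:
    print(row)
```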