Bda Unit-5 PDF

Unit 5 covers topics related to NoSQL databases including key-value stores, document stores, column stores, graph databases, and characteristics of NoSQL databases. It discusses advantages of NoSQL such as flexibility to handle diverse data, ease of distribution and scaling, and relaxed consistency requirements compared to relational databases. Some disadvantages are lack of transactions and weaker security. NewSQL aims to provide scalability of NoSQL with consistency of SQL databases.

Uploaded by

Harry

Unit 5

Topics
• Introduction to NoSQL Databases

• Introduction to Hive
Relational Databases
• A relational database stores data in a structured format, using rows and columns. This makes it easy to locate and access specific values within the database.
• “Relation” is sometimes used to refer to a table in a relational database, but more commonly refers to the relation between the different elements of a row.
• The relation is defined in a “schema”, which is the logical definition of a table.
• The data is said to be structured.
• The data usually consists of simple types such as integers, strings, and floats.
Relational Database Management System (RDBMS)
• Software that manages a collection (database) of tables.
• Designed to support multi-user access.
• Transactional processing: designed for random access to table elements for updates, as opposed to batch processing.
Features of Relational Databases
• Records are organized into tables

• Rows of tables are identified by unique keys

• Data spans multiple tables, which are linked by join operations

• Transactions are ACID-compliant


Structured Query Language (SQL)
• Structured Query Language (SQL) is a standard database language used to create, maintain, and retrieve data from relational databases.

• Much more compact and expressive than programs written in general-purpose programming languages such as C++ or Java, but only for tabular data stores.

Note: SQL has proven to be a very effective language for relational databases, which is why it is so closely associated with RDBMS systems. But there is no rule which states that SQL has to be used for an RDBMS.
Denormalization
• Avoid joins
• Expand the number of columns
• Design the table to include related data
• Query a single table
• Improves read performance
• Introduces the possibility of data anomalies
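The trade-off in the bullets above can be sketched in a few lines of Python. This is only an illustration: the customer and order records, and the function names, are hypothetical and do not come from the slides.

```python
# Normalized layout: two "tables" linked by a key (hypothetical sample data).
customers = {1: {"name": "Asha", "city": "Pune"}}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 250},
    {"order_id": 101, "customer_id": 1, "amount": 400},
]


def read_with_join(order_id):
    """Normalized read: find the order, then look up its customer (a join)."""
    order = next(o for o in orders if o["order_id"] == order_id)
    customer = customers[order["customer_id"]]
    return {**order, "name": customer["name"], "city": customer["city"]}


# Denormalized layout: the customer columns are copied into every order row.
orders_wide = [read_with_join(o["order_id"]) for o in orders]


def read_denormalized(order_id):
    """Denormalized read: a single-table lookup, no join needed."""
    return next(o for o in orders_wide if o["order_id"] == order_id)


# Updating only one copy leaves the duplicated data inconsistent (an anomaly).
orders_wide[0]["city"] = "Mumbai"
cities = {o["city"] for o in orders_wide if o["customer_id"] == 1}
```

After the last two lines, the same customer appears with two different cities, which is the kind of data anomaly the final bullet warns about.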
Creating shards
• Breaking up a database and storing pieces of the database on different servers.
• Uses multiple database instances
• Each instance stores a subset of the data
• Queries are served from a subset of the shards
• Improves read and write performance
• Complex to implement and manage
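A minimal sketch of how rows might be routed to shards is shown here. It assumes three shards and a simple modulo rule on a numeric key; real systems often use other schemes (ranges, consistent hashing), and the names `shard_for`, `put`, and `get` are made up for this example.

```python
NUM_SHARDS = 3  # assumed number of database instances

# Each dict stands in for one database instance holding a subset of the data.
shards = [{} for _ in range(NUM_SHARDS)]


def shard_for(emp_id):
    """Pick the shard that owns this key (simple modulo routing)."""
    return emp_id % NUM_SHARDS


def put(emp_id, row):
    """A write touches only the shard that owns the key."""
    shards[shard_for(emp_id)][emp_id] = row


def get(emp_id):
    """A read is sent to one shard instead of scanning all the data."""
    return shards[shard_for(emp_id)].get(emp_id)


put(36554, {"name": "Alok Nath", "country": "India"})
put(36553, {"name": "Arun Thomas", "country": "India"})
put(36555, {"name": "Geeta Rao", "country": "India"})
```

Because every read and write touches a single shard, the load is spread across servers. The complexity comes from everything this sketch leaves out, such as rebalancing when a shard is added.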
Replication
• Makes copies of tables and indexes
• Copies are stored on different servers
• Any copy may be used to answer a query
• Improves read performance
• Possibility of inconsistency
Not Only SQL (NoSQL) Databases: History
• RDBMS were found unsuitable for handling the unstructured data generated by the proliferation of the internet.
• Unstructured data includes web pages, images, audio clips, videos, and documents (pdf, csv, text).
• There is a need to mine the data, hence a need to store and manipulate it in an efficient and organized manner.
• It is difficult to scale an RDBMS on clusters.
• All of the above gave rise to NoSQL databases around the year 2000.
• Origins in Google’s BigTable and Amazon’s SimpleDB.

• Note: The “SQL” in the name “NoSQL” does not imply that this category of databases does not or cannot use SQL as the query language.
What is NoSQL?
• Non-relational data storage systems
• No fixed table schema
• No joins
• No multi-document transactions
• Relaxes one or more ACID properties

NoSQL Database Types
• Key-Value Store.

• Document Store.

• Column Store.

• Graph databases
Key-Value Pair Store
• Key is unique.
• Value can be anything, including a document, an image, etc.
• The DBMS typically does not know anything about the contents of the “value”.
• But the database might allow storage of metadata about the values.
• Application: online shopping information (user, user preferences)
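The properties above can be sketched with a tiny in-memory store. The class `KeyValueStore` and its methods are hypothetical and serve only to illustrate the idea: values are opaque bytes, and metadata is kept alongside them.

```python
class KeyValueStore:
    """Toy key-value store: unique keys, opaque values, optional metadata."""

    def __init__(self):
        self._values = {}
        self._metadata = {}

    def put(self, key, value, **metadata):
        # The store never inspects `value`; it is just bytes to it.
        self._values[key] = value
        self._metadata[key] = metadata

    def get(self, key):
        return self._values[key]

    def metadata(self, key):
        return self._metadata[key]


store = KeyValueStore()
# Application example from the slide: (user, user preferences).
store.put("user:42", b'{"theme": "dark"}', content_type="application/json")
preferences = store.get("user:42")
```

Interpreting the bytes (here, JSON) is left entirely to the application, which is what distinguishes this model from a document store.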
Document Store
• Pairs each key with a complex data structure known as a document
• Documents can contain many different key-value pairs, key-array pairs, or even nested documents
• Support for embedded documents
• Consumes more space compared to its counterparts
• MongoDB is an example of this type
• A collection contains many documents
• Each document can contain diverse and heterogeneous fields.
https://beginnersbook.com/2017/09/mapping-relational-databases-to-mongodb/
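As an illustration of the bullets above, a collection can be modelled as a list of Python dictionaries. The documents and the helper `find` are hypothetical; this is not the MongoDB API.

```python
# A "collection" of documents; each document may have different fields.
collection = [
    {
        "_id": 1,
        "name": "Alok",
        "emails": ["alok@example.com"],                   # key-array pair
        "address": {"city": "Pune", "country": "India"},  # nested document
    },
    {"_id": 2, "name": "Susan", "phone": "555-0100"},     # different fields
]


def find(docs, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(collection, name="Susan")
```

Unlike a key-value store, the database can see inside each document, so it can filter on fields such as `name` even though the two documents do not share a schema.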
Graph stores
• Used to store information about networks of data, such as social networking connections
• Graph stores include Neo4j.
• Not well suited to every kind of problem
• Best suited for connected data
Wide Column Stores
• Store columns of data together instead of rows
• Cassandra and HBase are optimized for queries over large datasets
• Excellent for lookups on a single field
• Lookups on other fields are not supported
• Columns are not fixed
Types of NoSQL

Key-value data store | Column-oriented data store | Document data store | Graph data store
Riak | Cassandra | MongoDB | InfiniteGraph
Redis | HBase | CouchDB | Neo4j
Membase | HyperTable | RavenDB | AllegroGraph
NoSQL Vendors

Company | Product | Most widely used by
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | BigTable | Adobe Photoshop

NoSQL Characteristics
Advantages of NoSQL
• Cheap and easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn’t require a pre-defined schema
• Data can be replicated to multiple nodes and can be partitioned
BASE Properties

• Basically available: The database appears to work most of the time (even if some nodes fail, or packets are dropped). This has to do with the “AP” of CAP.
• Soft state: State changes even without input (to provide eventual consistency).
• Eventual consistency: Stores exhibit consistency at some later point.
Soft state and eventual consistency both have to do with the “C” in CAP.

BASE relaxes the consistency guarantee of CAP in favor of availability and partition tolerance. NoSQL databases strive to satisfy the BASE properties.

The BASE model is a flexible alternative to the ACID model for databases that don't require strict adherence to a relational model. BASE is found acceptable for data such as customer shopping data, whereas ACID is required for data such as banking data.
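Eventual consistency can be illustrated with a toy replicated store. This is a simplified sketch under stated assumptions: one replica accepts writes, propagation happens only when `sync()` is called, and all class and method names are hypothetical.

```python
class EventuallyConsistentStore:
    """Toy store: writes land on one replica and spread to the rest later."""

    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Any replica may answer, so a read can return stale data.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Background propagation: state changes without any new client input.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:7", "book")
stale = store.read("cart:7", replica_index=2)  # not propagated yet
store.sync()
fresh = store.read("cart:7", replica_index=2)  # consistent after sync
```

The store stays available for reads on every replica throughout, at the cost of a window in which replicas disagree. That is tolerable for a shopping cart but not for a bank balance.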
NoSQL Pros and Cons
Pros
• Handles the diverse kinds of data generated by the proliferation of the internet. Flexible.
• Designed to scale.
• Easier to maintain.
Cons
• Not mature.
• Does not provide the same level of guarantees (ACID properties) as RDBMS systems.
• Not transactional.
• Less secure.
• Not designed for typical business intelligence applications.
SQL vs. NoSQL
SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide column store, or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage, as it stores data as key-value pairs similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer’s CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
NewSQL
The goal is to provide the scalability and flexibility of NoSQL databases and the consistency of SQL databases.

Characteristics of NewSQL
• SQL interface for application interaction
• ACID support for transactions
• An architecture that provides higher per-node performance vis-a-vis traditional RDBMS solutions
• Scale-out, shared-nothing architecture
• Non-locking concurrency control mechanism, so that real-time reads will not conflict with writes
SQL vs. NoSQL vs. NewSQL
Feature | SQL | NoSQL | NewSQL
Adherence to ACID properties | Yes | No | Yes
OLTP/OLAP | Yes | No | Yes
Schema rigidity | Yes | No | Maybe
Adherence to data model | Adherence to relational model | |
Data format flexibility | No | Yes | Maybe
Scalability | Scale up (vertical scaling) | Scale out (horizontal scaling) | Scale out
Distributed computing | Yes | Yes | Yes
Community support | Huge | Growing | Slowly growing
Introduction to Hive
History of Hive

• Hive 0.10: Batch; read-only data; HiveQL; MR
• Hive 0.13: Interactive; read-only data; substantial SQL; MR, Tez
• Hive 0.14: Transactions with ACID semantics; cost-based optimizer; SQL temporary tables; MR, Tez, Spark

Enterprise SQL at Hadoop Scale

Hive: a Data Warehousing Tool
When to Use Hive
Metastore in Hive
• The Metastore stores information about the tables, their partitions, and the columns within the tables.
• The Metastore can be deployed in three modes:
• Embedded Mode
• Local Mode
• Remote Mode
Embedded Metastore

• In this mode, the Metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by local disk. This mode requires the least configuration but supports only one session at a time, so it is not suited for production.
Local Metastore

• In this mode, the Metastore service runs in the same JVM as the Hive service, but the Metastore database runs in a separate process.
Remote Metastore

• In this mode, the Metastore service runs in its own JVM. This brings better manageability and security because the database tier can be completely firewalled off, and the clients no longer need the database credentials. The Metastore service communicates with the database over JDBC. Hadoop ecosystem software can communicate with Hive using the Thrift service.
Database: namespaces that separate tables and other data units

SQL | HiveQL
Insert values row by row | Insertion of bulk data (not a single row at a time)
Update command is used | Update command cannot be used
Delete command is used to delete a row or column | Delete command cannot be used


Hive Query Language (HiveQL)
• It is HiveQL and not HQL.
• Based on SQL.
• Does not strictly follow the full SQL-92 standard.
• HiveQL offers extensions not in SQL, including multi-table inserts.
• Limited support for various SQL operations such as subqueries.
• Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are executed on a distributed cluster.
Hive Query Language (HiveQL)
HiveQL can be used to:
1. Create and manage tables and partitions.
2. Support various relational, arithmetic, and logical operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory, or the result of queries to an HDFS directory.
5. Use the large number of functions defined in Hive, categorized as mathematical, statistical, string, date, conditional, aggregate, and so on.
We can retrieve the list of functions in the Hive shell with:
hive> SHOW FUNCTIONS;
Data Definition Language
• Build and modify the tables & other objects in the database
• Create/Drop/Alter Database
• Create/Drop/Truncate Table
• Alter Table/Partition/Column
• Create/Drop/Alter View
• Create/Drop/Alter Index
• Show
• Describe
Data Manipulation Language
• Used to retrieve, store, modify, delete, and update data in the database
Database
• To create a database named “STUDENTS” with comments and database properties.

CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');

• To describe a database.

DESCRIBE DATABASE STUDENTS;

• To drop a database.

DROP DATABASE STUDENTS;

Internal versus External Tables
Internal Table (Managed Table) | External Table (Self-Managed Table)
Table data is stored in the Hive-managed HDFS store. | Table data is not managed by Hive and is stored outside the warehouse.
Dropping the table deletes the table metadata and data. | Dropping the table deletes the metadata but not the data.
Default for CREATE TABLE. | The EXTERNAL keyword is used; a location needs to be specified.
One file is referred to by one table only. | One file can be referred to by any number of tables; external references are by location.
To create a managed table named ‘STUDENT’

CREATE TABLE IF NOT EXISTS student (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

To create an external table named ‘EXT_STUDENT’

CREATE EXTERNAL TABLE IF NOT EXISTS ext_student (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
To load data into the table from a file named student.tsv.

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
OVERWRITE INTO TABLE ext_student;

To retrieve the student details from the “EXT_STUDENT” table.

SELECT * FROM ext_student;

Partitioning - Preliminaries
Big Table:
Employee Name | Employee ID | Country
Alok Nath | 36554 | India
Arun Thomas | 36553 | India
Geeta Rao | 36555 | India
Susan Phillips | 71222 | UK
John Chambers | 71225 | UK

• Break the table into smaller parts based on a key (“Country” in this case) and store the parts in separate units (which can be files).
• If data is required from only one part (say India), access will be faster.
• Breaking the table into too many small parts degrades performance, e.g. partitioning by employee ID.
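The idea can be sketched in Python using the employee rows from the table above. The helper `partition_by` is a hypothetical name for this illustration and is not part of Hive.

```python
from collections import defaultdict

# Rows from the "Big Table" above: (name, employee ID, country).
employees = [
    ("Alok Nath", 36554, "India"),
    ("Arun Thomas", 36553, "India"),
    ("Geeta Rao", 36555, "India"),
    ("Susan Phillips", 71222, "UK"),
    ("John Chambers", 71225, "UK"),
]


def partition_by(rows, key_index):
    """Group rows into separate units, one per distinct key value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key_index]].append(row)
    return dict(parts)


by_country = partition_by(employees, 2)  # 2 partitions: India, UK
by_emp_id = partition_by(employees, 1)   # 5 partitions, one row each

# A query for India now reads one partition instead of the whole table.
india_rows = by_country["India"]
```

Partitioning by country gives two useful chunks, while partitioning by employee ID yields one tiny partition per row, which is the degenerate case the last bullet warns about.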
Hashing - Preliminaries
Big Table:
Employee Name | Employee ID | Country
Alok Nath | 36554 | India
Arun Thomas | 36553 | India
Geeta Rao | 36555 | India
Susan Phillips | 71222 | UK
John Chambers | 71225 | UK
Liam Neeson | 80162 | Ireland
Milo O’Shea | 80233 | Ireland

Problem: we need to “partition” by empid, but do not want one partition per key, since that leads to too many partitions.

Solution:
- Map each key to a number, e.g. (empid modulo 2).
- 36554 mod 2 = 0
- 36553 mod 2 = 1
- Even empid: empid mod 2 = 0
- Odd empid: empid mod 2 = 1
- Partition by the above number.
- Two partitions are generated for the above numbers, one with odd empid, the other with even empid.
- Generating a partition number using a function on a key is called hashing.
- Partitioning based on hashing is called hashPartitioning in Part 3.

Why “partition” by empid (while wanting a small number of partitions)? It could be for joining two tables by empid (see example in Part 3).
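The modulo scheme described above can be written out directly. The function names `hash_partition` and `partition_table` are illustrative; the employee IDs are the ones from the table.

```python
NUM_PARTITIONS = 2  # the slide uses empid modulo 2

emp_ids = [36554, 36553, 36555, 71222, 71225, 80162, 80233]


def hash_partition(emp_id, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition number using a function on the key."""
    return emp_id % num_partitions


def partition_table(keys, num_partitions=NUM_PARTITIONS):
    """Distribute keys into a fixed number of partitions."""
    partitions = {n: [] for n in range(num_partitions)}
    for key in keys:
        partitions[hash_partition(key, num_partitions)].append(key)
    return partitions


partitions = partition_table(emp_ids)
```

The number of partitions is fixed by the hash function rather than by the number of distinct keys, so seven employee IDs land in just two partitions (even and odd).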
Partitions
• Partitions split the larger dataset into more meaningful chunks.
• Partitioning improves I/O performance
• Hive provides two kinds of partitions:
• Static Partition
• Dynamic Partition.
Static Partitions

• Static partitioning can be done on columns whose values are known at compile time.
• To create a static partition based on the “gpa” column:

CREATE TABLE IF NOT EXISTS static_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

• To load data into the partitioned table from another table:

INSERT OVERWRITE TABLE static_part_student PARTITION (gpa = 4.0)
SELECT rollno, name FROM ext_student WHERE gpa = 4.0;
Dynamic Partition

• Dynamic partitioning is used on columns whose values are known only at execution time.
• To create a dynamically partitioned table:

CREATE TABLE IF NOT EXISTS dynamic_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

• To load data into a dynamic partition table from another table:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Note: The dynamic partition strict mode requires at least one static partition column. To turn this off, set hive.exec.dynamic.partition.mode=nonstrict.

INSERT OVERWRITE TABLE dynamic_part_student
PARTITION (gpa)
SELECT rollno, name, gpa FROM ext_student;
Hive Partitions and Buckets
• Hive partitions are partitions generated by using keys, which may or may not be column values.

• Hive buckets are “partitions” generated by hashing column values.

Hive Partitions
Example: partition by the date the data was created, and further by country. Note: the date is not part of the table data.

CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

• Separate folders/directories are created per partition.
• Partition values may or may not be part of the table data.
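To make the "separate directories per partition" point concrete, the sketch below builds the nested `column=value` directory name for a partition. The table directory `/user/hive/warehouse/logs` is an assumption (it depends on the warehouse configuration), and `partition_path` is a hypothetical helper, not a Hive API.

```python
def partition_path(table_dir, **partition_values):
    """Build the directory for one partition: one level per partition column."""
    levels = [f"{column}={value}" for column, value in partition_values.items()]
    return "/".join([table_dir, *levels])


# Assumed table location; the actual path depends on the Hive configuration.
path = partition_path("/user/hive/warehouse/logs", dt="2001-01-01", country="GB")
```

Because the partition values are encoded in the directory names, they do not need to be stored inside the data files, which is why the date in this example is not part of the table data.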
Hive Buckets
• Buckets are specified using column names and the number of buckets.

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

• CLUSTERED BY (id): the column on which to hash.
• INTO 4 BUCKETS: the number of buckets into which column entries should be hashed.
• Hive can use id modulo 4 to hash.
• To create a bucketed table having 3 buckets.

CREATE TABLE IF NOT EXISTS student_bucket (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;

• To load data into the bucketed table (the gpa column of STUDENT fills the grade column).

FROM student INSERT OVERWRITE TABLE student_bucket
SELECT rollno, name, gpa;

• To display the content of the first bucket.

SELECT DISTINCT grade FROM student_bucket TABLESAMPLE(BUCKET 1 OUT OF 3 ON grade);

Hive supports aggregation functions like avg, count, etc.

To use the average and count aggregation functions:

SELECT avg(gpa) FROM student;

SELECT count(*) FROM student;

To use the GROUP BY and HAVING clauses:

SELECT rollno, name, gpa
FROM student
GROUP BY rollno, name, gpa
HAVING gpa > 4.0;
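These queries use standard SQL syntax, so their behaviour can be tried without a Hadoop cluster. The sketch below runs them against an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for Hive here, and the three student rows are made-up sample data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (rollno INTEGER, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 3.5), (2, "Ravi", 4.0), (3, "Meena", 4.5)],  # sample rows
)

# Aggregate functions collapse all rows into a single value.
avg_gpa = conn.execute("SELECT avg(gpa) FROM student").fetchone()[0]
row_count = conn.execute("SELECT count(*) FROM student").fetchone()[0]

# GROUP BY forms the groups; HAVING then filters the groups.
high_gpa = conn.execute(
    "SELECT rollno, name, gpa FROM student "
    "GROUP BY rollno, name, gpa HAVING gpa > 4.0"
).fetchall()
conn.close()
```

In Hive the same statements would be compiled into distributed jobs rather than executed locally, but the results follow the same semantics.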
SerDe
• SerDe stands for Serialization and Deserialization.
• A serialization function converts complex in-memory data (for example a Java class object or Hive
Table) into a string to be stored on disk.
• The string is usually in compressed binary format for space savings.
• A deserialization function converts the string back into the complex data structure in-memory.
• Since a string is a “uniform sequential” or serial data form as opposed to a complex data
structure, this conversion is known as serialization.
• Hive can use SerDe functions to read / write its data from HDFS efficiently.
• Custom SerDe can be used.
• Note: RCFile storage uses SerDe to compress column data.
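The round trip described above can be sketched in Python. This is only an analogy for the concept: it uses JSON plus zlib compression from the standard library, not Hive's SerDe interface, and the sample row is hypothetical.

```python
import json
import zlib


def serialize(row):
    """Turn an in-memory structure into a compressed byte string for storage."""
    text = json.dumps(row)                     # complex structure -> flat string
    return zlib.compress(text.encode("utf-8"))  # compressed binary form


def deserialize(blob):
    """Turn the stored bytes back into the in-memory structure."""
    text = zlib.decompress(blob).decode("utf-8")
    return json.loads(text)


row = {"rollno": 1, "name": "Asha", "gpa": 4.0, "courses": ["BDA", "DBMS"]}
blob = serialize(row)
restored = deserialize(blob)
```

The stored form is a single flat sequence of bytes, while the restored form is again a nested structure, which is the distinction the serialization bullets draw.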
User Defined Functions
User Defined Functions allow customization of Hive queries.

1. Create a Java class for the User Defined Function. The class must extend the UDF abstract class.
2. The class must have one or more evaluate() methods. Put in your desired logic.
3. Compile the Java file.
4. Package your Java class into a JAR file.
5. Go to the Hive CLI and add your JAR.
6. CREATE TEMPORARY FUNCTION in Hive which points to your Java class.
• Use it in Hive SQL!

import org.apache.hadoop.hive.ql.exec.UDF;

public final class MyUpperCase extends UDF {
    // Hive calls evaluate() once per row; NULL input yields NULL output.
    public String evaluate(final String word) {
        if (word == null) {
            return null;
        }
        return word.toUpperCase();
    }
}

hive> ADD JAR UpperCase.jar;
hive> CREATE TEMPORARY FUNCTION toUpperCase AS 'MyUpperCase';
hive> SELECT toUpperCase(name) FROM STUDENT;

Note: The syntax of the Hive commands above is not meant to be complete and is for illustration purposes only.
End of Unit 5
