Bda Unit-5 PDF
Topics
• Introduction to NoSQL Databases
• Introduction to Hive
Relational Databases
• A relational database is a database that stores data in a structured format, using rows and columns. This makes it easy to locate and access specific values within the database.
• "Relation" is sometimes used to refer to a table in a relational database, but more commonly refers to the relation between the different elements of a row, e.g.
• SQL is much more compact and expressive than programs written in standard programming languages such as C++ or Java, but only for tabular data stores.
Note: SQL has proven to be a very effective language for relational databases, which is why it is so closely associated with RDBMS systems. But there is no rule stating that SQL has to be used with an RDBMS.
Denormalization
• Avoid joins
• Expand the number of columns
• Design the table to include related data
• Query a single table
• Improves read performance
• Introduces the possibility of data anomalies
Creating shards
• Breaking up a database and storing pieces of the database on different servers.
• Uses multiple database instances
• Each instance stores a subset of the data
• Queries are served from a subset of shards
• Improves read and write performance
• Complex
Replication
• Makes copies of tables and indexes
• Copies are stored on different servers
• Any copy may be used to answer a query
• Improves read performance
• Possibility of inconsistency
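The trade-off in the last two bullets can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `ReplicationSketch` and its methods `write`, `read`, and `sync` are hypothetical names invented for illustration, and real databases replicate over a network with far more machinery. Writes go to one copy, the second copy is updated later, and a read from the second copy can return stale data in between.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of lazy replication (hypothetical, for illustration only).
public class ReplicationSketch {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();

    // Writes go to the primary copy only; the replica is updated later.
    public void write(String key, String value) {
        primary.put(key, value);
    }

    // Any copy may answer a read; here the caller chooses which one.
    public String read(String key, boolean fromReplica) {
        return (fromReplica ? replica : primary).get(key);
    }

    // Copies the primary's current contents to the replica.
    public void sync() {
        replica.putAll(primary);
    }

    public static void main(String[] args) {
        ReplicationSketch db = new ReplicationSketch();
        db.write("36554", "India");
        System.out.println(db.read("36554", true)); // prints "null": replica is stale
        db.sync();
        System.out.println(db.read("36554", true)); // prints "India"
    }
}
```

Reads spread over two copies are cheaper than reads against one, but until `sync()` runs the two copies disagree, which is the inconsistency the slide warns about.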
Not Only SQL (NoSQL) Databases: History
• RDBMSs were found unsuitable for handling the unstructured data generated by the proliferation of the internet.
• Unstructured data includes: web pages, images, audio clips, videos, documents (pdf, csv, text).
• There is a need to mine the data, hence a need to store and manipulate the data in an efficient and organized manner.
• It is difficult to scale an RDBMS on clusters.
• All of the above gave rise to NoSQL databases around the year 2000.
• Origins in Google’s BigTable and Amazon’s SimpleDB.
• Note: The “SQL” in the name “NoSQL” does not imply that this category of databases does not or cannot use SQL as the query language.
What is NoSQL?
• Non-relational data storage systems
• No joins
• No multi-document transactions
• Document Store.
• Column Store.
• Graph databases
Key-Value Pair Store
• Key is unique.
• Easy to distribute.
• Basic Availability: The database appears to work most of the time (even if some nodes fail, or packets are dropped).
• Soft state: State changes even without input (to provide eventual consistency).
• Eventual consistency: Stores exhibit consistency at some later point.
Soft state and eventual consistency both have to do with the “C” in CAP.
BASE is a relaxed form of the CAP properties. NoSQL databases strive to satisfy the
BASE properties.
The BASE model is a flexible alternative (as is found acceptable with customer
shopping data) to the ACID model for databases that don't require strict adherence
to a relational model (as is required for banking data).
NoSQL Pros and Cons
Pros
• Handles the diverse kinds of data generated by the proliferation of the internet. Flexible.
• Designed to scale.
• Easier to maintain.
Cons
• Not mature.
• Does not provide the same level of guarantees (ACID properties) as RDBMS systems.
• Not transactional.
• Less secure.
Hive 0.10
• Batch
• Read-only data
• HiveQL
• MR
Hive 0.13
• Interactive
• Read-only data
• Substantial SQL
• MR, Tez
Hive 0.14
• Transactions with ACID semantics
• Cost-based optimizer
• SQL temporary tables
• MR, Tez, Spark
Embedded metastore
In this mode, the Metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by local disk. This mode requires the least configuration but supports only one session at a time, so it is not suited for production.
Local metastore
In this mode, the Metastore service runs in the same JVM as the Hive service, but the Metastore database runs in a separate process.
Remote metastore
In this mode, the Metastore service runs in its own JVM. This brings better manageability and security because the database tier can be completely firewalled off, and the clients no longer need the database credentials. The Metastore service communicates with the database over JDBC. Hadoop ecosystem software can communicate with Hive using the Thrift service.
Database: Namespaces that separate tables and other data units
SQL vs. HiveQL
• SQL: Insert values row by row
• HiveQL: Insertion of bulk data (not a single row at a time)
4. Download the contents of a table to a local directory, or the results of queries to an HDFS directory.
5. A large number of functions are defined in Hive, categorized as mathematical, statistical, string, date, conditional, aggregate and so on.
We can retrieve the list in the Hive shell with
hive> SHOW FUNCTIONS;
Data Definition Language
• Build and modify the tables & other objects in the database
• Create/Drop/Alter Database
• Create/Drop/Truncate Table
• Alter Table/Partition/Column
• Create/Drop/Alter View
• Create/Drop/Alter Index
• Show
• Describe
Data Manipulation Language
• Used to retrieve, store, modify, delete, and update data in the database
Database
• To create a database named “STUDENTS” with comments and database properties.
Breaking the data into too many small partitions degrades performance, e.g. partitioning by employeeID.
Hashing - Preliminaries
Employee Name     Employee ID   Country
Alok Nath         36554         India
Arun Thomas       36553         India
Geeta Rao         36555         India
Susan Phillips    71222         UK
John Chambers     71225         UK
Liam Neeson       80162         Ireland
Milo O’Shea       80233         Ireland
We need to “partition” by empid, but do not want one partition per key since that leads to too many partitions.
Big Table Solution:
- map each key to a number, e.g. (empid modulo 2).
  - 36554 mod 2 = 0
  - 36553 mod 2 = 1
  - even EmpID mod 2 = 0
  - odd EmpID mod 2 = 1
- partition by the above number.
- two partitions are generated for the above numbers, one with odd empid, the other with even empid.
- Generating a partition number using a function on a key is called Hashing.
- Partitioning based on hashing is called hashPartitioning in Part 3.
Why “partition” by empid (but want a small number of partitions)? It could be for joining two tables by empid (see example in Part 3).
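The modulo hashing described on this slide can be sketched in a few lines of Java. This is a minimal illustration, not Hive's or BigTable's actual implementation; the class name `HashPartitionSketch` and the methods `partitionFor` and `partition` are hypothetical. The employee IDs are the ones from the table on this slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of hash partitioning by empid (illustration only).
public class HashPartitionSketch {
    // Hash function from the slide: partition number = empid modulo numPartitions.
    public static int partitionFor(int empId, int numPartitions) {
        return Math.floorMod(empId, numPartitions);
    }

    // Groups employee IDs by their partition number.
    public static Map<Integer, List<Integer>> partition(int[] empIds, int numPartitions) {
        Map<Integer, List<Integer>> parts = new TreeMap<>();
        for (int id : empIds) {
            int p = partitionFor(id, numPartitions);
            parts.computeIfAbsent(p, k -> new ArrayList<>()).add(id);
        }
        return parts;
    }

    public static void main(String[] args) {
        int[] empIds = {36554, 36553, 36555, 71222, 71225, 80162, 80233};
        // prints {0=[36554, 71222, 80162], 1=[36553, 36555, 71225, 80233]}
        System.out.println(partition(empIds, 2));
    }
}
```

Seven keys end up in only two partitions, one for even and one for odd IDs. Changing the second argument changes the number of partitions without touching the keys.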
Partitions
• Partitions split the larger dataset into more meaningful chunks.
• Partitioning improves I/O performance
• Hive provides two kinds of partitions:
• Static Partition
• Dynamic Partition.
Static Partitions
• Static partitioning can be done on columns whose values are known at compile time.
• Create a static partition based on the “gpa” column:
CREATE TABLE IF NOT EXISTS static_part_student (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Dynamic Partitions
• Dynamic partitioning is done on columns whose values are known only at execution time.
Note: The dynamic partition strict mode requires at least one static partition column. To turn this off,
set hive.exec.dynamic.partition.mode=nonstrict
Buckets
• Column on which to hash.
• Number of buckets into which column entries should be hashed.
1. Create a Java class for the User Defined Function. The class must extend Hive's UDF base class.
2. The class must have one or more evaluate() methods. Put in your desired logic.
• Use it in Hive SQL!
import org.apache.hadoop.hive.ql.exec.UDF;

public final class MyUpperCase extends UDF {
    // Hive calls evaluate() once per row with the column value.
    public String evaluate(final String word) {
        return word.toUpperCase();
    }
}
Note: The syntax of the Hive commands above is not meant to be complete and is for illustration purposes only.
End of Unit 5