
Big Data Unit 3

The document provides an overview of NoSQL databases, highlighting their differences from relational databases, including schema flexibility, consistency models, and transaction support. It discusses the history, features, types (document, graph, key-value, columnar), and benefits of NoSQL databases, as well as the CAP theorem and BASE properties. Additionally, it addresses eventual consistency and includes references for further reading.


Apex Institute of Technology

Department of Computer Science & Engineering


Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
NoSQL
SQL vs NoSQL Databases
• NoSQL databases support a very simple query language; relational databases support a powerful query language (SQL).
• NoSQL databases have no fixed schema; relational databases have a fixed schema.
• NoSQL databases are only eventually consistent; relational databases follow the ACID properties (Atomicity, Consistency, Isolation, and Durability).
• NoSQL databases don't support transactions (only simple transactions); relational databases support transactions, including complex transactions with joins.
• NoSQL databases are used to handle data arriving at high velocity; relational databases are used to handle data arriving at low velocity.
• NoSQL data arrive from many locations; data in a relational database arrive from one or a few locations.
• NoSQL databases can manage structured, unstructured and semi-structured data; relational databases manage only structured data.
• NoSQL databases have no single point of failure; relational databases have a single point of failure, with failover.
• NoSQL databases can handle big data, i.e. data in very high volume; relational databases are used to handle a moderate volume of data.
• NoSQL databases have a decentralized structure; relational databases have a centralized structure.
Brief History of NoSQL Databases

• 1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database
• 2000- Graph database Neo4j is launched
• 2004- Google BigTable is launched
• 2005- CouchDB is launched
• 2007- The research paper on Amazon Dynamo is released
• 2008- Facebook open-sources the Cassandra project
• 2009- The term NoSQL was reintroduced
Features of NoSQL
Non-relational
• NoSQL databases never follow the relational
model
• Never provide tables with flat fixed-column
records
• Work with self-contained aggregates or BLOBs
• Don't require object-relational mapping or data normalization
• No complex features like query languages, query planners, referential-integrity joins, or ACID
Schema-free

• NoSQL databases are either schema-free or have relaxed schemas


• Do not require any sort of definition of the schema of the data
• Offers heterogeneous structures of data in the same domain
Simple API
• Offers easy-to-use interfaces for storing and querying the data provided
• APIs allow low-level data manipulation and selection methods
• Text-based protocols, mostly HTTP REST with JSON
• Mostly no standards-based NoSQL query language
• Web-enabled databases running as internet-facing services
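As an illustration of such a text-based HTTP/JSON interface, the short Python sketch below talks to a CouchDB-style REST endpoint; the host, port and database name are assumptions, not part of the slides.

import requests  # assumes the 'requests' package is installed

BASE = "http://localhost:5984"  # assumed CouchDB-style HTTP endpoint
requests.put(f"{BASE}/articles")  # create a database over plain HTTP
requests.put(  # store a JSON document under a key
    f"{BASE}/articles/article-001",
    json={"title": "NoSQL basics", "tags": ["nosql", "bigdata"]},
)
doc = requests.get(f"{BASE}/articles/article-001").json()  # read it back as JSON
print(doc["title"])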
Distributed
• Multiple NoSQL databases can be executed in a distributed fashion
• Offer auto-scaling and fail-over capabilities
• Often the ACID concept is sacrificed for scalability and throughput
• Mostly no synchronous replication between distributed nodes; instead, asynchronous multi-master replication, peer-to-peer, or HDFS replication
• Often provide only eventual consistency
• Shared-Nothing architecture, which enables less coordination and higher distribution
Types of NoSQL Databases
 Here is a limited taxonomy of NoSQL databases:

NoSQL Databases: Document Stores, Graph Databases, Key-Value Stores, Columnar Databases
Types of NoSQL Databases
Document Stores
 Document-Oriented NoSQL DB stores and retrieves data as a
key value pair but the value part is stored as a document. The
document is stored in JSON or XML formats. The value is
understood by the DB and can be queried.
 Documents are stored in some standard format or encoding
(e.g., XML, JSON, PDF or Office Documents)
 These are typically referred to as Binary Large Objects (BLOBs)
 Documents can be indexed
 This allows document stores to outperform traditional file
systems
• The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications. It should not be used for complex transactions that require multiple operations, or for queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems.
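As a minimal sketch of the document model, the snippet below uses MongoDB's Python driver; the database, collection and field names are invented for illustration.

from pymongo import MongoClient  # assumes the 'pymongo' package and a local MongoDB server

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]  # database / collection

# Documents in the same collection may have different structures (schema-free).
products.insert_one({"_id": 1, "name": "laptop", "price": 900, "specs": {"ram_gb": 16}})
products.insert_one({"_id": 2, "name": "pen", "price": 2})

# The value (the document) is understood by the DB, so its fields can be queried.
for doc in products.find({"price": {"$lt": 100}}):
    print(doc["name"])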
Types of NoSQL Databases
 Here is a limited taxonomy of NoSQL databases:

NoSQL Databases: Document Stores, Graph Databases, Key-Value Stores, Columnar Databases
Graph Databases
• A graph database stores entities as well as the relations among those entities. An entity is stored as a node, with the relationships as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier.
• Graph databases are mostly used for social networks, logistics and spatial data.
• Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.
Graph Databases
 Data are represented as vertices and edges
[Figure: an example graph with person nodes (Id: 1, Name: Alice, Age: 18; Id: 2, Name: Bob, Age: 22) and a group node (Id: 3, Name: Chess, Type: Group) connected by edges]
 Graph databases are powerful for graph-like queries (e.g., find the shortest path between two elements)
 E.g., Neo4j and VertexDB
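A hedged sketch of such a graph-like query using the Neo4j Python driver and Cypher follows; the connection details, labels and property names are assumptions for a local test instance.

from neo4j import GraphDatabase  # assumes the 'neo4j' driver package

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed local server

with driver.session() as session:
    # Nodes (Alice, Bob, the Chess group) and edges, as in the figure above.
    session.run(
        "CREATE (a:Person {name:'Alice', age:18})-[:FRIEND_OF]->"
        "(b:Person {name:'Bob', age:22}), (a)-[:MEMBER_OF]->(:Group {name:'Chess'})"
    )
    # A graph-like query: the shortest path between two elements.
    record = session.run(
        "MATCH p = shortestPath((a:Person {name:'Alice'})-[*]-(g:Group {name:'Chess'})) "
        "RETURN length(p) AS hops"
    ).single()
    print(record["hops"])

driver.close()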


Types of NoSQL Databases
 Here is a limited taxonomy of NoSQL databases:

NoSQL Databases: Document Stores, Graph Databases, Key-Value Stores, Columnar Databases
Key-Value Stores
 Keys are mapped to (possibly) more complex values (e.g., lists)
 Keys can be stored in a hash table and can be distributed easily
 Key-value stores are designed to handle lots of data and heavy load
 Key-value pair storage databases store data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.
Key-Value Stores
 Such stores typically support regular CRUD (create, read, update,
and delete) operations
 That is, no joins and aggregate functions

 E.g., Amazon DynamoDB and Apache Cassandra


Key-Value Stores
Converting a relational database into a key-value pair database
• set emp_details.first_name.01 "John"
• set emp_details.last_name.01 "Newman"
• set emp_details.address.01 "New York"
• set emp_details.first_name.02 "Michael"
• set emp_details.last_name.02 "Clarke"
• set emp_details.address.02 "Melbourne"
• set emp_details.first_name.03 "Steve"
• set emp_details.last_name.03 "Smith"
• set emp_details.address.03 "Los Angeles"
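The same flattened rows can be written to a real key-value store; a minimal sketch using the redis-py client is shown below, assuming a Redis server on localhost.

import redis  # assumes the 'redis' package and a local Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The employee attributes from above, stored as simple key-value pairs.
r.set("emp_details.first_name.01", "John")
r.set("emp_details.last_name.01", "Newman")
r.set("emp_details.address.01", "New York")

# Values are looked up by matching the key; there are no joins or aggregate functions.
print(r.get("emp_details.first_name.01"))  # -> John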
Types of NoSQL Databases
 Here is a limited taxonomy of NoSQL databases:

NoSQL Databases: Document Stores, Graph Databases, Key-Value Stores, Columnar Databases
• Column-oriented databases work on columns and are based on Google's BigTable paper. Every column is treated separately, and the values of a single column are stored contiguously.
• They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column.
• Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM and library card catalogs.
• HBase, Cassandra and Hypertable are examples of column-based databases.
Columnar Databases
 Columnar databases are a hybrid of RDBMSs and Key-
Value stores
 Values are stored in groups of zero or more columns, but in
Column-Order (as opposed to Row-Order)
[Figure: the same values (rows Alice 3 25, Bob 4 19, Carol 0 45) laid out in Row-Order, in Columnar (Column-Order), and in Columnar with Locality Groups (column family {B, C})]

 Values are queried by matching keys

 E.g., HBase and Vertica


• More specifically, column
databases use the concept of
keyspace, which is sort of like a
schema in relational models.
This keyspace contains all the
column families, which then
contain rows, which then
contain columns
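To make the keyspace / column family / row / column hierarchy concrete, here is a toy model using plain Python dictionaries; all names are invented for illustration.

# keyspace -> column family -> row key -> columns
keyspace = {
    "employees": {                                   # column family
        "row-001": {"name": "Alice", "age": 25},     # row key -> columns
        "row-002": {"name": "Bob", "age": 19},
    },
    "payroll": {                                     # another column family in the same keyspace
        "row-001": {"salary": 70000},
    },
}

# Columns of one family are stored together, so one family can be fetched in a single seek.
print(keyspace["employees"]["row-001"])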
Benefits of Column Databases

• Column stores are excellent at compression and therefore are efficient in terms of
storage. This means you can reduce disk resources while holding massive amounts of
information in a single column
• Since a majority of the information is stored in a column, aggregation queries are quite fast, which is important for projects that require large numbers of queries in a small amount of time.
• Scalability is excellent with column-store databases. They can be expanded nearly infinitely, and are often spread across large clusters of machines, even numbering in the thousands. That also means they are great for Massively Parallel Processing.
• Load times are similarly excellent, as you can load a billion-row table in a few seconds, so you can load and query nearly instantly.
• They offer large amounts of flexibility, as columns do not necessarily have to look like each other; you can add new and different columns without disrupting the whole database. That being said, inserting a completely new record requires a change to all column files.
The CAP Theorem
 The limitations of distributed databases can be described in the so-called CAP theorem
 Consistency: every node always sees the same data at any
given instance (i.e., strict consistency)

 Availability: the system continues to operate, even if nodes


in a cluster crash, or some hardware or software parts are
down due to upgrades

 Partition Tolerance: the system continues to operate in the


presence of network partitions

CAP theorem: any distributed database with shared data, can have at
most two of the three desirable properties, C, A or P
The CAP Theorem (Cont’d)
 Let us assume two nodes on opposite sides of a
network partition:

 Availability + Partition Tolerance forfeit Consistency

 Consistency + Partition Tolerance entails that one side of


the partition must act as if it is unavailable, thus
forfeiting Availability

 Consistency + Availability is only possible if there is no


network partition, thereby forfeiting Partition Tolerance
Large-Scale Databases
 When companies such as Google and Amazon were
designing large-scale databases, 24/7 Availability was key
 A few minutes of downtime means lost revenue

 When horizontally scaling databases to 1000s of


machines, the likelihood of a node or a network failure
increases tremendously

 Therefore, in order to have strong guarantees on


Availability and Partition Tolerance, they had to sacrifice
“strict” Consistency (implied by the CAP theorem)
Trading-Off Consistency
 Maintaining consistency should balance between the
strictness of consistency versus availability/scalability
 Good-enough consistency depends on your application

Loose Consistency: easier to implement, and is efficient.
Strict Consistency: generally hard to implement, and is inefficient.
The BASE Properties
 The CAP theorem proves that it is impossible to guarantee
strict Consistency and Availability while being able to
tolerate network partitions

 This resulted in databases with relaxed ACID guarantees

 In particular, such databases apply the BASE properties:


 Basically Available: the system guarantees Availability
 Soft-State: the state of the system may change over time
 Eventual Consistency: the system will eventually
become consistent
Eventual Consistency
 A database is termed as Eventually Consistent if:
 All replicas will gradually become consistent in the
absence of updates

[Figure: an update to Webpage-A is applied at one replica and propagated asynchronously until all replicas hold the new version]
Eventual Consistency:
A Main Challenge
 But, what if the client accesses the data from
different replicas?

[Figure: while the update to Webpage-A is still propagating, a client that reads from a different replica may see the old version]

 Protocols like Read Your Own Writes (RYOW) can be applied!
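A toy sketch of the Read Your Own Writes idea follows: the client remembers which replica accepted its write and keeps reading from that replica until the update has propagated. The replica names and the propagation model are illustrative only.

import random

replicas = {"r1": {}, "r2": {}, "r3": {}}  # three asynchronously replicated copies

def write(key, value):
    target = random.choice(list(replicas))
    replicas[target][key] = value
    return target  # the session remembers the replica that took the write

def read(key, session_replica=None):
    source = session_replica or random.choice(list(replicas))
    return replicas[source].get(key)  # other replicas may still be stale

mine = write("Webpage-A", "version-2")
print(read("Webpage-A", session_replica=mine))  # the client always sees its own write
# Asynchronous replication would later copy the value to the remaining replicas.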
Q/A
• What does NoSQL stand for?
• a) Not Only SQL
• b) Non-SQL
• c) No Structured Query Language
• d) Non-Sequential Query Logic

43
Q/A
• Which of the following is a characteristic of NoSQL databases?
• a) They use a fixed schema for data storage.
• b) They are only suitable for small-scale applications.
• c) They provide ACID (Atomicity, Consistency, Isolation, Durability)
transactions.
• d) They offer flexible and scalable data models.

44
Q/A
Which type of data model is commonly used in NoSQL databases?
a) Relational model
b) Document model
c) Entity-relationship model
d) Hierarchical model

45
Q/A
Which NoSQL database is known for its high scalability and fault
tolerance?
a) Cassandra
b) Redis
c) CouchDB
d) Neo4j
Ans : a) Cassandra

43
Which NoSQL database is optimized for handling large graphs and
complex relationships?
a) Cassandra
b) Redis
c) CouchDB
d) Neo4j
Ans : d) Neo4j

44
Q/A
Which of the following is an example of a NoSQL database?
a) MySQL
b) PostgreSQL
c) MongoDB
d) Oracle Database

Ans: c) MongoDB

45
Q/A
Which NoSQL database provides a key-value data model?
a) MongoDB
b) Cassandra
c) CouchDB
d) Redis

• Ans : d) Redis

32
Q/A
Which NoSQL database is suitable for storing semi-structured or
unstructured data, such as JSON documents?
a) MongoDB
b) Cassandra
c) CouchDB
d) Redis

33
Q/A
Which type of NoSQL database is suitable for storing hierarchical
data?
a) Document database
b) Columnar database
c) Key-value store
d) Graph database

Ans :d) Graph database

34
Q/A
Which property of NoSQL databases allows for flexible and dynamic
schema designs?
a) ACID compliance
b) Strong data consistency
c) Horizontal scalability
d) Schema flexibility

12
Q/A
Which property of NoSQL databases enables them to handle large
amounts of data and high traffic loads?
a) ACID compliance
b) Strong data consistency
c) Horizontal scalability
d) Schema flexibility

13
Q/A
Which property of NoSQL databases allows for distributed data
storage across multiple servers?
a) ACID compliance
b) Strong data consistency
c) Horizontal scalability
d) Schema flexibility

14
HBase
Introduction
HBase is
• an open-source,
• column-oriented,
• distributed database system that runs in a Hadoop environment.
• It is modeled on Google's BigTable and is written primarily in Java.
• Apache HBase is needed for real-time Big Data applications.
• HBase is built for low-latency operations.
Apache HBase Features

• HBase is built for low latency operations


• HBase is used extensively for random read and write operations
• HBase stores a large amount of data in terms of tables
• Provides linear and modular scalability over cluster environment
• Strictly consistent reads and writes
• Automatic and configurable sharding of tables
• Automatic failover support between Region Servers
• Convenient base classes for backing Hadoop MapReduce jobs in HBase tables
• Easy to use Java API for client access
• Block cache and Bloom Filters for real-time queries
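The slides mention the Java client API; as a quick illustration, the sketch below uses the third-party happybase Python client instead, assuming an HBase Thrift server is running and that the table and column names exist.

import happybase  # assumes the 'happybase' package and a running HBase Thrift server

connection = happybase.Connection("localhost")  # assumed Thrift endpoint
table = connection.table("employee")            # assumed existing table

# Random writes and reads by row key; columns are addressed as 'family:qualifier'.
table.put(b"row-001", {b"personal:name": b"John", b"personal:city": b"New York"})
print(table.row(b"row-001"))                    # low-latency point read

# A small scan over a range of row keys.
for key, data in table.scan(row_start=b"row-000", row_stop=b"row-010"):
    print(key, data)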
Which NoSQL Database to choose
• Key/Value (examples: Redis, MemcacheDB): caching, queueing, distributing information
• Column-Oriented (examples: Cassandra, HBase): scaling; keeping unstructured, non-volatile data
• Document-Oriented (examples: MongoDB, Couchbase): nested information, JavaScript-friendly applications
• Graph-Based (examples: OrientDB, Neo4J): handling complex relational information; modeling and handling classification
HBase Vs. RDBMS
• HBase is schema-less; an RDBMS has a fixed schema.
• HBase is a column-oriented datastore; an RDBMS is a row-oriented datastore.
• HBase is designed to store de-normalized data; an RDBMS is designed to store normalized data.
• HBase has wide, sparsely populated tables; an RDBMS contains thin tables.
• HBase supports automatic partitioning; an RDBMS has no built-in support for partitioning.
• HBase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
• HBase reads only the relevant data; an RDBMS retrieves one row at a time and may read unnecessary data when only part of a row is required.
• HBase can store and process structured and semi-structured data; an RDBMS stores and processes structured data.
• HBase enables aggregation over many rows and columns; in an RDBMS, aggregation is an expensive operation.
HBase Vs. Hive
• Database model: HBase is a wide column store; Hive is a relational DBMS.
• Data schema: HBase is schema-free; Hive has a schema.
• SQL support: HBase has none; Hive uses HQL (Hive Query Language).
• Partition method: both use sharding.
• Consistency level: HBase provides immediate consistency; Hive provides eventual consistency.
• Secondary indexes: HBase has none; Hive supports them.
• Replication method: both use a selectable replication factor.
Row-oriented vs column-oriented Databases:
• Row-oriented databases store table records in a sequence of rows, whereas column-oriented databases store table records in a sequence of columns, i.e. the entries in a column are stored in contiguous locations on disk.
• To better understand this, consider an example table with the columns (ID, Name, Country, Points, Car) and two records: (1, Paul Walker, US, 231, Gallardo) and (2, Vin Diesel, Brazil, 520, Mustang).
Row-oriented vs column-oriented
Databases:
• If this table is stored in a row-oriented database. It will store the records
as shown below:
• 1, Paul Walker, US, 231, Gallardo,
• 2, Vin Diesel, Brazil, 520, Mustang
• In row-oriented databases data is stored on the basis of rows or tuples as
you can see above.
• While the column-oriented databases store this data as:
• 1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
• In a column-oriented databases, all the column values are stored
together like first column values will be stored together, then the second
column values will be stored together and data in other columns are
stored in a similar manner.
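The two layouts can be reproduced with a few lines of Python; this is purely illustrative, using the example records above.

rows = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented: each record's values are kept together.
row_oriented = [value for record in rows for value in record]

# Column-oriented: all values of one column are kept together.
column_oriented = [record[i] for i in range(len(rows[0])) for record in rows]

print(row_oriented)     # 1, 'Paul Walker', 'US', 231, 'Gallardo', 2, 'Vin Diesel', ...
print(column_oriented)  # 1, 2, 'Paul Walker', 'Vin Diesel', 'US', 'Brazil', 231, 520, ...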
Row-oriented vs column-oriented
Databases:
• When the amount of data is very large, in terms of petabytes or exabytes, we use the column-oriented approach, because the data of a single column is stored together and can be accessed faster.
• The row-oriented approach comparatively handles smaller numbers of rows and columns efficiently, as a row-oriented database stores data in a structured format.
• When we need to process and analyze a large set of semi-structured or unstructured data, we use the column-oriented approach: applications dealing with Online Analytical Processing, such as data mining, data warehousing and analytics.
• Online Transactional Processing domains such as banking and finance, which handle structured data and require transactional (ACID) properties, use the row-oriented approach.
Column-oriented Databases
• Tables: Data is stored in a table format in HBase. But here tables are in column-
oriented format.
• Row Key: Row keys are used to search records, which makes searches fast; how this works is explained in the architecture part later in this unit.
• Column Families: Various columns are combined in a column family. These column
families are stored together which makes the searching process faster because data
belonging to same column family can be accessed together in a single seek.
• Column Qualifiers: Each column’s name is known as its column qualifier.
• Cell: Data is stored in cells. The data is written into cells, which are uniquely identified by a row key and a column qualifier.
• Timestamp: A timestamp is a combination of date and time. Whenever data is stored, it is stored with its timestamp. This makes it easy to search for a particular version of the data.
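A toy model of how a cell is addressed by row key, column (family:qualifier) and timestamp is sketched below in plain Python; the table and column names are invented.

import time

hbase_table = {}  # row key -> column -> timestamp -> value

def put(row_key, column, value):
    hbase_table.setdefault(row_key, {}).setdefault(column, {})[time.time_ns()] = value

put("row-001", "personal:name", "John")
put("row-001", "personal:name", "Johnny")  # a newer version of the same cell

versions = hbase_table["row-001"]["personal:name"]
print(versions[max(versions)])  # the timestamp makes it easy to pick a particular version -> Johnny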
HBase Contains
• Set of tables
• Each table with column families and rows
• Row key acts as a Primary key in HBase.
• Any access to HBase tables uses this Primary Key
• Each column qualifier present in HBase denotes an attribute of the object that resides in the cell.
HBase Architecture
• Region
• A region contains all the rows between the start key and the end key
assigned to that region.
• HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region.
• Each region contains the rows in sorted order.
• Region Server
• Many regions are assigned to a Region Server
• It is responsible for handling, managing and executing read and write operations on that set of regions.
HBase Architecture
• A table can be divided into a number of regions. A Region is a sorted
range of rows storing data between a start key and an end key.
• A Region has a default size of 256MB which can be configured
according to the need.
• A Group of regions is served to the clients by a Region Server.
• A Region Server can serve approximately 1000 regions to the client.
HBase Architecture: HMaster
• HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region Servers.
• It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
• It assigns regions to the Region Servers on startup and re-assigns regions
to Region Servers during recovery and load balancing.
• It monitors all the Region Server’s instances in the cluster (with the help
of Zookeeper) and performs recovery activities whenever any Region
Server is down.
• It provides an interface for creating, deleting and updating tables.
HBase Architecture: ZooKeeper – The
Coordinator
• ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available. ZooKeeper also provides server-failure notifications so that recovery measures can be executed.
• There is also an inactive HMaster, which acts as a backup for the active server. If the active server fails, it comes to the rescue.
• The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for the notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, its session is deleted and the inactive HMaster becomes active.
• If a Region Server fails to send a heartbeat, its session expires and all listeners are notified; the HMaster then performs suitable recovery actions, discussed later in this unit.
• ZooKeeper also maintains the path of the .META server, which helps any client search for a region. The client first checks with the .META server to find out which Region Server a region belongs to, and it gets the path of that Region Server.
HBase Architecture: Meta Table
• The META table is a special HBase catalog table. It maintains a list of all the Region Servers in the HBase storage system.
• The .META file maintains the table in the form of keys and values. The key represents the start key of a region and its id, whereas the value contains the path of the Region Server.
HBase Architecture: Components of Region
Server
• WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores new data that hasn't yet been persisted or committed to permanent storage, and it is used to recover the data sets in case of failure.
• Block Cache: The Block Cache resides at the top of the Region Server and stores frequently read data in memory. If data in the BlockCache is least recently used, it is removed from the BlockCache.
• MemStore: The write cache. It stores all incoming data before committing it to disk (permanent memory). There is one MemStore for each column family in a region, so a region has multiple MemStores because each region contains multiple column families. The data is sorted in lexicographical order before being committed to disk.
• HFile: HFiles are stored on HDFS and hold the actual cells on disk. The MemStore commits its data to an HFile when the size of the MemStore exceeds its configured threshold.
HBase Architecture: How Search Initializes in
HBase?

• ZooKeeper stores the META table location. Whenever a client approaches HBase with a read or write request, the following operations occur:
1. The client retrieves the location of the META table from ZooKeeper.
2. The client then requests the location of the Region Server for the corresponding row key from the META table. The client caches this information along with the location of the META table.
3. It then gets the row location by requesting it from the corresponding Region Server.
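The three-step read path can be mimicked with a toy walk-through; the dictionaries below stand in for ZooKeeper, the META table and the Region Servers, and all names are invented.

zookeeper = {"meta_location": "region-server-1"}
meta_table = {"region-server-1": {("emp", "row-500"): "region-server-7"}}
region_servers = {"region-server-7": {"row-500": {"personal:name": "John"}}}

client_cache = {}

def get(table, row_key):
    meta_server = zookeeper["meta_location"]                 # 1. ask ZooKeeper for the META location
    region_server = client_cache.get((table, row_key)) \
        or meta_table[meta_server][(table, row_key)]         # 2. ask META for the Region Server
    client_cache[(table, row_key)] = region_server           # the client caches this information
    return region_servers[region_server][row_key]            # 3. read the row from the Region Server

print(get("emp", "row-500"))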
Q/A
What is HBase?
a) A distributed file system
b) A columnar database
c) A graph database
d) A key-value store

Ans: b) A columnar database

24
Q/A
Which of the following is NOT a characteristic of HBase?
a) High scalability
b) Strong data consistency
c) Fault tolerance
d) Fast random read and write access

Ans: b) Strong data consistency

25
Q/A
What is the primary data model used in HBase?
a) Key-value model
b) Document model
c) Relational model
d) Graph model

Ans a) Key-value model

26
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 27
THANK YOU

28
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Hbase Architecture
HBase Architecture
• Region
• A region contains all the rows between the start key and the end key
assigned to that region.
• HBase tables can be divided into a number of regions in such a way that all
the columns of a column family is stored in one region.
• Each region contains the rows in a sorted order.
• Region Server
• Many regions are assigned to a Region Server
• It is responsible for handling, managing, executing reads and writes
operations on that set of regions.
HBase Architecture
• A table can be divided into a number of regions. A Region is a sorted
range of rows storing data between a start key and an end key.
• A Region has a default size of 256MB which can be configured
according to the need.
• A Group of regions is served to the clients by a Region Server.
• A Region Server can serve approximately 1000 regions to the client.
HBase Architecture: HMaster
HBase Architecture: HMaster
• HBase HMaster performs DDL operations (create and delete tables) and
assigns regions to the Region servers as you can see in the above image.
• It coordinates and manages the Region Server (similar as NameNode
manages DataNode in HDFS).
• It assigns regions to the Region Servers on startup and re-assigns regions
to Region Servers during recovery and load balancing.
• It monitors all the Region Server’s instances in the cluster (with the help
of Zookeeper) and performs recovery activities whenever any Region
Server is down.
• It provides an interface for creating, deleting and updating tables.
HBase Architecture: ZooKeeper – The
Coordinator
• Zookeeper acts like a coordinator inside HBase distributed environment. It helps in maintaining
server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster, sends continuous heartbeats at regular intervals
to ZooKeeper, which checks which servers are alive and available. It
also provides server failure notifications so that recovery measures can be executed.
• There is also an inactive HMaster, which acts as a backup
for the active one. If the active server fails, it comes to the rescue.
• The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for
the notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, the
session is deleted and the inactive HMaster becomes active.
• Similarly, if a Region Server fails to send a heartbeat, its session expires and all listeners are
notified about it. The HMaster then performs suitable recovery actions, discussed later.
• ZooKeeper also maintains the path of the .META server, which helps any client search for a
region. The client first checks with the .META server to find which Region Server a region belongs to,
and it gets the path of that Region Server.
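The heartbeat-and-session mechanism above can be illustrated with ZooKeeper's ephemeral znodes. A minimal sketch, assuming a ZooKeeper ensemble on 127.0.0.1:2181 and the Python kazoo client; the znode path and data are invented for illustration:

from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')   # assumed ensemble address
zk.start()

# A server registers itself with an ephemeral znode; the znode disappears
# automatically when the server's session expires (i.e. heartbeats stop).
zk.create('/demo/rs/regionserver-1', b'host1:16020', ephemeral=True, makepath=True)

# Another process watches that znode; the watch fires when the session dies,
# which is how failure notifications reach the listeners.
def on_change(event):
    print('regionserver-1 state changed:', event.type)

zk.exists('/demo/rs/regionserver-1', watch=on_change)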
HBase Architecture: Meta Table
• The META table is a special HBase catalog table. It maintains a list of
all the Region Servers in the HBase storage system.
• The .META table holds this information in the
form of keys and values. The key represents the start key of a region
and its id, whereas the value contains the path of the Region Server.
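A purely illustrative sketch of this key-to-value idea (the entries below are invented; real META rows are managed by HBase itself):

# Hypothetical META entries: (table, region start key, region id) -> region server
meta = {
    ('user_events', '',          'region-001'): 'rs1.example.com:16020',
    ('user_events', 'user_1000', 'region-002'): 'rs2.example.com:16020',
    ('user_events', 'user_2000', 'region-003'): 'rs3.example.com:16020',
}

def locate_region_server(table, row_key):
    """Pick the entry whose start key is the largest one <= row_key."""
    candidates = [k for k in meta if k[0] == table and k[1] <= row_key]
    best = max(candidates, key=lambda k: k[1])
    return meta[best]

print(locate_region_server('user_events', 'user_1500'))  # -> rs2.example.com:16020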
HBase Architecture: Components of Region
Server
• WAL: The Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
new data that hasn’t yet been persisted or committed to permanent storage. It is
used to recover the data sets in case of failure.
• Block Cache: The Block Cache resides in the Region Server. It keeps
frequently read data in memory. When the cache is full, the least recently used data
is evicted from the BlockCache.
• MemStore: It is the write cache. It stores all incoming data before committing it to
the disk or permanent memory. There is one MemStore for each column family in a
region, so a region can have multiple MemStores
because it may contain multiple column families. The data is sorted in
lexicographical order before it is committed to disk.
• HFile: HFiles are stored on HDFS and hold the
actual cells on disk. The MemStore flushes its data to an HFile when the size of
the MemStore exceeds a configured threshold.
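A toy sketch of the write path these components imply (pure Python, no HBase involved; the flush threshold and the in-memory structures are invented for illustration):

class ToyRegionWriter:
    """Illustrative only: append to WAL, buffer in MemStore, flush to an 'HFile'."""

    def __init__(self, flush_threshold=3):
        self.wal = []                 # stands in for the Write Ahead Log
        self.memstore = {}            # write cache, sorted on flush
        self.flush_threshold = flush_threshold
        self.hfiles = []              # each flush produces one immutable "HFile"

    def put(self, row_key, value):
        self.wal.append((row_key, value))        # 1. durability first
        self.memstore[row_key] = value           # 2. then the in-memory write cache
        if len(self.memstore) >= self.flush_threshold:
            self._flush()                        # 3. flush when the threshold is hit

    def _flush(self):
        sorted_cells = sorted(self.memstore.items())   # lexicographic order
        self.hfiles.append(sorted_cells)
        self.memstore.clear()

writer = ToyRegionWriter()
for key in ['row3', 'row1', 'row2']:
    writer.put(key, b'value')
print(writer.hfiles)   # one flushed, sorted batch of cells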
HBase Architecture: How Search Initializes in
HBase?

• As you know, ZooKeeper stores the META table location. Whenever
a client approaches HBase with a read or write request, the
following operations occur:
1. The client retrieves the location of the META table from the
ZooKeeper.
2. The client then requests for the location of the Region Server of
corresponding row key from the META table to access it. The client
caches this information with the location of the META Table.
3. Then it will get the row location by requesting from the
corresponding Region Server.
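A toy illustration of this three-step lookup with a client-side cache (pure Python; zk_lookup() and meta_lookup() stand in for the real ZooKeeper and META RPCs and are assumptions, not HBase client APIs):

region_cache = {}

def find_region_server(table, row_key, zk_lookup, meta_lookup):
    meta_server = zk_lookup()                      # step 1: ask ZooKeeper for META's location
    if (table, row_key) not in region_cache:       # step 2: ask META, then cache the answer
        region_cache[(table, row_key)] = meta_lookup(meta_server, table, row_key)
    return region_cache[(table, row_key)]          # step 3: caller contacts this Region Server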
Q/A
In the HBase architecture, which component is responsible for
managing the overall system, coordinating various operations, and
assigning regions to RegionServers?
a) HMaster
b) RegionServer
c) ZooKeeper
d) DataNode
Ans: a) HMaster

13
Q/A
Which component is responsible for handling read and write requests
from clients in HBase?
a) HMaster
b) RegionServer
c) ZooKeeper
d) DataNode
Ans: b) RegionServer

14
Q/A
What is the role of ZooKeeper in HBase?
a)Managing the Hadoop Distributed File System (HDFS)
b) Coordinating and synchronizing distributed processes in HBase
c) Serving client requests and managing tables
d) Storing and serving data in Hbase

Ans: b) Coordinating and synchronizing distributed processes in HBase

15
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 16
THANK YOU

17
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HBase Architecture: ZooKeeper
– The Coordinator
HBase Architecture: ZooKeeper – The
Coordinator
• Zookeeper acts like a coordinator inside HBase distributed environment. It helps in maintaining
server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster, sends continuous heartbeats at regular intervals
to ZooKeeper, which checks which servers are alive and available. It
also provides server failure notifications so that recovery measures can be executed.
• There is also an inactive HMaster, which acts as a backup
for the active one. If the active server fails, it comes to the rescue.
• The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for
the notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, the
session is deleted and the inactive HMaster becomes active.
• Similarly, if a Region Server fails to send a heartbeat, its session expires and all listeners are
notified about it. The HMaster then performs suitable recovery actions, discussed later.
• ZooKeeper also maintains the path of the .META server, which helps any client search for a
region. The client first checks with the .META server to find which Region Server a region belongs to,
and it gets the path of that Region Server.
HBase Architecture: Meta Table
• The META table is a special HBase catalog table. It maintains a list of
all the Region Servers in the HBase storage system.
• The .META table holds this information in the
form of keys and values. The key represents the start key of a region
and its id, whereas the value contains the path of the Region Server.
HBase Architecture: Components of Region
Server
• WAL: The Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
new data that hasn’t yet been persisted or committed to permanent storage. It is
used to recover the data sets in case of failure.
• Block Cache: The Block Cache resides in the Region Server. It keeps
frequently read data in memory. When the cache is full, the least recently used data
is evicted from the BlockCache.
• MemStore: It is the write cache. It stores all incoming data before committing it to
the disk or permanent memory. There is one MemStore for each column family in a
region, so a region can have multiple MemStores
because it may contain multiple column families. The data is sorted in
lexicographical order before it is committed to disk.
• HFile: HFiles are stored on HDFS and hold the
actual cells on disk. The MemStore flushes its data to an HFile when the size of
the MemStore exceeds a configured threshold.
HBase Architecture: How Search Initializes in
HBase?

• As you know, ZooKeeper stores the META table location. Whenever
a client approaches HBase with a read or write request, the
following operations occur:
1. The client retrieves the location of the META table from the
ZooKeeper.
2. The client then requests for the location of the Region Server of
corresponding row key from the META table to access it. The client
caches this information with the location of the META Table.
3. Then it will get the row location by requesting from the
corresponding Region Server.
Q/A
How does HBase handle data scalability?
a) By partitioning data into regions and distributing them across
RegionServers
b) By creating replicas of data in multiple clusters
c) By compressing data to reduce storage requirements
d) By optimizing query performance through indexing

Ans : a

9
Q/A
Which component in the HBase architecture provides distributed
coordination and synchronization services?
a) HMaster
b) RegionServer
c) ZooKeeper
d) DataNode

Ans: ZooKeeper

10
Q/A
What is a Region in HBase?
a) A key-value pair
b) A distributed file in HDFS
c) A unit of data storage and replication
d) A subset of rows in a table
Ans: d) A subset of rows in a table

11
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 12
THANK YOU

13
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
HBase Architecture: ZooKeeper
– The Coordinator
HBase Architecture: ZooKeeper – The
Coordinator
• Zookeeper acts like a coordinator inside HBase distributed environment. It helps in maintaining
server state inside the cluster by communicating through sessions.
• Every Region Server, along with the HMaster, sends continuous heartbeats at regular intervals
to ZooKeeper, which checks which servers are alive and available. It
also provides server failure notifications so that recovery measures can be executed.
• There is also an inactive HMaster, which acts as a backup
for the active one. If the active server fails, it comes to the rescue.
• The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for
the notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, the
session is deleted and the inactive HMaster becomes active.
• Similarly, if a Region Server fails to send a heartbeat, its session expires and all listeners are
notified about it. The HMaster then performs suitable recovery actions, discussed later.
• ZooKeeper also maintains the path of the .META server, which helps any client search for a
region. The client first checks with the .META server to find which Region Server a region belongs to,
and it gets the path of that Region Server.
HBase Architecture: Meta Table
• The META table is a special HBase catalog table. It maintains a list of
all the Region Servers in the HBase storage system.
• The .META table holds this information in the
form of keys and values. The key represents the start key of a region
and its id, whereas the value contains the path of the Region Server.
HBase Architecture: Components of Region
Server
• WAL: The Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
new data that hasn’t yet been persisted or committed to permanent storage. It is
used to recover the data sets in case of failure.
• Block Cache: The Block Cache resides in the Region Server. It keeps
frequently read data in memory. When the cache is full, the least recently used data
is evicted from the BlockCache.
• MemStore: It is the write cache. It stores all incoming data before committing it to
the disk or permanent memory. There is one MemStore for each column family in a
region, so a region can have multiple MemStores
because it may contain multiple column families. The data is sorted in
lexicographical order before it is committed to disk.
• HFile: HFiles are stored on HDFS and hold the
actual cells on disk. The MemStore flushes its data to an HFile when the size of
the MemStore exceeds a configured threshold.
HBase Architecture: How Search Initializes in
HBase?

• As you know, ZooKeeper stores the META table location. Whenever
a client approaches HBase with a read or write request, the
following operations occur:
1. The client retrieves the location of the META table from the
ZooKeeper.
2. The client then requests for the location of the Region Server of
corresponding row key from the META table to access it. The client
caches this information with the location of the META Table.
3. Then it will get the row location by requesting from the
corresponding Region Server.
Q/A
Which component in the HBase architecture directly interacts with
ZooKeeper?
a) HMaster
b) RegionServer
c) DataNode
d) HDFS NameNode

Ans: HMaster

9
Q/A
Which consistency model is followed by ZooKeeper?
a) Eventual consistency
b) Strong consistency
c) Sequential consistency
d) Event-driven consistency

Ans: b) Strong consistency

10
Q/A
Which of the following operations can be performed on a znode in
ZooKeeper?
a) Read and write data
b) Query with SQL-like queries
c) Create and delete tables
d) Execute distributed data processing tasks

Ans :a) Read and write data

11
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 12
THANK YOU

13
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
FLUME ARCHITECTURE
FLUME – ARCHITECTURE

• Data generators (such as Facebook and Twitter) generate data,
which gets collected by individual Flume agents running on them.
• Thereafter, a data collector (which is also an agent) collects the data from the
agents; it is aggregated and pushed into a centralized store such as HDFS or HBase.
Flume Event

• An event is the basic unit of the data transported inside
Flume. It contains a byte-array payload that is to be
transported from the source to the destination, accompanied by
optional headers. A typical Flume event would have the
following structure:
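The original slide showed this structure as a figure; a minimal sketch of the same idea (the field names are an assumption for illustration, not Flume's Java API):

from dataclasses import dataclass, field

@dataclass
class ToyFlumeEvent:
    body: bytes                                   # the payload of the event
    headers: dict = field(default_factory=dict)   # optional key/value metadata

event = ToyFlumeEvent(body=b'GET /index.html 200',
                      headers={'host': 'web-01', 'timestamp': '1690000000'})
print(event.headers, event.body)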
Flume Agent

• An agent is an independent daemon process (JVM) in Flume. It receives
data (events) from clients or other agents and forwards it to its next
destination.
• A Flume Agent contains three main components, namely source, channel,
and sink.
Source

• A source receives data from log/event data generators such as
Facebook, Twitter, and other webservers, and transfers it to the
channel in the form of Flume events.
• Data generators like webservers generate data and deliver it to the
agent. A source is a component of the agent which receives this data
and transfers it to one or more channels.
• Apache Flume supports several types of sources and each source
receives events from a specified data generator. For example, Avro
source receives data from the clients which generate data in the form
of Avro files.
• Flume supports the following sources: Avro, Exec, Spooling directory,
Net Cat, Sequence generator, Syslog, Multiport TCP, Syslog UDP, and
HTTP.
Channel

• A channel is a transient store which receives the events from
the source and buffers them till they are consumed by sinks. It
acts as a bridge between the sources and the sinks.
• These channels are fully transactional and they can work with
any number of sources and sinks. Example: JDBC channel, File
system channel, Memory channel, etc.
Sink

• Finally, the sink stores the data into centralized stores like
HBase and HDFS.
• It consumes the data (events) from the channels and delivers it
to the destination.
• The destination of the sink might be another agent or the
central stores.
• Example: HDFS sink. Flume supports the following sinks: HDFS
sink, Logger, Avro, Thrift, IRC, File Roll, Null sink, HBase, and
Morphline solr.
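A toy sketch of how an event flows through the source, channel, and sink described above (pure Python, with an in-memory queue standing in for a memory channel; nothing here is Flume's actual API):

from collections import deque

channel = deque()                      # stands in for a memory channel

def source(raw_lines):
    """Turn raw records into events and put them on the channel."""
    for line in raw_lines:
        channel.append({'headers': {}, 'body': line.encode()})

def sink(store):
    """Drain the channel and deliver events to the centralized store."""
    while channel:
        store.append(channel.popleft())

hdfs_stand_in = []                     # stands in for HDFS/HBase
source(['event one', 'event two'])
sink(hdfs_stand_in)
print(len(hdfs_stand_in), 'events delivered')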
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 9
THANK YOU

10
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Sqoop
Apache SQOOP (SQL-to-Hadoop)
• Apache SQOOP (SQL-to-Hadoop) is a tool designed to support
bulk export and import of data into HDFS from structured data
stores such as relational databases, enterprise data warehouses,
and NoSQL systems.
• It is a data migration tool based upon a connector architecture
which supports plugins to provide connectivity to new external
systems.
Sqoop: SQL to Hadoop and Hadoop to SQL Tool

• Sqoop is a tool used for data transfer between RDBMS (like MySQL,
Oracle SQL etc.) and Hadoop (Hive, HDFS, and HBASE etc.)
• It is used to import data from RDBMS to Hadoop and export data
from Hadoop to RDBMS.
• Again Sqoop is one of the top projects by Apache software
foundation and works brilliantly with relational databases such as
Teradata, Netezza, Oracle, MySQL, and Postgres etc.
• In Sqoop, developers just need to mention the source, destination
and the rest of the work will be done by the Sqoop tool.
Features of Sqoop

• Sqoop is robust, easily usable and has community support and
contribution.
• Currently, the latest Sqoop version is 1.4.6. Its key features include:
• Full Load
• Incremental Load
• Parallel import/export
• Import results of SQL query
• Compression
• Connectors for all major RDBMS Databases
• Kerberos Security Integration
• Load data directly into Hive/HBase
• Support for Accumulo
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 8
THANK YOU

9
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Sqoop Architecture
Sqoop Architecture
• When Sqoop starts functioning, only
mapper job will run and reducer is
not required.
• Here is a detailed view of Sqoop
architecture with mapper
• Sqoop provides command line
interface to the end users and can
also be accessed using Java API.
• Here only Map phase will run and
reduce is not required because the
complete import and export process
doesn’t require any aggregation and
so there is no need for reducers in
Sqoop.
Sqoop Architecture

• Sqoop mainly performs two functions:
• Import and
• Export
Sqoop Import

• The Sqoop import tool will import each table of the RDBMS in
Hadoop and each row of the table will be considered as a record in
the HDFS.
• All records are stored as text data in text files or as binary data in
Avro and Sequence files.
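A typical import invocation looks like the following sketch; the JDBC URL, username, table name, and target directory are placeholders, not values from these slides:

$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user -P \
    --table customers \
    --target-dir /user/hadoop/customers \
    --num-mappers 4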
Sqoop Export
• The Sqoop export tool will export Hadoop files back to RDBMS
tables. The records in the HDFS files will be the rows of a table.
• Those are read and parsed into a set of records and delimited with a
user-specified delimiter.
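The export direction is symmetric; again a sketch with placeholder connection details, and the target table is assumed to already exist in the RDBMS:

$ sqoop export \
    --connect jdbc:mysql://db.example.com/sales \
    --username sqoop_user -P \
    --table daily_summary \
    --export-dir /user/hadoop/summary \
    --input-fields-terminated-by ','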
Sqoop Installation

• Step 1: Download Sqoop
• Step 2: Start with the Sqoop installation
• Step 3: Configure the bashrc file
• Step 4: Configure Sqoop
• Step 5: Configure MySQL
• Step 6: Verify Sqoop
Q/A
What is Sqoop?
a) A distributed file system
b) A query language for relational databases
c) A data ingestion tool for transferring data between Hadoop and
relational databases
d) A data visualization and reporting tool

12
Q/A
Which of the following databases can be used as a source or target for
data transfer with Sqoop?
a) MySQL
b) Oracle
c) PostgreSQL
d) All of the above

13
Q/A
Which command is used to import data from a relational database
into Hadoop using Sqoop?
a) sqoop export
b) sqoop import
c) sqoop connect
d) sqoop load

14
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 15
THANK YOU

16
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
(Hadoop) Pig Dataflow Language

Apache Pig
• Apache Pig is a platform for analyzing large data sets that consists of a
high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs.
• Pig's infrastructure layer consists of
– a compiler that produces sequences of Map-Reduce programs.
• Pig's language layer currently consists of a textual language called Pig Latin, which
has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data
transformations are explicitly encoded as data flow sequences, making them easy to write,
understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics rather than
efficiency.
• Extensibility. Users can create their own functions to do special-purpose processing.

Running Pig
• You can execute Pig Latin statements:
– Interactively, using the grunt shell
$ pig ... - Connecting to ...
grunt> A = load 'data';
grunt> B = ... ;
– In batch, from the command line, in Hadoop MapReduce mode
$ pig myscript.pig
– Or in local mode
$ pig -x local myscript.pig
– Either interactively or in batch, in local or MapReduce mode

12/6/2023
Program/flow organization
• A LOAD statement reads data from the file system.
• A series of "transformation" statements process the data.
• A STORE statement writes output to the file system; or, a DUMP
statement displays output to the screen.

12/6/2023
Interpretation
• In general, Pig processes Pig Latin statements as follows:
– First, Pig validates the syntax and semantics of all statements.
– Next, if Pig encounters a DUMP or STORE, Pig will execute the
statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
• A STORE operator would instead write the result to a file.

12/6/2023
Simple Examples
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
-----------------------------------------------------------------------------
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2'
Analysing using MapReduce

12/6/2023
Limitations with MapReduce
• Analysis typically needs to be done in Java.
• Joins need to be written in Java, which makes
the code longer and more error-prone.
• For projections and filters, custom code needs to be written, which
makes the whole process slower.
• The job is divided into many stages while using MapReduce, which
makes it difficult to manage.

QUESTION: ANALYZING HOW
MANY TWEETS ARE STORED PER
USER, IN THE GIVEN TWEET
TABLES?

Steps
• STEP 1– First of all, Twitter imports the Twitter tables (i.e. the
user table and tweet table) into HDFS.
• STEP 2– Then Apache Pig loads (LOAD) the tables into the
Apache Pig framework.
• STEP 3– Then it joins and groups the tweet table and user
table using the COGROUP command.
• This results in the inner Bag data type, discussed
later.
• Example of Inner bags produced (refer to the above image) –
• (1,{(1,Jay,xyz),(1,Jay,pqr),(1,Jay,lmn)})
• (2,{(2,Ellie,abc),(2,Ellie,vxy)})
• (3, {(3,Sam,stu)})

12/6/2023
• STEP 4– Then the tweets are counted per user using the
COUNT command, so that the total number of tweets per user can
be easily calculated.
• Example of tuple produced as (id, tweet count) (refer to the above
image) –
• (1, 3)
• (2, 2)
• (3, 1)

12/6/2023
• STEP 5– At last, the result is joined with the user table to attach the
user name to the produced count.
• Example of tuple produced as (id, name, tweet count) (refer to the
above image) –
• (1, Jay, 3)
• (2, Ellie, 2)
• (3, Sam, 1)
• STEP 6– Finally, this result is stored back in the HDFS.
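Putting the steps together, a hedged Pig Latin sketch of this flow (file names, delimiters, and schemas are assumptions, not part of the original example):

-- load the two tables (schemas assumed)
users  = LOAD 'users'  USING PigStorage(',') AS (id:int, name:chararray);
tweets = LOAD 'tweets' USING PigStorage(',') AS (id:int, name:chararray, tweet:chararray);

-- STEP 3: group both relations by user id (produces the inner bags shown above)
grouped = COGROUP tweets BY id, users BY id;

-- STEP 4: count tweets per user id
counts = FOREACH grouped GENERATE group AS id, COUNT(tweets) AS tweet_count;

-- STEP 5: join back with the user table to attach the name
joined = JOIN counts BY id, users BY id;
result = FOREACH joined GENERATE counts::id, users::name, counts::tweet_count;

-- STEP 6: store the result back into HDFS
STORE result INTO 'tweets_per_user';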

More examples from Cloudera
• http://www.cloudera.com/wp-content/uploads/2010/01/IntroToPig.pdf
• https://www.edureka.co/blog/big-data-tutorial
• https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
• https://www.coursera.org/learn/fundamentals-of-big-data
• Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT
Editorial Service, Dreamtech Press
• Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
• Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H.
Gandomi , Wiley publication

12/6/2023
Q/A
What is the primary purpose of Pig Latin, the language used in Apache
Pig?
a) Real-time data processing
b) Data storage and retrieval
c) Data integration and ETL (Extract, Transform, Load)
d) Data visualization and reporting

31
Q/A
What is the key concept in Pig Latin for representing and manipulating
data?
a) Tables
b) Relations
c) Schemas
d) Dataflows

32
Q/A
Which of the following is NOT a basic data type in Pig Latin?
a) Integer
b) Float
c) Boolean
d) Character

33
THANK YOU

34
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Hive
Q/A
What is Hive in the context of Apache Hadoop?
a) A distributed file system
b) A query language for Hadoop
c) A machine learning framework
d) A data visualization tool

20
Q/A
Which programming language is commonly used to write Hive
queries?
1. Python

2. Java

3. SQL

4. C++

21
Q/A
What is the primary purpose of Hive?
a) Real-time data processing
b) Data storage and retrieval
c) Data integration and ETL (Extract, Transform, Load)
d) Data visualization and reporting

22
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 23
THANK YOU

24
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Spark
Why Use Spark
 Many important applications must process large
streams of live data and provide results in near-real-
time
- Social network trends
- Website statistics
- Intrusion detection systems
- etc.
 Require large clusters to handle workloads
 Require latencies of few seconds
 Scalable to large clusters
 Second-scale latencies
 Simple programming model
• Integrated with batch & interactive processing
• Efficient fault-tolerance in stateful computations
Polyglot
• Spark provides high-level APIs
in Java, Scala, Python and R.
• Spark code can be written in
any of these four languages. It
provides a shell in Scala and
Python.
• The Scala shell can be accessed
through ./bin/spark-shell and
Python shell through
./bin/pyspark from the
installed directory.
Speed
• Spark runs up to 100 times
faster than Hadoop
MapReduce for large-scale
data processing. Spark is able
to achieve this speed through
controlled partitioning. It
manages data using partitions
that help parallelize distributed
data processing with minimal
network traffic.
Multiple Formats:
• Spark supports multiple data
sources such as Parquet, JSON,
Hive and Cassandra apart from
the usual formats such as text
files, CSV and RDBMS tables.
The Data Source API provides a
pluggable mechanism for
accessing structured data
through Spark SQL. Data
sources can be more than just
simple pipes that convert data
and pull it into Spark.
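A small illustration of reading one format and writing another through that API (a sketch only: the paths are placeholders and a SparkSession named spark is assumed to already exist):

# Read JSON, write Parquet via the Data Source API.
df = spark.read.json("hdfs:///data/users.json")
df.write.mode("overwrite").parquet("hdfs:///data/users_parquet")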
Lazy Evaluation:
• Apache Spark delays its
evaluation till it is absolutely
necessary. This is one of the
key factors contributing to its
speed. For transformations,
Spark adds them to a DAG
(Directed Acyclic Graph) of
computation and only when
the driver requests some data,
does this DAG actually get
executed.
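A minimal PySpark sketch of this behaviour (the input path is a placeholder): the filter and map below only extend the DAG, and nothing runs until count() is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/events.log")   # assumed path

errors = lines.filter(lambda line: "ERROR" in line)   # transformation: nothing executes yet
lengths = errors.map(len)                             # transformation: still nothing executes

print(lengths.count())                                # action: the DAG is executed now
spark.stop()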
Real Time Computation:
• Spark’s computation is real-
time and has low latency
because of its in-memory
computation. Spark is designed
for massive scalability and the
Spark team has documented
users of the system running
production clusters with
thousands of nodes and
supports several
computational models.
Hadoop Integration:
• Apache Spark provides smooth
compatibility with Hadoop. This
is a boon for all the Big Data
engineers who started their
careers with Hadoop.
• Spark is a potential replacement
for the MapReduce functions of
Hadoop, while Spark has the
ability to run on top of an existing
Hadoop cluster using YARN for
resource scheduling.
Machine Learning:

• Spark’s MLlib is the machine
learning component which is
handy when it comes to big data
processing.
• It eradicates the need to use
multiple tools, one for processing
and one for machine learning.
• Spark provides data engineers
and data scientists with a
powerful, unified engine that is
both fast and easy to use.
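A tiny, self-contained sketch of what using MLlib looks like (the two labelled points are invented purely to show the fit flow):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Made-up training data, just to demonstrate DataFrame-based MLlib.
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)
spark.stop()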
Q/A
What is Apache Spark?
a) A distributed file system
b) A data storage and retrieval system
c) A data processing and analytics engine
d) A machine learning framework

32
Q/A
Which programming language is commonly used for developing Spark
applications?
a) Java
b) Python
c) C++
d) JavaScript

33
Q/A
Which of the following is NOT a component of the Apache Spark
architecture?
a) Spark Core
b) Spark SQL
c) Spark Streaming
d) Spark Machine Learning

34
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 35
THANK YOU

36
Apex Institute of Technology
Department of Computer Science & Engineering
Bachelor of Engineering (Computer Science & Engineering)
INTRODUCTION TO BDA– (21CST-246)
Prepared By: Dr. Geeta Rani (E15227)

DISCOVER . LEARN . EMPOWER


1
Spark Architecture
(Continued)
Which of the following is NOT a component of the Apache Spark
architecture?
a) Spark Core
b) Spark SQL
c) Spark Streaming
d) Spark Machine Learning

24
Q/A
Which of the following is a supported data source in Spark?
a) Hadoop Distributed File System (HDFS)
b) MySQL
c) Amazon S3
d) All of the above

25
Q/A
Which of the following is NOT a machine learning library available in
Spark?
a) Spark MLlib
b) Spark GraphX
c) Spark ML
d) Spark TensorFlow

26
References:

✔https://www.edureka.co/blog/big-data-tutorial
✔https://www.coursera.org/learn/big-data-introduction?specialization=big-data2.
✔https://www.coursera.org/learn/fundamentals-of-big-data
✔Big Data, Black Book: Covers Hadoop 2, MapReduce, Hive, YARN, Pig, R and Data Visualization, DT Editorial
Service, Dreamtech Press
✔Big Data Analytics, Subhashini Chellappa, Seema Acharya, Wiley publications
✔Big Data: Concepts, Technology, and Architecture, Nandhini Abirami R , Seifedine Kadry, Amir H. Gandomi ,
Wiley publication

8/8/2021 27
THANK YOU

28
