Module 3 Bigdata Analytics
Overview
NoSQL (Not Only SQL) databases are designed for scalability, performance, and
flexibility—ideal for big data applications.
Traditional relational databases (RDBMS) struggle with massive data volume, velocity,
and variety—commonly referred to as the 3Vs of Big Data.
NoSQL:
NoSQL stands for “Not Only SQL”. It refers to a category of database management systems
that differ from traditional relational databases (RDBMS). NoSQL is designed to handle large
volumes of unstructured, semi-structured, or structured data, making it ideal for Big Data
applications.
Feature               Description
Support for Big Data  Efficiently manages vast, complex, and fast-changing data.
NoSQL databases are characterized by their flexibility, scalability, and ability to handle large
volumes of diverse data. They offer distributed computing, flexible schemas, and are often more
cost-effective than traditional relational databases. They excel in handling unstructured and semi-
structured data, making them suitable for modern, agile applications.
The four main types of NoSQL databases are outlined below:
1. Key-Value Stores
o Structure: Data stored as a collection of key-value pairs.
o Use Case: Caching, session management.
o Examples: Redis, Riak, Amazon DynamoDB.
2. Document Stores
o Structure: JSON, BSON, or XML documents.
o Use Case: Content management systems, e-commerce platforms.
o Examples: MongoDB, CouchDB.
3. Column-Family Stores
o Structure: Data stored in columns instead of rows.
o Use Case: Analytics and data warehousing.
o Examples: Apache Cassandra, HBase.
4. Graph Databases
o Structure: Nodes and edges representing entities and relationships.
o Use Case: Social networks, recommendation engines.
o Examples: Neo4j, Amazon Neptune.
Unlike traditional relational databases (RDBMS) that store data in structured tables, NoSQL
databases offer flexibility, scalability, and high-performance solutions for modern
applications.
NoSQL databases can be classified into four main types, based on their data
storage and retrieval methods:
1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases
Each type has unique advantages and use cases, making NoSQL a preferred choice for big
data applications, real-time analytics, cloud computing, and distributed systems.
1. Document-Based Database
A document-based database is a non-relational database. Instead of storing data in rows and
columns (tables), it stores data as documents, typically in JSON, BSON, or XML format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications, which means less translation is required to use the data in an application.
Particular elements of a document can be indexed, and the index is then used for faster
querying.
Collections are groups of documents with similar contents. Documents in the same collection
are not required to share an identical schema, because document databases have a flexible
schema.
2. Key-Value Stores
A key-value store is a non-relational database and the simplest form of NoSQL database.
Every data element is stored as a key-value pair, and the data is retrieved using the unique
key assigned to each element. The values can be simple data types such as strings and
numbers, or complex objects. A key-value store resembles a relational table with only two
columns: the key and the value.
Key features of the key-value store:
Simplicity: Data retrieval is extremely fast due to direct key access.
Scalability: Designed for horizontal scaling and distributed storage.
Speed: Ideal for caching and real-time applications.
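As a rough illustration of the caching and session-management use case, the sketch below wraps a Python dictionary in a small class with optional expiry. The `KeyValueStore` class, its `ttl_seconds` parameter, and the `session:42` key are assumptions made for this example and do not correspond to the API of Redis, Riak, or DynamoDB.

```python
import time


class KeyValueStore:
    """In-memory key-value store with optional per-key expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be checked without sleeping

    def put(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return default
        return value


store = KeyValueStore()
# The value can be a complex object, here a session record kept for 30 minutes.
store.put("session:42", {"user": "alice", "cart": ["book"]}, ttl_seconds=1800)
print(store.get("session:42"))
print(store.get("session:99", default="miss"))
```

Every read and write goes through a single key, which is what makes this model simple to distribute across servers.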
4. Graph-Based Databases
Graph-based databases focus on the relationships between elements. They store data as
nodes, and the connections between nodes are called links or relationships, which makes
these databases ideal for complex relationship-based queries.
Data is represented as nodes (objects) and edges (connections).
Fast graph traversal algorithms help retrieve relationships quickly.
Used in scenarios where relationships are as important as the data itself.
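A small Python sketch can show how nodes, edges, and traversal fit together. It assumes an undirected "friend" relationship and uses a breadth-first traversal to find friends-of-friends; the names and the `within_hops` helper are hypothetical, and real graph databases use their own storage formats and query languages.

```python
from collections import deque

# Nodes are people; each edge is an undirected "friend" relationship.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]

# Adjacency sets: node -> set of directly connected nodes.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Friends-of-friends: nodes exactly two hops away from alice.
reach = within_hops(graph, "alice", 2)
suggestions = sorted(n for n, d in reach.items() if d == 2)
print(suggestions)
```

This kind of "who is connected to whom, and how closely" question is the sort of query behind the recommendation-engine use case.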
Conclusion
NoSQL databases offer flexibility, scalability, and high performance, making them an essential
part of modern applications dealing with big data, real-time analytics, and distributed
systems. Choosing the right NoSQL database type depends on data structure, scalability
requirements, and query performance needs. By understanding these NoSQL database types
and their advantages, businesses and developers can make data-driven decisions to optimize
performance and scalability.
Architecture Patterns of NoSQL:
Data in NoSQL is stored using one of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the
data is stored in the form of key-value pairs. The key is usually a string, an integer, or a
sequence of characters, but it can also be a more advanced data type. The value is typically
linked or correlated to the key. Key-value databases generally store data as a hash table
in which each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object),
strings, etc.). This pattern is commonly used in shopping websites and e-commerce
applications.
Advantages:
Can handle large amounts of data and heavy load.
Easy retrieval of data by keys.
Limitations:
Complex queries may need to touch multiple key-value pairs, which can slow performance.
Data involving many-to-many relationships is difficult to represent and may lead to
conflicts.
Examples:
DynamoDB
Berkeley DB
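The trade-off between easy retrieval by key and slower complex queries can be sketched with a Python dictionary standing in for the hash table. The `carts` data and both helper functions are hypothetical and only illustrate the access pattern, not the behaviour of DynamoDB or Berkeley DB.

```python
# Shopping carts keyed by user id, as in an e-commerce application.
carts = {
    "user:1": ["book", "pen"],
    "user:2": ["laptop"],
    "user:3": ["pen", "mouse"],
}


def cart_for(user_key):
    """Direct key access: a single hash-table lookup."""
    return carts.get(user_key, [])


def users_with_item(item):
    """There is no key for this question, so every pair has to be scanned."""
    return sorted(key for key, items in carts.items() if item in items)


print(cart_for("user:2"))
print(users_with_item("pen"))
```

The first function stays fast as the table grows, while the cost of the second grows with the number of stored pairs.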
2. Column Store Database:
Rather than storing data in relational tuples (rows), the data is stored in individual cells
that are grouped into columns. Column-oriented databases operate on columns and store large
amounts of data together by column. The names and formats of the columns can vary from
one row to another. Every column is treated separately, yet an individual column may still
contain multiple other columns, as in traditional databases.
In short, the column is the unit of storage in this pattern.
Advantages:
Data is readily available
Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
HBase
Bigtable by Google
Cassandra
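To see why aggregates such as SUM, AVERAGE, and COUNT suit this pattern, the Python sketch below converts a few row-style records into a column-style layout. The `rows` data and field names are made up for illustration, and the in-memory dictionaries are only an analogy for how HBase, Bigtable, or Cassandra lay out data on disk.

```python
from statistics import mean

# Row-oriented layout: one record per entry.
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 100.0},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate reads only the single column it needs.
amounts = columns["amount"]
total = sum(amounts)
average = mean(amounts)
count = len(amounts)
print(total, average, count)
```

In the column layout the `order_id` and `region` values are never touched when summing `amount`, which is the intuition behind the analytics use case.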
3. Document Database:
The document database stores and retrieves data as key-value pairs, but here the values are
called documents. A document is a complex data structure: it can take the form of text,
arrays, strings, JSON, XML, or a similar format. Nested documents are also very common.
This pattern is effective because much of the data generated today is in JSON form and
lacks a fixed structure.
Advantages:
This format is well suited to semi-structured data.
Storage, retrieval, and management of documents are easy.
Limitations:
Handling multiple documents at once is challenging.
Aggregation operations across documents can be difficult to perform accurately.
Examples:
MongoDB
CouchDB
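Nested documents, and the reason aggregation across them needs care, can be sketched in Python as follows. The `patients` records, the field names, and the `get_path` helper are hypothetical; MongoDB and CouchDB provide their own query and aggregation facilities, which are not shown here.

```python
def get_path(document, dotted_path, default=None):
    """Follow a dotted path such as 'vitals.bp.systolic' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return default  # the path does not exist in this document
    return current


patients = [
    {"_id": "p1", "name": "Asha", "vitals": {"bp": {"systolic": 120}}},
    {"_id": "p2", "name": "Ben", "vitals": {"bp": {"systolic": 140}}},
    {"_id": "p3", "name": "Chen"},  # no vitals recorded for this patient
]

# Aggregating over a flexible schema: documents missing the field must be
# skipped explicitly, otherwise the average would be wrong or would fail.
readings = [get_path(p, "vitals.bp.systolic") for p in patients]
values = [v for v in readings if v is not None]
average_systolic = sum(values) / len(values)
print(average_systolic)
```

The average here covers only the two patients that have a reading, which is one way a result over loosely structured documents can differ from what a reader expects.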
4. Graph Databases:
This architecture pattern deals with the storage and management of data in graphs. Graphs
are structures that depict connections between two or more objects in some data. The
objects or entities are called nodes and are joined together by relationships called edges.
Each edge has a unique identifier, and each node serves as a point of contact for the graph.
This pattern is very commonly used in social networks, where there are a large number of
entities and each entity has one or more characteristics that are connected by edges. In
the relational pattern, tables are only loosely connected, whereas in a graph the
connections are explicit and strongly defined.
Advantages:
Fast traversal, because related nodes are directly connected.
Spatial data can be easily handled.
Limitations:
Incorrect or cyclic connections may lead to infinite loops during traversal.
Examples:
Neo4j
FlockDB (used by Twitter)
Figure - Graph model format of NoSQL Databases
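The infinite-loop limitation comes from cycles in the graph. The Python sketch below shows one common safeguard, a set of already visited nodes, on a small directed graph that contains a cycle. The `follows` data and the `reachable` function are hypothetical and are not taken from Neo4j or FlockDB.

```python
# A directed graph that contains a cycle: a -> b -> c -> a.
follows = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}


def reachable(graph, start):
    """Depth-first traversal that records visited nodes to avoid looping forever."""
    visited = set()
    stack = [start]
    order = []
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # already expanded; skipping here is what breaks the cycle
        visited.add(node)
        order.append(node)
        stack.extend(graph.get(node, []))
    return order


print(reachable(follows, "a"))
```

Without the `visited` check, the traversal would keep following a, b, c, a and never terminate.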
✅ Pros:
❌ Cons:
Industry      Application
E-commerce    Product catalog management using MongoDB.
Finance       Fraud detection using graph databases.
Healthcare    Patient data aggregation using Couchbase.
Social Media  Relationship graphs and news feed generation.
IoT           Sensor data ingestion and querying using Cassandra.
Conclusion
NoSQL Big Data Management is about leveraging NoSQL databases to efficiently store,
manage, and analyze massive, diverse, and fast-evolving datasets. It’s essential for modern
applications where flexibility, speed, and scalability matter more than rigid structure and ACID
compliance.
MongoDB
Cassandra
Features
Schema-less design.
High performance for read/write operations.
Designed for horizontal scaling (adding servers).
Efficient handling of unstructured and semi-structured data.
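Horizontal scaling relies on spreading keys across servers. The Python sketch below shows one simple way to do that, hashing each key and taking the result modulo the number of servers. The `shard_for` and `distribute` functions and the `user:N` keys are assumptions for illustration and do not describe the actual partitioning scheme of MongoDB or Cassandra.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard (server) that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user:{i}" for i in range(1000)]
# Compare how the same keys spread over 3 servers and then over 4.
for n in (3, 4):
    sizes = [len(v) for v in distribute(keys, n).values()]
    print(n, sizes)
```

A plain modulo scheme moves many keys when a server is added, so production systems typically use more careful schemes such as consistent hashing to limit that movement.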
Definition
Advantages
High availability
Easy horizontal scalability
Fault isolation
Avoids bottlenecks associated with shared resources
Common in
Cassandra
MongoDB (sharded clusters)
Hadoop (HDFS + MapReduce)
MongoDB
Key Features
Document-oriented
Schema-less (dynamic schemas)
Indexing support
Aggregation framework
Built-in replication and sharding
Architecture
Common Commands
db.collection.insertOne()
db.collection.find()
db.collection.updateOne()
db.collection.aggregate()
MongoDB Databases
Structure
Advantages
Use Cases
Cassandra Databases
Key Features
Architecture
Data Model
Use Cases
Time-series data
Logging platforms
Sensor data in IoT
High-speed write environments
NoSQL databases are an essential part of Big Data management, designed to handle massive
volumes of structured, semi-structured, and unstructured data with high scalability, availability,
and performance.
What is NoSQL?
NoSQL stands for “Not Only SQL.” It refers to a broad class of database management systems
that differ from traditional relational databases (RDBMS).
1. Apache HBase
bash
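# HBase shell example: start the interactive shell
hbase shell
# Create a 'users' table with one column family, 'info'
create 'users', 'info'
# Store 'Alice' in column info:name of row 'user1'
put 'users', 'user1', 'info:name', 'Alice'
# Read back the cells of row 'user1'
get 'users', 'user1'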
2. MongoDB
js
// MongoDB shell example
// Insert one document into the 'users' collection
db.users.insertOne({ name: "Alice", age: 25 })
// Find all users older than 20
db.users.find({ age: { $gt: 20 } })
3. Apache Cassandra
sql
-- CQL (Cassandra Query Language) example
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
email TEXT
);
-- 'bob@example.com' is a placeholder email address
INSERT INTO users (user_id, name, email) VALUES (uuid(), 'Bob', 'bob@example.com');
1. Discuss how NoSQL databases are suitable for managing Big Data. Provide examples.
2. Describe the architecture and key components of MongoDB.
3. Discuss NoSQL Data Architecture Patterns with suitable examples.
8 marks:
1. Explain the different types of NoSQL data stores. How do they differ from traditional
relational databases in terms of structure and scalability?
2. Discuss the NoSQL data architecture patterns commonly used in big data systems.
Support your answer with examples.
3. Explain the characteristics of NoSQL in detail.