
Module 3 Bigdata Analytics

The document provides an overview of NoSQL databases, highlighting their characteristics, types, and advantages for managing big data. It discusses various NoSQL database types such as document stores, key-value stores, column-family stores, and graph databases, emphasizing their flexibility, scalability, and performance. Additionally, it outlines the role of NoSQL in big data architecture, its pros and cons, and real-world applications in various industries.

Uploaded by

reethavinodha

BIGDATA ANALYTICS

NoSQL Big Data Management, MongoDB and Cassandra:


Introduction, NoSQL Data Store, NoSQL Data Architecture Patterns,
NoSQL to Manage Big Data, Shared-Nothing Architecture for Big
Data Tasks, MongoDB, Databases, Cassandra Databases.
NoSQL Big Data Management

Overview

• NoSQL (Not Only SQL) databases are designed for scalability, performance, and
flexibility, which makes them well suited to big data applications.
• Traditional relational databases (RDBMS) struggle with massive data volume, velocity,
and variety, commonly referred to as the 3Vs of Big Data.

NoSQL:

NoSQL stands for “Not Only SQL”. It refers to a category of database management systems
that differ from traditional relational databases (RDBMS). NoSQL is designed to handle large
volumes of unstructured, semi-structured, or structured data, making it ideal for Big Data
applications.

Characteristics of NoSQL Databases

Feature | Description
Schema-less | No fixed schema; allows flexible data models.
Horizontal Scalability | Easily scalable by adding more servers.
High Performance | Optimized for read/write speed on massive datasets.
Distributed Architecture | Built to run across multiple machines or data centers.
Support for Big Data | Efficiently manages vast, complex, and fast-changing data.

NoSQL databases are characterized by their flexibility, scalability, and ability to handle large
volumes of diverse data. They offer distributed computing, flexible schemas, and are often more
cost-effective than traditional relational databases. They excel in handling unstructured and semi-
structured data, making them suitable for modern, agile applications.
Here's a more detailed look at the characteristics:

1. Flexible Data Models:
• NoSQL databases support various data models, including key-value, document,
column-family, and graph, allowing for diverse data storage and retrieval needs.
• They offer schema-less or schema-on-read options, meaning you don't need a rigid
schema before adding data, providing more flexibility for evolving data structures.
2. High Scalability and Availability:
• NoSQL databases are designed for horizontal scalability, meaning you can add more
nodes to your cluster to handle increased data volume and traffic, typically without downtime.
• They are often built with distributed computing in mind, allowing for efficient data
management across multiple servers or nodes.
• High availability is a key focus, with replication strategies to ensure data redundancy and
prevent single points of failure.
3. Handling Unstructured and Semi-structured Data:
• NoSQL databases are well-suited for handling diverse data types, including unstructured
and semi-structured data, which are common in modern applications.
• They can easily adapt to changing data requirements and evolving data structures, unlike
relational databases with rigid schemas.
4. Cost-Effectiveness:
• NoSQL databases can be more cost-effective than traditional relational databases,
especially for large-scale deployments.
• Their ability to scale out horizontally on commodity servers often reduces the need for
expensive hardware upgrades.
5. Performance and Speed:
• NoSQL databases can offer high performance and low latency, particularly when
handling large data volumes and high traffic.
• Their distributed nature and optimized data models contribute to faster read and write
operations.
6. Distributed Architecture:
• Many NoSQL databases are built with distributed systems in mind, allowing for efficient
data management across multiple nodes or servers.
• This distributed approach enables horizontal scalability and fault tolerance.
7. Agile Development:
• NoSQL databases are well-suited for agile development methodologies due to their
flexible data models and ability to adapt to changing requirements.
• They allow developers to quickly iterate and build applications without being constrained
by rigid schemas.

Types of NoSQL Databases

1. Key-Value Stores
o Structure: Data stored as a collection of key-value pairs.
o Use Case: Caching, session management.
o Examples: Redis, Riak, Amazon DynamoDB.
2. Document Stores
o Structure: JSON, BSON, or XML documents.
o Use Case: Content management systems, e-commerce platforms.
o Examples: MongoDB, CouchDB.
3. Column-Family Stores
o Structure: Rows identified by a key, with columns grouped into column families; each
row can hold a different set of columns.
o Use Case: Write-heavy workloads such as time-series, IoT, and log data.
o Examples: Apache Cassandra, HBase.
4. Graph Databases
o Structure: Nodes and edges representing entities and relationships.
o Use Case: Social networks, recommendation engines.
o Examples: Neo4j, Amazon Neptune.

Types of NoSQL Databases


A database is a collection of structured data or information that is stored in a computer system
and can be accessed easily. A database is usually managed by a Database Management System
(DBMS). NoSQL databases are a category of non-relational databases designed to
handle large-scale, unstructured, and semi-structured data efficiently.

Unlike traditional relational databases (RDBMS) that store data in structured tables, NoSQL
databases offer flexibility, scalability, and high-performance solutions for modern
applications.

NoSQL databases can be classified into four main types, based on their data
storage and retrieval methods:

1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases

Each type has unique advantages and use cases, making NoSQL a preferred choice for big
data applications, real-time analytics, cloud computing, and distributed systems.
1. Document-Based Database

A document-based database is a non-relational database. Instead of storing data in rows
and columns (tables), it stores data as documents in JSON, BSON, or XML format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in
applications, which means less translation is required to use the data in application code.
Individual fields of a document can be indexed for faster querying.
A collection is a group of documents with similar contents. Documents in the same collection
are not required to share a schema, because document databases have a flexible schema.
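
To make the flexible schema and field indexing concrete, here is a minimal sketch in plain Python. It models a collection as a list of dictionaries; the `products` data and the `build_index` helper are assumptions made for illustration and are not part of any database API.

```python
from collections import defaultdict

# A "collection" modelled as a plain list of dict documents (toy model, not a real database).
# Note that the three documents do not share the same set of fields.
products = [
    {"_id": 1, "name": "Laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
    {"_id": 3, "name": "E-book", "price": 15},
]

def build_index(collection, field):
    """Map each value of `field` to the _ids of the documents that contain it.

    Hypothetical helper: works for fields whose values are hashable (numbers, strings).
    """
    index = defaultdict(list)
    for doc in collection:
        if field in doc:
            index[doc[field]].append(doc["_id"])
    return dict(index)

price_index = build_index(products, "price")
print(price_index[15])   # ids of the documents whose price is 15
```

With the index in place, finding all documents at a given price is a single dictionary lookup instead of a scan over the whole collection.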

Key features of document databases:

• Flexible schema: Documents in the database have a flexible schema, so they do not
all need to follow the same structure.
• Faster creation and maintenance: Documents are easy to create and require minimal
maintenance once created.
• No foreign keys: Documents are independent of one another and relationships between
them are not enforced, so a document database has no requirement for foreign keys.
• Open formats: Documents are built using open formats such as XML and JSON.
Popular Document Databases & Use Cases

Database | Use Case
MongoDB | Content management, product catalogs, user profiles
CouchDB | Offline applications, mobile synchronization
Firebase Firestore | Real-time apps, chat applications

2. Key-Value Stores

A key-value store is a non-relational database and the simplest form of NoSQL database.
Every data element is stored as a key-value pair, and the data is retrieved by using the
unique key allotted to each element. The values can be simple data types like strings and
numbers or complex objects. Conceptually, a key-value store resembles a table with only two
columns: the key and the value.
Key features of the key-value store:
• Simplicity: The data model is just keys and values, accessed directly by key.
• Scalability: Designed for horizontal scaling and distributed storage.
• Speed: Direct key access makes retrieval very fast, which is ideal for caching and
real-time applications.

Popular Key-Value Databases & Use Cases

Database | Use Case
Redis | Caching, real-time leaderboards, session storage
Memcached | High-speed in-memory caching
Amazon DynamoDB | Cloud-based scalable applications
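
The caching and session-storage use cases can be sketched with a small in-memory store. This is a toy illustration in plain Python, not the API of Redis or any other product; the `KeyValueStore` class, its `ttl_seconds` parameter and the injectable `clock` are all assumptions of the sketch.

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with optional per-key expiry (hypothetical class)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}          # key -> (value, expiry time or None)
        self._clock = clock      # injectable clock, which keeps the example testable

    def put(self, key, value, ttl_seconds=None):
        expires = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires = entry
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book"]}, ttl_seconds=1800)
print(store.get("session:42"))
print(store.get("session:99", default="miss"))
```

Every operation goes through a single key, which is why this model is fast and easy to distribute, and also why queries over the contents of the values are not supported.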


3. Column-Oriented Databases
A column-oriented database is a non-relational database that stores data by column instead
of by row. When an analytic query touches only a small number of columns, those columns can
be read directly without loading the rest of each row into memory. This layout lets columnar
databases scan and aggregate large amounts of data efficiently. Wide-column stores such as
Apache Cassandra and HBase are usually listed in this category as well, although they organize
data as rows whose columns are grouped into column families rather than as a purely
columnar layout.
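
The difference between the two layouts can be shown with a small sketch in plain Python. The order data and the `to_columns` helper are assumptions made for illustration; real columnar engines add compression and on-disk formats that are not modelled here.

```python
# The same three records in a row-oriented and a column-oriented layout (toy data).
rows = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 7.5},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every full record is visited just to read one field.
total_from_rows = sum(record["amount"] for record in rows)
# Column layout: only the 'amount' column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; what changes is how much data has to be touched to compute them.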

Key features of Column-Oriented Databases

• High Scalability: Supports distributed data processing.
• Compression: Columnar storage enables efficient data compression.
• Faster Query Performance: Best for analytical queries.

Popular Column-Oriented Databases & Use Cases

Database | Use Case
Apache Cassandra | Real-time analytics, IoT applications
Google Bigtable | Large-scale machine learning, time-series data
HBase | Hadoop ecosystem, distributed storage

4. Graph-Based Databases

Graph-based databases focus on the relationships between elements. Data is stored as
nodes, and the connections between nodes are called links, edges, or relationships, which
makes these databases ideal for complex relationship-based queries.
• Data is represented as nodes (objects) and edges (connections).
• Fast graph traversal algorithms help retrieve relationships quickly.
• Used in scenarios where relationships are as important as the data itself.

Key features of Graph Databases

• Relationship-Centric Storage: Perfect for social networks, fraud detection, and
recommendation engines.
• Real-Time Query Processing: Relationship queries follow stored edges directly instead of
computing joins, so they can return results with low latency.
• Schema Flexibility: Easily adapts to evolving relationship structures.

Popular Graph Databases & Use Cases

Database | Use Case
Neo4j | Fraud detection, social networks
Amazon Neptune | Knowledge graphs, AI recommendations
ArangoDB | Multi-model database, cybersecurity
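
As an illustration of a relationship-based query, the sketch below ranks "friends of friends" in a tiny social graph. It uses plain Python dictionaries rather than a graph database; the `friends` data and the `recommend` function are assumptions made for this example.

```python
# Adjacency sets: each node maps to the nodes it is directly connected to (toy social graph).
friends = {
    "ana": {"ben", "cleo"},
    "ben": {"ana", "dev"},
    "cleo": {"ana", "dev", "eli"},
    "dev": {"ben", "cleo"},
    "eli": {"cleo"},
}

def recommend(graph, person):
    """Rank friends-of-friends by how many mutual friends they share with `person`."""
    counts = {}
    for friend in graph[person]:
        for candidate in graph[friend]:
            # Skip the person themselves and people they already know.
            if candidate != person and candidate not in graph[person]:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Most mutual friends first; ties broken alphabetically for a stable result.
    return sorted(counts, key=lambda name: (-counts[name], name))

print(recommend(friends, "ana"))   # ['dev', 'eli']
```

The query only walks the edges around one node, so its cost depends on the size of that neighbourhood rather than on the size of the whole data set.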

Comparison of NoSQL Database Types

Feature | Document-Based | Key-Value Store | Column-Oriented | Graph-Based
Data Model | JSON-like documents | Key-value pairs | Columns instead of rows | Nodes & relationships
Best Use Case | Semi-structured data | Fast lookups & caching | Analytics & big data | Relationship-heavy data
Query Performance | Moderate | Fast | High for analytics | Optimized for relationships
Schema | Flexible | Dynamic | Semi-structured | Schema-less
Scalability | Horizontal | Highly scalable | High horizontal | Scales with relationships
Examples | MongoDB, CouchDB | Redis, DynamoDB | Cassandra, HBase | Neo4j, Amazon Neptune


Conclusion
NoSQL databases offer flexibility, scalability, and high performance, making them an essential
part of modern applications dealing with big data, real-time analytics, and distributed
systems. Choosing the right NoSQL database type depends on data structure, scalability
requirements, and query performance needs. By understanding these NoSQL database types
and their advantages, businesses and developers can make data-driven decisions to optimize
performance and scalability.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture patterns.

1. Key-Value Store Database

2. Column Store Database

3. Document Database

4. Graph Database

These are explained below.

1. Key-Value Store Database:

This model is one of the most basic models of NoSQL databases. As the name suggests, the
data is stored in the form of key-value pairs. The key is usually a string, an integer, or a
sequence of characters, but can also be a more advanced data type. The value is the data
associated with the key. Key-value databases generally store data as a hash table
where each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object),
strings, etc.). This pattern is commonly used in shopping websites and e-commerce
applications.

Advantages:
• Can handle large amounts of data and heavy load.
• Easy retrieval of data by keys.
Limitations:
• Complex queries may need to touch multiple key-value pairs, which can slow
performance.
• Many-to-many relationships are hard to represent, because the store only understands
keys and does not relate one value to another.
Examples:
• DynamoDB
• Berkeley DB
2. Column Store Database:

Rather than storing data as relational tuples, a column store keeps data in individual cells
that are grouped into columns, and large amounts of data belonging to the same column are
stored together. The names and format of the columns can differ from one row to the next.
Each column is treated separately, although a column (often called a column family) may
itself contain multiple columns.
In short, the column is the unit of storage in this pattern.
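
The idea that column names can differ from row to row, and that columns are grouped into families, can be sketched with nested dictionaries. This is a toy model in plain Python, assuming a layout of row key, then column family, then column name; `put_cell` and `get_cell` are hypothetical helpers, not calls from HBase or Cassandra.

```python
# row key -> column family -> column name -> value (toy model of a wide-column layout).
table = {}

def put_cell(row_key, family, column, value):
    """Write one cell; rows and families are created on first use."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get_cell(row_key, family, column, default=None):
    """Read one cell, returning `default` when the row lacks that column."""
    return table.get(row_key, {}).get(family, {}).get(column, default)

put_cell("user1", "info", "name", "Alice")
put_cell("user1", "info", "email", "alice@example.com")
put_cell("user2", "info", "name", "Bob")
put_cell("user2", "stats", "logins", 7)   # user1 has no 'stats' family at all

print(sorted(table["user1"]["info"]))                   # ['email', 'name']
print(get_cell("user1", "stats", "logins", default=0))  # 0
```

Rows that lack a column simply do not store it, which is why sparse data fits this pattern well.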

Advantages:
• Reading a few columns does not require loading entire rows.
• Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
• HBase
• Bigtable by Google
• Cassandra

3. Document Database:

The document database stores and retrieves data in the form of key-value pairs, but here the
values are called documents. A document is a complex data structure that can contain
text, strings, and arrays, and is typically encoded as JSON, XML, or a similar format. The use
of nested documents is also very common. This pattern is effective because much of the data
created by applications is already JSON and is semi-structured.

Advantages:
• This type of format is very useful and apt for semi-structured data.
• Storing, retrieving, and managing documents is easy.
Limitations:
• Operations that span multiple documents (for example joins) are harder than in a
relational database.
• Aggregations across documents with differing structures may give incomplete results
when fields are missing or inconsistent.
Examples:
• MongoDB
• CouchDB

Figure - Document Store Model in form of JSON documents

4. Graph Databases:
This architecture pattern deals with the storage and management of data in graphs.
Graphs are structures that depict connections between two or more objects in some
data. The objects or entities are called nodes and are joined together by relationships called
edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph.
This pattern is very commonly used in social networks, where there are a large number of
entities and each entity has one or many characteristics which are connected by edges. In a
relational database, tables are only loosely connected through keys that are matched at query
time, whereas in a graph database the relationships are stored explicitly as part of the data.

Advantages:
• Fast traversal, because connections are stored directly.
• Spatial data can be easily handled.
Limitations:
• Incorrect connections (for example unintended cycles) can send a naive traversal into
an infinite loop.

Examples:
• Neo4J
• FlockDB (used by Twitter)
Figure - Graph model format of NoSQL Databases
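
The infinite-loop limitation is usually avoided by remembering which nodes have already been visited. The sketch below is a standard breadth-first traversal in plain Python over a small graph that contains a cycle; the `edges` data and the `reachable` function name are assumptions made for illustration.

```python
from collections import deque

# Directed edges that contain a cycle: a -> b -> c -> a (toy data).
edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}

def reachable(graph, start):
    """Breadth-first traversal; the `visited` set stops cycles from looping forever."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

print(reachable(edges, "a"))   # ['a', 'b', 'c', 'd']
```

Without the `visited` check, the edge from c back to a would put a on the queue again and the loop would never finish.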

NoSQL for Big Data:

Challenge in Big Data | NoSQL Advantage
Diverse data formats | Schema flexibility
Rapid data ingestion | High write throughput
Need for real-time analytics | In-memory processing (e.g., Redis)
Horizontal scalability | Easy to scale out across many servers
Fault tolerance | Distributed and replicated systems

NoSQL in Big Data Architecture

Here’s how NoSQL fits into a typical Big Data stack:

1. Data Sources: Social media, IoT devices, logs, transactions.
2. Ingestion: Tools like Apache Kafka, Flume.
3. Storage:
o NoSQL databases store the ingested data.
o The NoSQL type is chosen based on the access pattern.
4. Processing:
o Batch (Hadoop MapReduce)
o Real-time (Apache Storm, Spark Streaming)
5. Analytics & Visualization: Power BI, Tableau, custom dashboards.
Pros and Cons of NoSQL for Big Data

✅ Pros:

• Flexibility to store any type of data.
• Scales out horizontally and sustains high read/write throughput.
• Designed for cloud and distributed environments.
• Better suited for real-time big data workloads.

❌ Cons:

• Lack of standardization across NoSQL systems.
• Querying can be more complex (no standard SQL).
• May lack strong consistency (depending on CAP trade-offs).

Real-World Use Cases

Industry | Application
E-commerce | Product catalog management using MongoDB.
Finance | Fraud detection using graph databases.
Healthcare | Patient data aggregation using Couchbase.
Social Media | Relationship graphs and news feed generation.
IoT | Sensor data ingestion and querying using Cassandra.

Conclusion

NoSQL Big Data Management is about leveraging NoSQL databases to efficiently store,
manage, and analyze massive, diverse, and fast-evolving datasets. It’s essential for modern
applications where flexibility, speed, and scalability matter more than rigid structure and ACID
compliance.

MongoDB and Cassandra: Introduction

MongoDB

• A document-oriented NoSQL database.
• Stores data as JSON-like documents in a binary format called BSON.
• Highly flexible schema; ideal for dynamic, semi-structured, or hierarchical data.
• Common use cases: content management, IoT, real-time analytics.

Cassandra

• A wide-column store, distributed NoSQL database.
• Developed originally at Facebook; its design draws on Google Bigtable (data model) and
Amazon Dynamo (distribution and replication).
• Known for high availability, horizontal scalability, and fault tolerance.
• Suitable for high-write environments like messaging, IoT telemetry, or log aggregation.

NoSQL Data Store

Types of NoSQL Databases

1. Document Stores (e.g., MongoDB)
o Store semi-structured data as documents (JSON/BSON).
2. Key-Value Stores (e.g., Redis, Riak)
o Store data as key-value pairs.
3. Column-Family Stores (e.g., Cassandra, HBase)
o Store data in rows and columns, grouped into column families.
4. Graph Databases (e.g., Neo4j)
o Represent data as nodes and relationships (edges).

Features

• Schema-less design.
• High performance for read/write operations.
• Designed for horizontal scaling (adding servers).
• Efficient handling of unstructured and semi-structured data.

NoSQL Data Architecture Patterns

1. Single Table Design (Key-Value)
o All data stored in one large table with unique keys.
o Fast lookups.
2. Sharded Cluster
o Data partitioned across multiple machines (shards).
o Useful for distributed processing and scaling.
3. Polyglot Persistence
o Using different database types for different components of an application.
4. Event Sourcing
o State changes stored as a sequence of events.
o Common in CQRS (Command Query Responsibility Segregation) systems.
5. Materialized Views
o Precomputed data views optimized for queries.
o Reduces computation at query time.
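
Event sourcing and materialized views are often used together, and the combination can be sketched in a few lines of plain Python. The account events, the event type names and the `apply_event` function below are assumptions made for illustration, not part of any particular framework.

```python
# Append-only event log: state is never overwritten, only new events are added (toy data).
events = [
    {"account": "acc-1", "type": "deposited", "amount": 100},
    {"account": "acc-2", "type": "deposited", "amount": 50},
    {"account": "acc-1", "type": "withdrawn", "amount": 30},
]

def apply_event(balances, event):
    """Fold one event into the precomputed balances view.

    Assumption of this sketch: any event that is not 'deposited' is a withdrawal.
    """
    sign = 1 if event["type"] == "deposited" else -1
    account = event["account"]
    balances[account] = balances.get(account, 0) + sign * event["amount"]
    return balances

# Materialized view: built once by replaying the log, then updated as each event arrives.
balances = {}
for event in events:
    apply_event(balances, event)

new_event = {"account": "acc-2", "type": "withdrawn", "amount": 20}
events.append(new_event)
apply_event(balances, new_event)

print(balances)   # {'acc-1': 70, 'acc-2': 30}
```

Queries read the small `balances` view, while the full history stays available in `events` and can be replayed to rebuild the view at any time.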

NoSQL to Manage Big Data

Why NoSQL for Big Data?

• Scalability: Clusters can grow to hold very large datasets (terabytes to petabytes) by
adding nodes.
• Flexibility: Accommodates varying data formats.
• High throughput: Optimized for large-scale read/write operations.
• Real-time performance: Supports fast data ingestion and processing.

Use Cases

• Social media platforms
• E-commerce recommendation engines
• Real-time analytics dashboards
• IoT sensor data collection

Shared-Nothing Architecture for Big Data Tasks

Definition

• Each node is independent and self-sufficient.
• No shared memory or disk between nodes.

Advantages

• High availability
• Easy horizontal scalability
• Fault isolation
• Avoids bottlenecks associated with shared resources

Common in

• Cassandra
• MongoDB (sharded clusters)
• Hadoop (HDFS + MapReduce)
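
A minimal sketch of the shared-nothing idea, in plain Python: every node owns a private dictionary, and a key is routed to exactly one node by hashing. The `Node` class, `shard_for` function and simple modulo placement are assumptions made for illustration; the systems listed above use more elaborate placement schemes.

```python
import hashlib

class Node:
    """One independent node with its own private storage (hypothetical class)."""
    def __init__(self, name):
        self.name = name
        self.storage = {}   # nothing here is shared with other nodes

def shard_for(key, nodes):
    """Pick a node from a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = [Node("node-0"), Node("node-1"), Node("node-2")]

def put(key, value):
    shard_for(key, nodes).storage[key] = value

def get(key):
    return shard_for(key, nodes).storage.get(key)

for user_id in ("u1", "u2", "u3", "u4", "u5", "u6"):
    put(user_id, {"id": user_id})

print({node.name: sorted(node.storage) for node in nodes})
print(get("u3"))
```

Because each request touches only the node that owns the key, nodes do not contend for shared memory or disk. One known drawback of modulo placement is that changing the number of nodes moves most keys, which is why production systems tend to prefer schemes such as consistent hashing.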

MongoDB

Key Features

• Document-oriented
• Schema-less (dynamic schemas)
• Indexing support
• Aggregation framework
• Built-in replication and sharding

Architecture

• Replica Sets: Ensure high availability via primary-secondary replication.
• Sharding: Distributes data across multiple servers (horizontal scaling).
• Uses BSON for document storage.

Common Commands

• db.collection.insertOne() / db.collection.insertMany()
• db.collection.find()
• db.collection.updateOne() / db.collection.updateMany()
• db.collection.aggregate()
(The older db.collection.insert() and db.collection.update() forms are deprecated in the
current MongoDB shell.)
MongoDB Databases

Structure

• Database > Collections > Documents
• A collection is roughly equivalent to a table in an RDBMS.
• A document is roughly equivalent to a row.

Advantages

• Flexible schema: Add/remove fields without affecting other documents.
• Embedded documents and arrays allow data locality.
• Suited for fast development cycles.

Use Cases

• Content management systems
• Real-time analytics
• Catalog systems

Cassandra Databases

Key Features

• Distributed and decentralized (peer-to-peer architecture)
• Linear scalability (add nodes without downtime)
• High availability and fault tolerance
• Tunable consistency model
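
Tunable consistency is commonly explained with a simple overlap rule: if the number of replicas that must acknowledge a read (R) plus the number that must acknowledge a write (W) is greater than the replication factor (RF), every read overlaps the latest successful write. The sketch below only illustrates that arithmetic in plain Python; the function names are assumptions, and it does not model timeouts, repairs or other behaviour of a real cluster.

```python
def quorum(replication_factor):
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor

rf = 3
print(quorum(rf))                                           # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))   # True  (QUORUM write, QUORUM read)
print(reads_see_latest_write(rf, 1, 1))                     # False (ONE write, ONE read)
```

Lower settings such as ONE favour latency and availability, while QUORUM or ALL favour consistency; the choice can be made per operation.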

Architecture

• No master node: All nodes are equal.
• Uses a gossip protocol for inter-node communication.
• Partitioners distribute data across nodes.
• Replication ensures fault tolerance.
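
How a partitioner and replication can work together is sketched below as a toy token ring in plain Python. It assumes one token per node, a SHA-256 based hash and replicas placed on the next nodes clockwise; the `Ring` class and `token` function are hypothetical, and Cassandra's actual partitioner and replication strategies differ in detail.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (the hash choice is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class Ring:
    """Toy token ring: one token per node, replicas placed on the next nodes clockwise."""

    def __init__(self, node_names, replication_factor):
        self.replication_factor = min(replication_factor, len(node_names))
        self.tokens = sorted((token(name), name) for name in node_names)

    def replicas(self, partition_key):
        # First node whose token is >= the key's token, wrapping around at the end.
        start = bisect.bisect_left(self.tokens, (token(partition_key), ""))
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(self.replication_factor)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("sensor-17"))
```

Any node can compute the same replica list from the key alone, which is what allows a cluster without a master to route requests.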

Data Model

• Keyspace: Equivalent to a database
• Column Families: Equivalent to tables
• Rows: Identified by primary keys
• Columns: Stored in sorted order, grouped by families

Use Cases

• Time-series data
• Logging platforms
• Sensor data in IoT
• High-speed write environments

NoSQL databases are an essential part of Big Data management, designed to handle massive
volumes of structured, semi-structured, and unstructured data with high scalability, availability,
and performance.

🔍 What is NoSQL?

NoSQL stands for “Not Only SQL.” It refers to a broad class of database management systems
that:

• Don’t rely strictly on relational models or SQL.
• Are schema-less or have flexible schemas.
• Are built for distributed, scalable architectures.
• Handle high volumes of data efficiently.

🔍 Types of NoSQL Databases (with Big Data Focus)

Type | Description | Examples | Big Data Use Cases
Key-Value Store | Data stored as key-value pairs. | Redis, Amazon DynamoDB, Riak | Caching, session storage
Document Store | Stores documents (usually JSON/BSON). | MongoDB, CouchDB | Content management, real-time analytics
Column-Family Store | Stores data in rows and dynamic columns. | Apache Cassandra, HBase | Time-series data, IoT, log aggregation
Graph Database | Focused on nodes, edges, and properties. | Neo4j, Amazon Neptune | Social networks, recommendation systems

🔍 NoSQL in Big Data Ecosystem

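1. Apache HBase

• Based on Hadoop and HDFS.
• Wide-column database optimized for real-time reads/writes on large data sets.
• Suited to random, real-time read/write access to very large tables. It provides atomic
operations on a single row rather than full multi-row transactions, so it complements rather
than replaces a relational OLTP database.

# HBase shell example
hbase shell                                   # start the interactive shell
create 'users', 'info'                        # table 'users' with one column family 'info'
put 'users', 'user1', 'info:name', 'Alice'    # write one cell in row 'user1'
get 'users', 'user1'                          # read the whole row back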

2. MongoDB

• JSON-like document store.
• Scales horizontally via sharding.
• Common in real-time applications and content management.

// MongoDB shell example
// insertOne() replaces the older insert(), which is deprecated in the current shell
db.users.insertOne({ name: "Alice", age: 25 })
// find documents whose age is greater than 20
db.users.find({ age: { $gt: 20 } })
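
To show what a filter such as { age: { $gt: 20 } } means, here is a toy matcher in plain Python. It only illustrates the semantics of equality and a few comparison operators; the `matches` function and `OPERATORS` table are assumptions of this sketch and say nothing about how MongoDB evaluates queries internally.

```python
import operator

# Comparison operators supported by this toy matcher (a small subset, chosen for illustration).
OPERATORS = {"$gt": operator.gt, "$gte": operator.ge, "$lt": operator.lt, "$lte": operator.le}

def matches(document, query):
    """Return True if `document` satisfies every condition in `query`."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"age": {"$gt": 20}}
            if not all(OPERATORS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"name": "Alice"} means equality
            return False
    return True

users = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 19}]
print([user["name"] for user in users if matches(user, {"age": {"$gt": 20}})])   # ['Alice']
```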

3. Apache Cassandra

• Peer-to-peer architecture with high availability.
• Great for time-series data and write-heavy workloads.

-- CQL (Cassandra Query Language) example
-- user_id is the primary key and therefore also the partition key
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
);
-- uuid() generates a random identifier for the new row
INSERT INTO users (user_id, name, email) VALUES (uuid(), 'Bob', '[email protected]');

🔍 Benefits of NoSQL for Big Data

• Horizontal scalability: Add more servers easily.
• High performance: Optimized for fast reads/writes.
• Schema flexibility: Adapt to changing data structures.
• Resilience: Designed for distributed, fault-tolerant systems.

🔍 NoSQL vs RDBMS in Big Data

Feature | RDBMS (SQL) | NoSQL
Schema | Fixed | Flexible
Scalability | Vertical | Horizontal
Transactions | ACID | Often BASE
Suitability for Big Data | Limited | Excellent

🔍 When to Use NoSQL for Big Data

✅ You have huge volumes of data (TBs or PBs).
✅ Schema may evolve or is not fixed.
✅ You need fast, distributed read/write operations.
✅ You're dealing with unstructured or semi-structured data.
✅ High availability and scalability are priorities.
Question Bank for Module 3

3-Marks Questions (Short Answer Type):

1. Define NoSQL and explain how it differs from traditional RDBMS.
2. What are the key characteristics of a NoSQL data store?
3. Differentiate between document-based and column-based NoSQL databases.

4-Marks Questions (Longer Answer Type):

1. Discuss how NoSQL databases are suitable for managing Big Data. Provide examples.
2. Describe the architecture and key components of MongoDB.
3. Discuss NoSQL Data Architecture Patterns with suitable examples.

8-Marks Questions:

1. Explain the different types of NoSQL data stores. How do they differ from traditional
relational databases in terms of structure and scalability?
2. Discuss the NoSQL data architecture patterns commonly used in big data systems.
Support your answer with examples.
3. Explain the characteristics of NoSQL in detail.
