Unit 4 BDA
NoSQL (Not Only SQL) is a non-relational database used to manage unstructured data. It is a
distributed database system designed to work in virtual environments, providing mechanisms
for data storage and retrieval with a focus on scalability, high performance, availability, and
agility.
It was developed in response to the need to store a large volume of user-related data. NoSQL
databases are designed to scale easily and to handle products and objects that need to be
frequently accessed, updated, and changed, keeping up with the needs of the modern
industry.
Features of NoSQL databases:
1. Not Only SQL – SQL and other query languages can be used.
2. Non-relational and schema-free – No fixed structure required.
3. No JOINs – Avoids complex join operations.
4. Distributed architecture – Runs on multiple processors/nodes.
5. Horizontally scalable – Add more machines instead of upgrading one.
6. Open-source options – Many available for free.
7. Easy data replication – For better performance and backup.
8. Simple API usage – Easy to implement.
9. Handles huge volumes of data – Efficient at big data processing.
10. Can be run on commodity hardware – Follows shared nothing concept.
Why NoSQL?
A traditional database model is not suitable for all types of applications, especially those that require:
High performance
Flexible structure
Scalability
Capability to handle dynamic data
Although NoSQL may not provide full ACID (Atomicity, Consistency, Isolation, Durability)
properties, it guarantees BASE properties:
Basically Available
Soft State
Eventually Consistent
The CAP theorem states that a distributed system cannot guarantee all three of the following at
the same time:
Consistency – All nodes show the same data at the same time.
Availability – Every request gets a response (success or failure).
Partition Tolerance – System continues working despite network failure.
Basically Available – System responds to every request, even if the data is not consistent.
Soft State – System state can change over time even without input (due to eventual
consistency).
Eventually Consistent – All changes will eventually reflect across all nodes, but not
immediately.
CAP trade-offs, illustrated with two servers (A and B):
If data is consistent and available with no partition, then data is replicated and available in
both servers (A and B).
If data is available and partitioned, then it's not consistent. Example: Server A has new
data, B has old.
If data is consistent and partitioned, then it may not be available (B is waiting for update
from A).
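The two-server scenarios above can be sketched in a few lines of Python. This is only an illustration: the Replica class and the replicate function are hypothetical names, and replication is modelled as a simple copy rather than a real network protocol.

```python
class Replica:
    """One server holding a copy of the data (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


def replicate(source, target):
    """Copy all keys from source to target, modelling delayed synchronization."""
    target.data.update(source.data)


server_a = Replica("A")
server_b = Replica("B")

# A write lands on server A only; a partition delays replication to B.
server_a.data["balance"] = 100

# Available but not consistent: B still answers, but with old (missing) data.
stale_read = server_b.data.get("balance")

# Once the partition heals, replication brings B up to date
# (the "eventually consistent" part of BASE).
replicate(server_a, server_b)
fresh_read = server_b.data.get("balance")

print(stale_read, fresh_read)
```

The read taken before replication returns no value, while the read taken afterwards sees the value written on A.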
There are around 150 NoSQL databases in the market. Some popular NoSQL databases and related
big data technologies include:
Google BigTable
Apache Hadoop
MapReduce
SimpleDB
MemcacheDB
1. Volume:
Organizations now generate huge volumes of data. Relational databases often struggle at this
scale because they are limited by the performance of a single CPU. When dealing with large
datasets, distributed processing using clusters of commodity (low-cost) machines becomes necessary.
Apache Hadoop
HDFS
MapR
HBase
These systems break large data into smaller chunks and process them in parallel.
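The idea of breaking data into chunks and processing them in parallel can be sketched as follows. This is a minimal single-machine illustration, not how Hadoop or HBase is implemented: worker threads stand in for cluster machines, and the function names and sample lines are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Work done independently on each chunk."""
    return sum(len(line.split()) for line in chunk)


lines = [
    "big data needs clusters",
    "split the work",
    "process chunks in parallel",
    "combine results",
]

chunks = split_into_chunks(lines, 2)

# Each chunk is handed to a separate worker; pool.map keeps the chunk order.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results are combined into the final answer.
total_words = sum(partial_counts)
print(partial_counts, total_words)
```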
2. Velocity:
NoSQL systems handle high-speed, real-time read and write operations efficiently and keep
response times low, even during heavy traffic.
3. Variability:
Data often comes in different formats and structures. In RDBMS, changing the schema (table
design) for new data fields is difficult and can affect the entire system.
Example: If you want to store a special field for a few customers, you need to change the
entire table schema. This creates a sparse matrix (empty fields for others) and affects
performance.
NoSQL systems offer schema-less models, allowing storage of different kinds of data without
any rigid structure.
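The special-field example above can be made concrete with a small sketch. The customer records and the loyalty_tier field are invented for illustration. The documents keep only the fields they need, while a fixed table layout forces an empty cell for every customer that lacks the special field.

```python
# Schema-less style: each record carries only the fields it actually has.
customers = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi", "loyalty_tier": "gold"},  # special field for one customer
    {"id": 3, "name": "Meera"},
]

# Fixed-schema style: every row must have every column.
all_fields = sorted({field for doc in customers for field in doc})
rows = [[doc.get(field) for field in all_fields] for doc in customers]

# Count the cells that exist only because the schema demands them.
empty_cells = sum(cell is None for row in rows for cell in row)

print(all_fields)
print(rows)
print(empty_cells)
```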
4. Agility:
Handling complex queries in RDBMS requires multiple nested queries and object-relational
mapping (ORM) layers built with frameworks such as Hibernate (for Java). This slows down
development and updates.
1. 24x7 Availability
Read/write data from any location without knowing the physical location of the node
Data is synchronized across regions
Ensures fast local access and global availability
Scalability
Data distribution
Continuous availability
Support for multi-data centers
Key-Value Store
Column Store
Document Store
Graph Store
1. Key-Value Store
A key-value store stores data as a pair of key and value, just like a dictionary.
How it works:
Operation                  Description
Get(key)                   Retrieves value using the key
Put(key, value)            Stores or updates value with the key
Multi-Get(key1, key2...)   Retrieves multiple values
Delete(key)                Deletes the value for the key
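The four operations above map naturally onto a dictionary. The sketch below is an in-memory illustration only; the KeyValueStore class and the session keys are hypothetical and do not correspond to any particular product's API.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        """Store the value, replacing any existing value for the key."""
        self._items[key] = value

    def get(self, key):
        """Return the value for the key, or None if the key is absent."""
        return self._items.get(key)

    def multi_get(self, *keys):
        """Return the values for several keys, in the order requested."""
        return [self._items.get(key) for key in keys]

    def delete(self, key):
        """Remove the key and its value; a missing key is ignored."""
        self._items.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "priya"})
store.put("session:43", {"user": "arjun"})
store.put("session:42", {"user": "priya", "cart": 3})  # Put on an existing key updates it
store.delete("session:43")

print(store.multi_get("session:42", "session:43"))
```

The store never looks inside the value: it can only be fetched or replaced as a whole through its key.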
Rules:
Weaknesses:
Use Cases:
Caching
Session storage
Image stores
Dictionaries (word-definition pairs)
2. Column Store
A column store stores data in columns instead of rows. It is well suited to large, sparse datasets.
Key Concepts:
Structure Format:
Use Cases:
Analytics
Time-series data
IoT (Internet of Things) systems
Social media posts
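A rough way to picture the column-oriented layout is to pivot rows into one mapping per column. The sketch below uses invented sensor readings and is not the storage format of any specific column store. Missing values simply do not appear in a column, and an analytic query reads only the column it needs.

```python
rows = [
    {"sensor": "s1", "temp": 21.5, "humidity": 40},
    {"sensor": "s2", "temp": 22.0},                  # no humidity reading
    {"sensor": "s3", "temp": 20.5, "humidity": 42},
]

# Pivot: one mapping per column, keyed by row id. Absent values take no space.
columns = {}
for row_id, row in enumerate(rows):
    for name, value in row.items():
        columns.setdefault(name, {})[row_id] = value

# An aggregate over one column touches only that column's data.
temps = columns["temp"].values()
average_temp = sum(temps) / len(temps)

print(columns["humidity"])
print(average_temp)
```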
3. Document Store
A document store is like a smart key-value store, where the value is a document (usually in
JSON or XML format).
Features:
How it works:
Use Cases:
4. Graph Store
A graph store uses nodes and relationships to represent and store data.
It is based on graph theory.
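A minimal sketch of the node-and-relationship idea, assuming a tiny "follows" network with made-up names. The adjacency dictionary and the reachable_within helper are illustrative only; a real graph store also attaches properties to nodes and relationships and persists them.

```python
from collections import deque

# Nodes are people; each relationship points from a person to someone they follow.
relationships = {
    "Asha": ["Ravi", "Meera"],
    "Ravi": ["Kiran"],
    "Meera": ["Kiran"],
    "Kiran": [],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: nodes reachable from start in at most max_hops steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    seen.discard(start)
    return seen


direct = reachable_within(relationships, "Asha", 1)
two_hops = reachable_within(relationships, "Asha", 2)
print(sorted(direct), sorted(two_hops))
```

Traversals like this follow relationships directly, which is the kind of query that would otherwise need repeated joins in a relational model.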
Structure:
Key Benefits:
Use Cases:
1. Key-Value Store
2. Document Store
Can be used in IoT systems, where sensors push data into JSON-like documents.
4. Graph Store
1. Distributed Architecture
2. Federated Architecture
Healthcare systems
Integrate streams
Example:
1. Recommendation Systems
Used in:
NoSQL helps create personalized user experiences by managing large and fast-changing user
data in real-time. It supports flexible data models, making it ideal for building and updating
user profiles on the fly.
User preferences
Authentication
Online transactions
As user numbers grow, so does the complexity of data. NoSQL supports flexible schema and
quick read/write operations, making it perfect for managing evolving user profiles.
4. Content Management
Includes managing:
Relational databases struggle with unstructured data. NoSQL supports semi-structured and
unstructured data using a flexible schema, making it ideal for content-heavy applications.
5. Catalog Management
Large companies handle product/service catalogs with many categories and updates. NoSQL:
8. Fraud Detection
Financial services must detect fraud in milliseconds when processing transactions. NoSQL:
Logs record all events like clicks, errors, or transactions. Earlier, storing these was expensive.
Now, NoSQL allows:
Low-cost storage
Easy access and analysis
Architectural Models:
Cache Friendliness:
NoSQL makes data distribution easier by focusing on aggregates. Two key techniques:
1. Sharding
2. Replication
Some systems like Riak use both sharding and replication for optimal performance.
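How sharding and replication combine can be sketched as below. This is a simplified illustration under stated assumptions: the node names, the replica count, and the hash-modulo placement are all choices made for the example, and it is not a description of how Riak or any other product actually places data.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical cluster members
REPLICAS = 2                             # copies kept of every key


def shard_for(key, node_count):
    """Sharding: hash the key to pick the index of its primary node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def nodes_for(key):
    """Replication: the primary node plus the next nodes in the list."""
    first = shard_for(key, len(NODES))
    return [NODES[(first + i) % len(NODES)] for i in range(REPLICAS)]


cluster = {node: {} for node in NODES}


def put(key, value):
    """Write the value to every node responsible for the key."""
    for node in nodes_for(key):
        cluster[node][key] = value


put("user:1", {"name": "Priya"})
holders = [node for node in NODES if "user:1" in cluster[node]]
print(holders)
```

Sharding spreads different keys over different nodes, while replication keeps each key on more than one node so that a single node failure does not lose it.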
Introduction to MongoDB:
MongoDB is a NoSQL database designed to handle large amounts of unstructured or semi-
structured data. Unlike traditional databases that store data in rows and columns (tables),
MongoDB stores data in a document format using JSON-like structures, making it highly
flexible and scalable. In simple words: MongoDB is a database that stores information in a
format similar to JSON, allowing you to store complex data types easily and quickly without
worrying about strict table structures.
MongoDB is
1. Cross-platform
2. Open source
3. Non-relational
4. Distributed
5. NoSQL
6. Document-oriented data store
Terms used:
1. Database
Example: A database named collegeDB may contain collections like students, teachers, results.
2. Collection
3. Document
Example Document:
MongoDB provides support for dynamic queries using a rich query language based on JSON.
You can query using field values, ranges, conditions, pattern matching, etc.
No need for strict joins or SQL syntax.
Example:
This returns all students whose age is greater than 20. This feature is extremely useful in big
data environments, where the structure of data might change frequently.
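A query of this kind is expressed as a JSON-like filter; in the MongoDB shell it would look like db.students.find({ age: { $gt: 20 } }). The sketch below shows the same filter as a Python dictionary, with a tiny matcher standing in for the server so that the example runs without a database. The sample students and the matches function are illustrative assumptions, and the matcher supports only equality and $gt.

```python
students = [
    {"name": "Priya", "age": 22},
    {"name": "Arjun", "age": 19},
    {"name": "Neha", "age": 21, "hostel": "B"},  # documents may differ in shape
]

# The filter document: "age greater than 20".
query = {"age": {"$gt": 20}}


def matches(document, query):
    """Tiny stand-in for query matching; supports equality and $gt only."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for operator, operand in condition.items():
                if operator != "$gt":
                    raise ValueError(f"unsupported operator: {operator}")
                if value is None or not value > operand:
                    return False
        elif value != condition:
            return False
    return True


older_than_20 = [doc["name"] for doc in students if matches(doc, query)]
print(older_than_20)
```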
MongoDB allows storing binary data (files, images, videos) using a feature called GridFS.
Useful when you need to store files larger than 16 MB, the maximum size of a single document.
MongoDB splits the files into smaller chunks and stores them across multiple documents.
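The chunking idea can be sketched without a database. The code below is an illustration, not the GridFS implementation: the function names are hypothetical, the chunk size is deliberately tiny, and each chunk is a plain dictionary whose sequence number n is used to put the file back together.

```python
def split_file(data, chunk_size):
    """Split bytes into numbered chunk documents of at most chunk_size bytes."""
    return [
        {"n": index, "data": data[offset:offset + chunk_size]}
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def join_chunks(chunks):
    """Reassemble the original bytes by ordering chunks on their sequence number."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


payload = bytes(range(256)) * 10          # 2560 bytes standing in for a large file
chunks = split_file(payload, 1024)        # illustrative chunk size

print(len(chunks), [len(chunk["data"]) for chunk in chunks])
print(join_chunks(reversed(chunks)) == payload)
```

Because every chunk carries its own sequence number, the chunks can be stored and fetched in any order and still be reassembled correctly.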
Used in:
6. Replication
Helps in:
Data recovery
System fault-tolerance
Load balancing of read operations
7. Sharding
This updates only the age field in Priya's document in place, without touching other fields.
Useful in real-time apps where data changes frequently (e.g., live dashboards, IoT feeds).
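An in-place update of this kind is written with the $set operator, for example { $set: { age: 22 } } applied to the document that matches { name: "Priya" }. The sketch below mimics the effect of $set on a plain Python dictionary so that it runs without a database. The apply_set helper and the sample field values are assumptions made for illustration.

```python
def apply_set(document, update):
    """Mimic $set: overwrite only the listed fields, leave the rest untouched."""
    for field, value in update.get("$set", {}).items():
        document[field] = value
    return document


priya = {"name": "Priya", "age": 21, "course": "BDA"}

# Only the age field is named in the update, so only age changes.
apply_set(priya, {"$set": {"age": 22}})
print(priya)
```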
Datatypes in MongoDB:
1. Double:
3. String:
4. Boolean:
5. Date:
6. ObjectId:
Example:
8. Array:
9. Binary Data:
11. Code