
(Approved by AICTE, affiliated to Anna University & Accredited by NBA)


DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
INTERNAL ASSESSMENT TEST – II – Answer Key
Sem & Branch: V / CSE(A) Subject: CCS334-BIG DATA ANALYTICS
Part-A
1. Why are NoSQL databases known as Schemaless Databases? (K3)(CO2)
NoSQL databases are called "schemaless" because they do not require a fixed, predefined
schema, allowing for flexible and dynamic data models where records can have varying
structures.
2. Define MemTable. (K1)(CO2)
A MemTable is an in-memory data structure used in certain NoSQL databases (such as HBase
or Cassandra) to store write operations (like inserts or updates) temporarily before they are
flushed to disk. When the MemTable reaches a certain size, its data is written to a persistent
storage file (like an SSTable or HFile). It plays a key role in optimizing write performance by
reducing direct disk writes.
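For illustration, a minimal Python sketch of this write path, assuming a toy key-value model (the class, flush threshold, and file names are illustrative, not taken from HBase or Cassandra):

import json

class MemTable:
    """Simplified in-memory write buffer, flushed to disk when full."""

    def __init__(self, flush_threshold=3):
        self.data = {}                      # buffered writes; sorted on flush
        self.flush_threshold = flush_threshold
        self.sstable_count = 0

    def put(self, key, value):
        self.data[key] = value              # writes stay in memory first
        if len(self.data) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered data to a persistent, sorted file (an 'SSTable')."""
        path = f"sstable_{self.sstable_count}.json"
        with open(path, "w") as f:
            json.dump(dict(sorted(self.data.items())), f)
        self.sstable_count += 1
        self.data.clear()                   # start a fresh MemTable

mt = MemTable()
mt.put("user:1", "Alice")
mt.put("user:2", "Bob")
mt.put("user:3", "Carol")                  # third write triggers a flush to disk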
3. Enumerate the term Graph Analytics. (K2)(CO2)
Graph Analytics refers to the process of analyzing and extracting insights from data that is
represented in the form of a graph, where data entities are nodes (or vertices) and their
relationships are edges. It is used to analyze complex relationships, patterns, and structures in
data. Key components of graph analytics include:
1. Node Analysis
2. Edge Analysis
3. Pathfinding
4. Community Detection
5. Centrality Measures
6. Graph Traversal
Graph analytics is widely used in areas like social networks, fraud detection, recommendation
systems, and network optimization.
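As a small example of the pathfinding and traversal components, a breadth-first search over a toy adjacency list (the graph below is hypothetical):

from collections import deque

# Toy social graph: nodes are users, edges are "follows" relationships.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def shortest_path(start, goal):
    """Breadth-first search: a basic pathfinding/traversal primitive."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("alice", "dave"))   # ['alice', 'bob', 'dave']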
4. What is NoSQL database? (K1)(CO2)
A NoSQL database is a type of non-relational database that allows for flexible, schema-less
data storage, designed to handle large-scale, unstructured, or semi-structured data. It supports a
variety of data models such as document, key-value, column-family, and graph, making it
highly scalable and adaptable to different data needs.
5. List out three business challenges in an organization. (K2)(CO2)
Three common business challenges organizations face are:
1. Managing Growth: Scaling operations, infrastructure, and resources effectively while
maintaining quality and efficiency can be challenging as a company expands.
2. Adapting to Market Changes: Rapid shifts in market trends, consumer behavior, or
technological advancements require businesses to be flexible and responsive.

3. Talent Acquisition and Retention: Finding, developing, and retaining skilled employees is
critical to sustaining business performance and innovation.
6. Point out the aspects of adopting Big Data technologies. (K2)(CO2)
Adopting Big Data technologies involves aspects such as:
1. Data Management – Efficiently handling large volumes, variety, and velocity of data.
2. Scalability – Ensuring systems can grow with data needs.
3. Data Analytics – Leveraging advanced analytics for insights and decision-making.
4. Cost Efficiency – Managing infrastructure and processing costs effectively.
5. Security and Compliance – Ensuring data privacy, security, and regulatory adherence.
Part-B
7. (i) Explain Schemaless Database in detail. (8)(K2)(CO2)
A schemaless database refers to a type of database, typically associated with NoSQL databases, that
does not enforce a predefined or rigid schema for the structure of its data. Unlike relational databases
(RDBMS), where a schema must be defined in advance (specifying tables, columns, and data types),
schemaless databases provide more flexibility in how data is stored, allowing for unstructured or semi-
structured data to be used without constraints.
Key Features of Schemaless Databases:
1. No Fixed Structure:
In traditional databases, every record must adhere to a strict schema defined ahead of time. Schemaless
databases allow records (e.g., documents) within the same collection to have varying fields, data types,
and structures.
For example, in a document database like MongoDB, one document can contain fields name, age, and
email, while another document in the same collection can have just name and phone_number.
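A minimal sketch of this with PyMongo, assuming a MongoDB instance is running locally (the database and collection names are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local instance
users = client["demo"]["users"]

# Two documents in the same collection with different fields --
# no schema definition or migration is required.
users.insert_many([
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "phone_number": "+91-99999-00000"},
])

for doc in users.find({}, {"_id": 0}):
    print(doc)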
2. Flexible and Dynamic Data Models:
Data models in schemaless databases can evolve dynamically. Developers can add new fields or
attributes to records without having to modify a global schema or migrate data. This flexibility is
especially useful in environments with evolving data requirements.
3. Supports Unstructured and Semi-Structured Data:
Schemaless databases are designed to handle unstructured data (such as logs, social media posts, and
IoT sensor data) and semi-structured data (such as JSON, XML, or YAML formats).
This makes them ideal for applications that deal with diverse, unpredictable, or rapidly changing data
sets.
4. High Scalability:
Many schemaless databases are horizontally scalable, meaning they can distribute data across multiple
nodes or servers, enabling them to handle large-scale data efficiently. This is a critical advantage for
big data applications and real-time data processing.
5. Variety of Data Models:
Schemaless databases can support a variety of data models depending on the use case. The most
common types include:
Document databases (e.g., MongoDB, Couchbase): Store data as documents, often in formats like
JSON or BSON.
Key-Value stores (e.g., Redis, Amazon DynamoDB): Store data as simple key-value pairs.
Column-family stores (e.g., Cassandra, HBase): Store data in columns rather than rows, optimizing for
reading and writing large amounts of data.

Graph databases (e.g., Neo4j): Represent data as nodes (entities) and edges (relationships) in a graph
structure.
Benefits of Schemaless Databases:
1. Agility and Flexibility:
Developers can quickly adapt the database structure to changing requirements without needing
complex migrations or database redesigns.
2. Faster Iteration:
New features and fields can be added to the database without downtime or changes to the overall
schema, enabling rapid development cycles.
3. Scalability:
Schemaless databases are often designed for distributed environments, making them better suited for
handling large datasets and providing high availability across multiple servers or regions.
4. Efficient for Unpredictable Data:
They are well-suited for use cases where the structure of data is not known in advance, such as when
dealing with data from social media, IoT devices, or rapidly changing business needs.
Drawbacks of Schemaless Databases:
1. Lack of Data Integrity:
Since there is no enforced schema, it is up to the application to ensure consistency and integrity of data.
This can lead to issues if data is inserted with inconsistent or missing fields.
2. Complex Queries:
While NoSQL databases are optimized for certain types of queries (like key-value lookups or
document retrievals), they can be less efficient for complex queries involving joins, aggregations, or
multi-table relationships, which are easily handled by relational databases.
3. Data Duplication:
Schemaless databases may require denormalization (i.e., repeating the same data in multiple places),
which can lead to data redundancy and larger storage requirements.
4. Learning Curve:
The flexibility of schemaless databases requires developers to take more responsibility for data
modeling, consistency, and query optimization. This can introduce a learning curve for teams
transitioning from relational databases.
Examples of Schemaless Databases:
MongoDB: A document-oriented NoSQL database that stores data in flexible, JSON-like documents.
Couchbase: Another document-oriented database optimized for distributed data and scalability.
Cassandra: A column-family store designed for high scalability and large datasets.
Redis: A key-value store known for its high performance, often used for caching, session management,
and real-time analytics.
In summary, schemaless databases offer significant flexibility and scalability, making them ideal for
applications with dynamic or unstructured data. However, this comes at the cost of requiring more
careful data management and query optimization compared to traditional relational databases.
(ii)Elaborate Master-Slave replication in big data distributed system. (8)(K2)(CO2)
Master-Slave replication is a model for data replication often used in distributed systems, especially in
big data environments, to ensure data availability, fault tolerance, and consistency. In this
model, the master node holds the original copy of the data, while the slave nodes hold replicas.
Here’s a detailed breakdown:
1. Architecture Overview

 Master Node:
o Holds the authoritative copy of the data.
o Manages write operations (insert, update, delete) and is responsible for propagating
those changes to the slave nodes.
o Can coordinate or manage the entire cluster by keeping track of metadata, replication
processes, or coordination tasks.
 Slave Nodes:
o Hold replicated copies of the master’s data and are used for read operations.
o They receive updates from the master to maintain consistency with the master’s data.
o Often distributed across different locations to enhance availability and fault tolerance.
 Client Interactions:
o Clients may interact directly with the master for write operations.
o Read operations, which are often more frequent, can be distributed across slave nodes to
reduce the load on the master and to improve read performance.
2. Data Flow and Synchronization
 Write Process:
o All write operations go through the master. The master ensures data integrity, performs
any necessary operations, and updates the data.
o Once the write is processed, the master propagates the changes to the slaves through
replication mechanisms (e.g., logs, snapshots).
 Replication Modes:
o Synchronous Replication: The master waits for an acknowledgment from slaves before
committing the transaction, ensuring strong consistency. However, this can introduce
latency.
o Asynchronous Replication: The master immediately commits the transaction without
waiting for confirmation from slaves. This improves performance but can lead to
temporary inconsistencies (eventual consistency).
 Consistency: In large distributed systems, eventual consistency is often employed, where
slave nodes might lag slightly behind the master but eventually reach a consistent state.
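A toy Python sketch of the two replication modes, under the simplifying assumption that nodes are plain in-process objects (all class and method names here are illustrative):

class Node:
    """A storage node holding a simple key-value dictionary."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class Master(Node):
    def __init__(self, slaves):
        super().__init__("master")
        self.slaves = slaves
        self.pending = []                   # changes not yet replicated

    def write(self, key, value, synchronous=True):
        self.store[key] = value
        if synchronous:
            # Synchronous: replicate to every slave before acknowledging.
            for slave in self.slaves:
                slave.store[key] = value
        else:
            # Asynchronous: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def replicate_pending(self):
        for key, value in self.pending:
            for slave in self.slaves:
                slave.store[key] = value
        self.pending.clear()

slaves = [Node("slave-1"), Node("slave-2")]
master = Master(slaves)
master.write("x", 1, synchronous=False)
print(slaves[0].store)          # {} -- slave is stale until replication runs
master.replicate_pending()
print(slaves[0].store)          # {'x': 1}

In the asynchronous case the slave serves stale data until replicate_pending() runs, which is exactly the eventual-consistency window described above.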
3. Advantages of Master-Slave Replication
 Improved Read Performance: By offloading read operations to slave nodes, the system can
handle a higher number of read requests without burdening the master.
 Fault Tolerance: In the event of master node failure, the system can promote a slave to
become the new master, ensuring high availability. Additionally, having multiple copies of data
across slave nodes protects against data loss.
 Scalability: As data volume grows, more slaves can be added to handle increased read traffic,
allowing the system to scale horizontally.
4. Challenges and Limitations

 Single Point of Failure (Master): The master node becomes a potential bottleneck and a single
point of failure, although mechanisms like master failover or master election can mitigate this.
 Replication Latency: In asynchronous replication, there is a delay between when data is
written to the master and when it is replicated to slaves, leading to possible inconsistencies.
 Write Bottleneck: All writes must pass through the master, which can limit scalability for
write-heavy workloads.
 Slave Staleness: Since slaves are passive in the write process, they can lag behind the master
and serve slightly outdated data in read operations.
8. (i) Neatly explain about Consistency, Update Consistency, Read Consistency, Quorums and
Relaxing Durability. (10)(K2)(CO2)
1. Consistency
In distributed systems, consistency refers to the guarantee that all nodes (servers, replicas) in the
system see the same data at the same time. It ensures that once data is written, all subsequent reads
return the latest written value.
Types of Consistency Models:
 Strong Consistency: Every read returns the most recent write across the system. This model
ensures that there are no outdated or stale reads but often comes with trade-offs in performance
and availability (CAP theorem).
 Eventual Consistency: Nodes may temporarily have different versions of data, but given
enough time (without new updates), all nodes will converge to the same value. Eventual
consistency is common in systems where high availability and partition tolerance are
prioritized.
2. Update Consistency
Update consistency refers to how updates (write operations) are propagated across replicas in a
distributed system to ensure data consistency.
 In synchronous replication, the system waits for updates to be applied to all replicas before
acknowledging the write. This ensures strong update consistency but can introduce latency.
 In asynchronous replication, the system acknowledges the write as soon as it is committed on
the primary node, without waiting for replicas to confirm. This leads to eventual consistency
but increases performance at the risk of temporary inconsistencies across nodes.
Update consistency strategies directly affect how soon all replicas see and agree on the updated data
after a write.
3. Read Consistency
Read consistency refers to the guarantees that a system provides about the value returned by a read
operation.
 Strong Read Consistency: Every read reflects the most recent write. This is typical in systems
that enforce synchronous replication where read-after-write consistency is maintained across all
nodes.
 Read-Your-Writes Consistency: A weaker form of consistency where a system guarantees
that after a user performs a write, any subsequent reads by the same user will reflect that write,
even if other users may see stale data.
 Eventual Read Consistency: The system may return stale data from a replica, but over time,
all nodes will converge to the correct value. This is common in distributed systems optimized
for high availability, like those using asynchronous replication.

The trade-offs between strong read consistency and eventual read consistency revolve around latency
and performance.
4. Quorums
Quorums are a mechanism used in distributed systems to ensure consistency and fault tolerance. In a
quorum-based system, any operation (read or write) must reach a certain number of nodes to be
considered successful. The quorum is typically a majority of nodes, which allows the system to
maintain consistency even if some nodes are unavailable.
 Write Quorum (W): The minimum number of nodes that must acknowledge a write operation
for it to be considered successful. If W is large, the system leans toward stronger consistency,
but at the cost of write performance.
 Read Quorum (R): The minimum number of nodes that must be contacted during a read
operation. If R is large, the system ensures stronger read consistency since more replicas are
checked.
For example, in a system with 5 replicas, if W = 3 and R = 3, the system ensures strong consistency
because both the read and write quorums overlap. This guarantees that at least one replica in the read
quorum will have the most recent write.
 Quorum Rules:
o R + W > N, where N is the total number of replicas, ensures strong consistency.
o W > N/2 ensures that every write reaches a majority of nodes.
o R > N/2 ensures that reads always access a majority of up-to-date nodes.
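These rules can be checked mechanically; a small sketch (the helper name is illustrative):

def quorum_properties(n, w, r):
    """Report which guarantees a (N, W, R) configuration gives."""
    return {
        "strong_consistency": r + w > n,    # read and write sets must overlap
        "write_majority": w > n / 2,        # writes reach a majority of nodes
        "read_majority": r > n / 2,         # reads reach a majority of nodes
    }

# The example from the text: N = 5 replicas, W = 3, R = 3.
print(quorum_properties(5, 3, 3))
# {'strong_consistency': True, 'write_majority': True, 'read_majority': True}

# A faster but weaker configuration: acknowledge after one replica.
print(quorum_properties(5, 1, 1))
# {'strong_consistency': False, 'write_majority': False, 'read_majority': False}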
5. Relaxing Durability
Durability in distributed systems refers to the guarantee that once a write is acknowledged, it will
persist and survive any subsequent failures. Relaxing durability means loosening this guarantee to
improve performance or availability.
 Relaxed Durability allows a system to acknowledge a write before it has been fully persisted
across all nodes or storage devices. This can improve throughput and reduce write latency but
comes with a risk of data loss in the event of a failure.
 Trade-Off: Relaxing durability often occurs in systems that prioritize availability and
performance over strict consistency (for example, in high-throughput NoSQL databases). This
may be acceptable for systems where losing a small amount of recent data during a crash is not
critical, but it is unacceptable in systems where data integrity must be guaranteed (e.g., banking
systems).
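The trade-off can be illustrated at the file level in Python: forcing an fsync before acknowledging is the strict behaviour, while acknowledging after a buffered write is the relaxed one (function names are illustrative):

import os

def durable_write(path, data):
    """Strict durability: force data to stable storage before acknowledging."""
    with open(path, "a") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # blocks until the OS has persisted the bytes
    return "ack"

def relaxed_write(path, data):
    """Relaxed durability: acknowledge once data is in the OS page cache.
    Faster, but a crash before the cache is flushed can lose the write."""
    with open(path, "a") as f:
        f.write(data)
    return "ack"

durable_write("wal.log", "put x=1\n")
relaxed_write("wal.log", "put y=2\n")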
(ii) Discuss about Materialized Views. (6)(K2)(CO2)
1. Materialized View Overview
A materialized view is similar to a regular database view in that it is based on a query that pulls data
from one or more base tables. However, unlike a normal view (which only defines the query but
doesn’t store any data), a materialized view stores the actual query results. This means that a
materialized view:
 Physically stores data derived from the base tables.
 Does not need to recompute the result set each time it is queried, allowing for faster read operations.
Materialized views are typically refreshed at specific intervals or on-demand to stay in sync with the
underlying data.

2. How Materialized Views Work
 Creation: A materialized view is created by executing a query (e.g., an aggregation, join, or
complex transformation) on one or more base tables. The result of this query is stored in the
database.
 Data Storage: The results are stored in a physical table, so when a user queries the materialized
view, the database reads the precomputed results instead of executing the underlying query
again.
 Refresh Mechanisms:
o Complete Refresh: The entire materialized view is recomputed and rewritten with fresh
data from the base tables. This can be costly for large datasets.
o Incremental (Fast) Refresh: Only the changes (inserts, updates, deletes) that have
occurred in the base tables since the last refresh are applied to the materialized view.
This is more efficient but requires tracking changes in the base tables (often through
logging mechanisms).
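A toy Python sketch contrasting the two refresh mechanisms, assuming a single in-memory 'orders' base table (all names are illustrative):

orders = [                       # base table
    {"customer": "alice", "amount": 120},
    {"customer": "bob",   "amount": 80},
]

# "Materialized view": precomputed total spend per customer.
totals = {}

def complete_refresh():
    """Recompute the whole view from the base table."""
    totals.clear()
    for row in orders:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]

def incremental_refresh(new_rows):
    """Apply only the changes made since the last refresh."""
    for row in new_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    orders.extend(new_rows)

complete_refresh()
print(totals)                    # {'alice': 120, 'bob': 80}

incremental_refresh([{"customer": "alice", "amount": 50}])
print(totals)                    # {'alice': 170, 'bob': 80} -- no full recompute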
3. Benefits of Materialized Views
 Improved Query Performance: Since the data is precomputed and stored, querying a
materialized view is much faster than running a complex query on the base tables, especially
for operations like joins, aggregations, and filtering.
 Reduced Load on Base Tables: Materialized views help offload query processing from the
underlying base tables, which is particularly beneficial in systems where large datasets or
frequent queries would otherwise cause high load on the base tables.
 Support for Complex Queries: Materialized views are especially useful for storing the results
of computationally expensive operations, such as:
o Aggregations (e.g., SUM, COUNT, AVG).
o Multi-table joins.
o Data transformations (grouping, filtering, etc.).
 Snapshot of Data: A materialized view can act as a snapshot of the data at a certain point in
time, which is useful for reporting and historical analysis.
4. Challenges with Materialized Views
 Staleness of Data: Materialized views are static until they are refreshed. If the underlying base
tables change frequently, there’s a risk that the materialized view may not reflect the latest data
unless refreshed promptly. This can lead to data inconsistency.
 Maintenance Overhead: Maintaining and refreshing materialized views adds complexity and
overhead to the system. Deciding the refresh frequency (manual, scheduled, or event-based)
involves a trade-off between freshness and performance.
 Storage Costs: Since the materialized view stores data physically, it consumes additional
storage space. This could be significant for large datasets or frequent refresh operations.
 Refresh Performance: For very large datasets, refreshing the materialized view (especially full
refreshes) can be computationally expensive and time-consuming.
5. Materialized View Refresh Strategies
 Manual Refresh: The materialized view is updated only when explicitly told to refresh by the
user or application.
 Scheduled (Periodic) Refresh: The materialized view is refreshed at regular intervals, such as
hourly, daily, or weekly. This is common in systems where real-time data is not critical.

 Trigger-Based (Event-Driven) Refresh: The materialized view is refreshed when specific
events occur, such as an update or insert in the base tables. This can be done using database
triggers or similar mechanisms.
 Fast Refresh: Fast refreshes are used to update the materialized view with only the incremental
changes (rather than refreshing the entire dataset). Fast refresh requires that the database tracks
the changes to the base tables (often using logs).
9. What is NoSQL? What are the advantages of NoSQL? Explain the types of NoSQL
Databases. (16)(K2)(CO2)
NoSQL is a type of database management system (DBMS) that is designed to handle
and store large volumes of unstructured and semi-structured data. Unlike traditional
relational databases that use tables with predefined schemas to store data, NoSQL
databases use flexible data models that can adapt to changes in data structures and are capable
of scaling horizontally to handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the
term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to
include a wide range of different database architectures and data models.
Types of NoSQL database
There are multiple types of NoSQL databases. 4 of the most common NoSQL
databases are:
1. Document databases: A collection of documents, where each document is in
JSON or a JSON-like format. Each document contains pairs of fields and values.
The primary copy sits in the storage layer and is cached in memory for fast access.
Examples – MongoDB, CouchDB, Cloudant
2. Key-value stores: Similar to Python dictionaries; queries either look up a
key directly or search through the entire database. Key-value stores tend to
operate in memory, with a backing store behind them.
Examples – Memcached, Redis, Coherence
3. Wide column databases: Similar to relational database tables, except that the
storage on the backend is organized differently. A SQL layer can be put on top of
a wide column database, which makes querying it very similar to querying a
relational database. Examples – HBase, Bigtable, Accumulo
4. Graph databases: Store data as nodes (vertices) and relationships (edges).
Vertices typically store object information, while edges represent the
relationships between nodes. Graph databases often offer a SQL-like query
language of their own. Examples – Amazon Neptune, Neo4j
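As a quick illustration of the key-value model, a sketch using the redis-py client, assuming a Redis server on the default local port (the key and value below are made up):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Key-value access: query by key, much like a Python dictionary.
r.set("session:42", "yash.engineering")
r.expire("session:42", 3600)       # typical cache/session pattern: TTL in seconds
print(r.get("session:42"))         # 'yash.engineering'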
NoSQL databases are often used in applications where there is a high volume of data
that needs to be processed and analyzed in real-time, such as social media analytics, e-
commerce, and gaming. They can also be used for other applications, such as content
management systems, document management, and customer relationship management.
NoSQL, originally referring to "non-SQL" or "non-relational", describes databases that provide a
mechanism for storage and retrieval of data modeled in means other than the
tabular relations used in relational databases. Such databases came into existence in the
late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early
twenty-first century. NoSQL databases are used in real-time web applications and big data,
and their use is increasing over time.

• NoSQL databases, also known as "not only SQL" databases, are a newer type of
database management system that has gained popularity in recent years due to
their scalability and flexibility. Unlike traditional relational databases, NoSQL
databases are designed to handle large amounts of unstructured or semi-structured
data, and they can accommodate dynamic changes to the data model.
• This makes NoSQL databases a good fit for modern web applications, real-time
analytics, and big data processing.
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations or schema
alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding
more nodes to a database cluster, making them well-suited for handling large
amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a
document-based data model, where data is stored in a semi-structured
format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data
model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-
based data model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be
highly available and to automatically handle node failures and data replication
across multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a
flexible and dynamic manner, with support for multiple data types and changing
data structures.
8. Performance: NoSQL databases are optimized for high performance and can
handle a high volume of reads and writes, making them suitable for big data and
real-time applications.
Advantages of NoSQL: There are many advantages of working with NoSQL
databases such as MongoDB and Cassandra. The main advantages are high scalability
and high availability.
1. Schema Agnostic: NoSQL databases do not require a specific schema or
storage structure, unlike traditional RDBMS.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-
structured data, which means that they can accommodate dynamic changes to the
data model.
3. High availability: The auto-replication feature in NoSQL databases makes them
highly available, because in case of any failure data replicates itself back to the previous
consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can
handle large amounts of data and traffic with ease. This makes them a good fit for
applications that need to handle large amounts of data or traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data
and traffic, which means that they can offer improved performance compared to
traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than
traditional relational databases, as they are typically less complex and do not
require expensive hardware or software.
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization: There are many different types of NoSQL
databases, each with its own unique strengths and weaknesses. This lack of
standardization can make it difficult to choose the right database for a specific
application.

2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant,
which means that they do not guarantee the consistency, integrity, and
durability of data. This can be a drawback for applications that require
strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus: they are mainly designed
for storage and provide relatively little functionality beyond it. Relational databases are a better
choice in the field of Transaction Management than NoSQL.
4. Open-source: NoSQL databases are open-source, and there is no reliable standard
for NoSQL yet. In other words, two database systems are likely to be unequal.
5. Lack of support for complex queries: NoSQL databases are not designed to
handle complex queries.
6. Lack of maturity: NoSQL databases are relatively new and lack the maturity
of traditional relational databases. This can make them less reliable and less secure
than traditional databases.
7. Management challenge: The purpose of big data tools is to make the
management of a large amount of data as simple as possible. But it is not so
easy. Data management in NoSQL is much more complex than in a relational
database. NoSQL, in particular, has a reputation for being challenging to install and
even more hectic to manage on a daily basis.
8. GUI is not available: GUI mode tools to access the database are not flexibly
available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases like
MongoDB. MongoDB has no approach for the backup of data in a consistent
manner.
10. Large document size: Some database systems like MongoDB and
CouchDB store data in JSON format. This means that documents are quite large
(costing network bandwidth and speed at big data scale), and having descriptive
key names actually hurts since they increase the document size.
Part-C (Compulsory)
10. Provide a conclusion by presenting insights into the distinct factors that organizations should
carefully evaluate when choosing between MongoDB and Cassandra to meet the specific
requirements of their applications. Discuss the same. (16)
(K2)(CO2)

When choosing between MongoDB and Cassandra to meet the specific requirements of their
applications, organizations need to carefully evaluate several factors, each of which aligns with the
nature of their use cases, performance needs, and scalability requirements. Below are the distinct
factors that should guide the decision-making process:
1. Data Model & Use Case
 MongoDB:
o Best suited for document-oriented data (JSON-like structures). It is ideal for
applications where data is semi-structured, hierarchical, or has varying attributes, such
as content management systems, catalogues, or real-time analytics dashboards.
o MongoDB allows for schema flexibility, making it a good fit for dynamic applications
where the schema might evolve over time.

 Cassandra:
o Focuses on wide-column store data models and excels in handling large-scale data with
simple, fixed schema patterns. Cassandra is ideal for use cases involving time-series
data, IoT, real-time logging, or sensor data, where data is written once and seldom
updated.
o It is best suited for high write-throughput scenarios where vast amounts of data need to
be ingested quickly, such as messaging systems or event logging.
2. Consistency Model
 MongoDB:
o Strong consistency by default on a per-document level. This makes MongoDB a good
choice when applications require strict consistency guarantees and prefer having the
latest version of the data for each read.
o Replica sets enable high availability and automatic failover, but consistency trade-offs
occur when using MongoDB’s sharding mechanisms.
 Cassandra:
o Eventual consistency by default. However, Cassandra provides tunable consistency
levels, allowing organizations to adjust the consistency guarantees based on their
application’s requirements (e.g., QUORUM, ONE, or ALL reads/writes).
o Best suited for use cases where high availability and partition tolerance are more
important than strong consistency, such as social media platforms or distributed sensor
networks.
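A sketch of tunable consistency with the DataStax Python driver, assuming a reachable Cassandra node and an existing 'demo' keyspace with an 'events' table (both names are assumptions for illustration):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # assumes a reachable Cassandra node
session = cluster.connect("demo")           # assumes a 'demo' keyspace exists

# Tunable consistency: require a quorum of replicas for this read.
query = SimpleStatement(
    "SELECT * FROM events WHERE device_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(query, ("sensor-1",)):
    print(row)

Lowering the level to ConsistencyLevel.ONE trades consistency for latency, which is the tunability described above.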
3. Scalability and Performance
 MongoDB:
o Horizontal scaling is possible through sharding (distributing data across multiple
servers), but managing shard balancing can add complexity.
o MongoDB excels in read-heavy workloads, especially when indexed properly.
However, it may not scale as efficiently for write-intensive workloads as Cassandra.
o Ideal for applications that prioritize flexible query patterns and require secondary
indexes.
 Cassandra:
o Designed for massive horizontal scaling, Cassandra is a distributed database built to
handle large amounts of data across multiple data centers with no single point of failure.
o It is optimized for write-heavy applications that require constant, high-speed data
ingestion and minimal latency across geographically distributed clusters.
o Cassandra is highly performant for sequential writes and can handle workloads that
require scaling across thousands of nodes.
4. High Availability and Fault Tolerance
 MongoDB:
o MongoDB provides replication through replica sets, where one node acts as the primary
and the others act as secondaries. If the primary node fails, an automatic failover
process elects a new primary.
o This setup works well for applications that require single-region deployments or can
tolerate a bit more complexity in multi-region settings.
 Cassandra:
o Designed with built-in fault tolerance and high availability in mind, Cassandra offers
true multi-region support with the ability to replicate data across multiple data centers.
Even in the event of node failures, it continues to function without data loss.

o No master-slave architecture, which means that all nodes are equal and any node can
accept read and write requests, ensuring continuous availability even during failures.
5. Querying & Indexing
 MongoDB:
o MongoDB provides rich query capabilities, including filtering, aggregation, and
secondary indexes, making it highly flexible for ad hoc queries.
o MongoDB’s text-based search and geospatial querying make it suitable for applications
that require complex search and filtering, such as e-commerce, social networks, and
content-heavy platforms.
 Cassandra:
o Cassandra’s querying capabilities are more limited, focusing on read and write
operations by partition key. It does not natively support complex queries or secondary
indexes as effectively as MongoDB.
o It’s more suitable for use cases that involve predictable access patterns and where data
is retrieved based on primary key lookups (e.g., time-series data or logs).
6. Consistency, Availability, and Partition Tolerance (CAP Theorem)
 MongoDB:
o Prioritizes Consistency and Partition Tolerance but can sacrifice Availability in some
configurations (especially in partitioned networks).
o Organizations should consider MongoDB when strong consistency is crucial and data
updates need to be reflected immediately.
 Cassandra:
o Prioritizes Availability and Partition Tolerance at the expense of strong consistency.
Eventual consistency allows it to offer high availability and low-latency reads and
writes.
o Best for applications that require continuous availability across globally distributed data
centres, where downtime is not an option (e.g., telecommunications, finance, and
large-scale web services).
7. Community, Ecosystem, and Support
 MongoDB:
o Has a strong and active community with broad enterprise support through MongoDB
Atlas and MongoDB Enterprise Edition. MongoDB offers a wide range of tools for
management, monitoring, and security.
o MongoDB is highly suitable for organizations that need professional support, managed
services, or enterprise-grade features such as advanced security, compliance, and
real-time analytics.
 Cassandra:
o Supported by Apache Cassandra and DataStax, with a strong community focused on
large-scale deployments. It is open-source, and commercial support is available through
DataStax.
o Best suited for organizations that have experienced database teams and need a highly
customizable, open-source solution with the ability to fine-tune deployments at scale.
