Big Data
Q1. (a) What is XPath? Explain the data model for XPath with a suitable
example. Briefly explain the qualifier in XPath.
(b) Explain Skolem Function in the context of pattern matching in XML
query language with suitable examples.
(c) Briefly explain the pattern-matching-based implementation in
semi-structured data query languages. [5+5+2]
Answer:
(a) What is XPath? Explain the data model for XPath with a suitable
example. Briefly explain the qualifier in XPath.
XPath:
XPath (XML Path Language) is a query language used for selecting nodes from an XML
document. It provides a way to navigate through elements and attributes in an XML document
by specifying paths. XPath is widely used in conjunction with XSLT (Extensible Stylesheet
Language Transformations), XQuery, and other XML technologies.
XPath operates on the XML data model, which treats an XML document as a tree structure.
Each node in this tree can be an element, attribute, text, comment, or processing
instruction. The root of the tree is the root element of the document, and each element can
have children (sub-elements).
Example:
<bookstore>
<book>
<title>XML Basics</title>
<author>John Doe</author>
<price>29.99</price>
</book>
<book>
<title>XPath for Beginners</title>
<author>Jane Smith</author>
<price>19.99</price>
</book>
</bookstore>
● /bookstore/book/title: This selects all title elements that are children of book
elements within the bookstore element.
Qualifier in XPath:
A qualifier in XPath is used to filter and select nodes that satisfy a specific condition. It is often
specified in predicates (enclosed in square brackets), and it allows us to narrow down the set of
nodes that match a given path.
For example:
/bookstore/book[price>20]/title
/bookstore/book[author='John Doe']/title
Here, [price>20] and [author='John Doe'] are qualifiers that filter the book nodes based on
specific conditions: the first selects the titles of books priced above 20, and the second selects
the titles of books written by John Doe.
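As a hedged illustration, the following Python sketch (assuming the third-party lxml library is
available) evaluates the qualifier-based paths above against the bookstore document:

from lxml import etree  # third-party library; assumed to be installed

doc = etree.fromstring("""
<bookstore>
  <book><title>XML Basics</title><author>John Doe</author><price>29.99</price></book>
  <book><title>XPath for Beginners</title><author>Jane Smith</author><price>19.99</price></book>
</bookstore>""")

# Qualifier on a numeric value: titles of books costing more than 20
print(doc.xpath("/bookstore/book[price>20]/title/text()"))           # ['XML Basics']

# Qualifier on a string value: titles of books written by John Doe
print(doc.xpath("/bookstore/book[author='John Doe']/title/text()"))  # ['XML Basics']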
(b) Explain Skolem Function in the context of pattern matching in XML query
language with suitable examples.
Skolem Function:
In the context of pattern matching, Skolem functions are used to handle existential variables
that are introduced in the process of query formulation, especially in XML query languages and
databases. They are typically used in rule-based systems and in logic programming to
convert existential quantifiers into a form that can be processed in a query.
In XML querying, when you want to find or match nodes based on certain patterns, you may
encounter situations where you need to represent variables whose values do not exist explicitly
in the XML document. Skolemization replaces these existential variables with Skolem functions,
allowing queries to be processed more efficiently.
Example:
<library>
<book>
<title>XML Basics</title>
<author>John Doe</author>
</book>
<book>
<title>XPath for Beginners</title>
<author>Jane Smith</author>
</book>
</library>
If we wanted to query for books written by authors whose names are not known, we could use a
Skolem function to generate a fresh constant for each author.
book[author="Skolem:author(1)"]/title
In this case, Skolem:author(1) represents a fresh function that can match any author
dynamically, and the query is flexible enough to return results even when the author values are
unknown at query time.
(c) Briefly explain the pattern-matching-based implementation in
semi-structured data query languages.
Semi-structured data (such as XML, JSON, or NoSQL databases) lacks a fixed schema and
does not follow a rigid, table-like structure. Query languages for semi-structured data therefore
rely on pattern matching to retrieve relevant data: the user defines patterns that describe the
structure and content of the data of interest, and the system finds the data that fits those
patterns. Pattern matching typically involves:
● Specifying node relationships: Defining how elements are related or how nodes
should be traversed (e.g., child, parent, descendant).
● Filters/conditions: Using predicates or conditions to restrict the result set (e.g., filtering
based on attribute values or text content).
● Wildcards: Allowing for flexible querying when parts of the structure are unknown or
vary.
Example:
<store>
<product id="101">
<name>Shampoo</name>
<price>5.99</price>
</product>
<product id="102">
<name>Conditioner</name>
<price>6.99</price>
</product>
</store>
A pattern query over this document:
//product[name="Shampoo"]/price
This will return the price of the product where the name is "Shampoo."
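A short Python sketch (using only the standard library's xml.etree.ElementTree, as an
illustrative assumption) that evaluates this pattern:

import xml.etree.ElementTree as ET

root = ET.fromstring("""
<store>
  <product id="101"><name>Shampoo</name><price>5.99</price></product>
  <product id="102"><name>Conditioner</name><price>6.99</price></product>
</store>""")

# ElementTree's limited XPath support is enough for this child/predicate pattern
for price in root.findall(".//product[name='Shampoo']/price"):
    print(price.text)   # 5.99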
In semi-structured data query languages, patterns are often specified using XPath, XQuery, or
other query languages that support wildcard and hierarchical traversal. These languages
provide flexible tools to describe patterns and filter results based on conditions or values.
Summary:
● XPath is a language for querying and selecting elements from XML documents, and its
data model is tree-based, where elements, attributes, and text form nodes.
● The qualifier in XPath is used to filter nodes based on specific conditions.
● Skolem functions in pattern matching in XML query languages handle existential
variables by introducing fresh constants.
● Pattern matching in semi-structured data query languages is essential for querying data
where the structure is flexible and not strictly predefined, allowing for dynamic and
content-based querying.
Q2. (a) Briefly explain about Resource Description Framework (RDF). What are
the usages of RDF?
(b) Will the adoption of Big Data have any impact on day-to-day
business operations?
(c) How is Big Data generated? Explain how it has affected data
modelling techniques. What are the challenges in handling Big Data?
Answer:
(a) Briefly explain Resource Description Framework (RDF). What are the
usages of RDF?
RDF is a framework used for representing structured information about resources in the World
Wide Web. It is a standard developed by the World Wide Web Consortium (W3C) to provide a
way to describe relationships between objects (resources) in the form of
subject-predicate-object triples. Each triple consists of:
● Subject: the resource being described.
● Predicate: the property of, or relationship from, the subject.
● Object: the value of the property, which can be a literal or another resource.
In RDF, a resource can be anything identified by a URI (Uniform Resource Identifier), and the
predicate is also a URI, which represents a property of the resource. The object can either be
a literal value (such as a number or string) or another resource.
Example:
<http://example.org/book123> <http://purl.org/dc/elements/1.1/title> "The Great Gatsby" .
This RDF triple states that the book with the URI http://example.org/book123 has the
title "The Great Gatsby" (the Dublin Core title property is used here as the predicate).
Usage of RDF:
1. Linked Data: RDF allows data to be interlinked across different sources, facilitating the
creation of the "Web of Data" where information from various databases can be linked
and accessed in a standardized way.
2. Semantic Web: RDF forms the backbone of the Semantic Web, where data is annotated
with meaning (semantics) to enable machines to understand and process information
automatically.
3. Metadata Representation: RDF is used for representing metadata about resources,
such as the author of a book, the publisher, and other properties.
4. Knowledge Graphs: RDF is used in creating knowledge graphs, which are structured
representations of entities and the relationships between them (e.g., Google’s
Knowledge Graph).
5. Data Integration: RDF can integrate data from multiple heterogeneous sources, making
it useful in various applications like bioinformatics, digital libraries, and more.
(b) Will the adoption of Big Data have any impact on day-to-day business
operations?
Yes, the adoption of Big Data significantly impacts day-to-day business operations in several
ways, for example: faster, data-driven decision-making; personalized customer experiences and
targeted marketing; improved operational efficiency through monitoring and predictive
maintenance; and better fraud detection and risk management.
However, businesses must also deal with challenges like data privacy concerns, the need for
specialized skills, and the significant infrastructure costs associated with handling large volumes
of data.
(c) How is Big Data generated? Explain how Big Data has affected data
modeling techniques. What are the challenges in handling Big Data?
Major sources of Big Data include:
1. Social Media: Platforms like Facebook, Twitter, Instagram, etc., produce vast amounts
of data in the form of posts, comments, images, videos, etc.
2. Sensor Data: IoT (Internet of Things) devices, such as sensors in smart homes,
wearables, industrial machinery, and vehicles, generate continuous streams of data.
3. Transaction Data: Online transactions, financial data, and retail data are constantly
being generated through purchases, payments, and other business activities.
4. Web Data: Website interactions, such as clicks, page views, search queries, and online
behaviors, produce large-scale data.
5. Log Files: Servers, networks, and applications generate logs containing detailed
information about system operations, performance, and security.
6. Multimedia: Audio, video, and images captured from cameras, smartphones, and other
devices contribute to Big Data.
7. Public Data: Government databases, research publications, and open data initiatives
are valuable sources of Big Data.
Big Data has significantly influenced traditional data modeling techniques due to the scale,
complexity, and variety of data. Key impacts include:
1. Schema Flexibility: Unlike structured data (e.g., relational databases), Big Data often
involves unstructured or semi-structured data (e.g., JSON, XML, or text). This requires
more flexible and dynamic data models that can handle data changes over time.
2. NoSQL Databases: Traditional relational databases, with fixed schemas, are often not
suitable for Big Data. NoSQL databases (e.g., MongoDB, Cassandra) have gained
popularity for storing large volumes of unstructured data. These databases use flexible
schema designs (document-based, key-value pairs, etc.) that allow for more scalability
and performance.
3. Distributed Data Models: Big Data is typically stored and processed across distributed
systems (e.g., Hadoop, Spark). As a result, data models must accommodate distributed
storage and parallel processing.
4. Data Lake: Traditional data warehouses, which rely on structured data, are being
replaced by data lakes. Data lakes store raw, unprocessed data from multiple sources,
allowing for more flexibility and scalability, but they also require sophisticated tools for
data governance and processing.
Challenges in handling Big Data:
1. Data Volume: Managing and storing the sheer volume of data generated every second
poses a significant challenge. Traditional storage solutions often cannot scale effectively
to handle this growth.
2. Data Variety: Big Data comes in many forms, including structured, unstructured, and
semi-structured data. Storing, processing, and analyzing this diverse data requires
different technologies and tools.
3. Data Velocity: The speed at which data is generated (real-time or near real-time) makes
it challenging to process and analyze it in a timely manner. Real-time analytics platforms
are needed for fast decision-making.
4. Data Quality: The vast volume of Big Data often contains errors, inconsistencies, and
noise. Ensuring data quality and cleaning data before analysis is a significant challenge.
5. Security and Privacy: Storing and processing Big Data can lead to privacy concerns,
especially with personal or sensitive data. Implementing robust security measures and
complying with data privacy regulations (e.g., GDPR) is crucial.
6. Infrastructure and Scalability: Big Data requires powerful hardware and scalable cloud
infrastructure to store, manage, and process large datasets. Ensuring that the system
can scale with the growing data volume is critical.
7. Skill Gaps: The complexity of Big Data technologies requires specialized skills in data
science, machine learning, distributed systems, and data engineering. Organizations
often face challenges in hiring and retaining skilled professionals.
8. Integration: Integrating Big Data with legacy systems and databases can be complex
and costly, especially when the data comes from multiple sources.
In summary, Big Data presents both exciting opportunities and significant challenges in terms of
its generation, storage, analysis, and management. Advanced technologies, proper
infrastructure, and skilled professionals are required to fully harness its potential while
overcoming these challenges.
Q3. (a) Explain the implication of the Hadoop Framework in the context of
Big Data. Draw a critical comparison between Relational Database
Management System and Hadoop-based data management system.
(b) Explain the characteristics of the Big Data ecosystem.
(c) Explain the concepts of mapper and reducer in Map-Reduce
techniques with suitable example.
[5+3+4]
Answer:
(a) Explain the implication of the Hadoop Framework in the context of Big
Data. Draw a critical comparison between Relational Database Management
System (RDBMS) and Hadoop-based Data Management System.
Hadoop is an open-source framework for storing and processing large volumes of data in a
distributed environment. It was designed to handle Big Data, which is characterized by the
three Vs: Volume, Variety, and Velocity. Hadoop allows businesses to store vast amounts of
data across many machines and provides tools for processing and analyzing that data.
The core components of the Hadoop framework are:
● Hadoop Distributed File System (HDFS): A distributed file system designed to store
large files across multiple machines.
● MapReduce: A programming model for processing large datasets in parallel across a
Hadoop cluster.
● YARN (Yet Another Resource Negotiator): Manages and schedules resources for the
various applications running on Hadoop.
● Hadoop Ecosystem: Other tools and frameworks built on top of Hadoop (e.g., Hive, Pig,
HBase) that facilitate querying, analyzing, and managing Big Data.
Implications of Hadoop in the context of Big Data:
1. Scalability: Hadoop can scale horizontally by adding more machines to the cluster,
allowing organizations to handle massive amounts of data. This is a significant
advantage over traditional data storage solutions, which often rely on vertical scaling
(adding more power to a single machine).
2. Fault Tolerance: HDFS ensures that data is replicated across multiple nodes, providing
high availability and fault tolerance. If a node fails, the data remains accessible from
another node.
3. Cost-Effectiveness: Hadoop leverages commodity hardware, which makes it more
affordable than traditional RDBMS solutions that require expensive, high-performance
hardware.
4. Flexibility: Hadoop can process structured, semi-structured, and unstructured data,
making it versatile for a variety of data types, from transactional data to social media
content, logs, and multimedia.
5. Parallel Processing: The MapReduce model allows for distributed processing of large
datasets by breaking tasks into smaller chunks that can be processed concurrently
across different nodes in the cluster.
Comparison between RDBMS and Hadoop-based data management systems:
● Data Storage: RDBMS stores data in tables with predefined schemas; Hadoop stores
data in HDFS across distributed systems.
● Fault Tolerance: RDBMS offers limited fault tolerance and relies on backups; Hadoop
provides high fault tolerance through data replication in HDFS.
● Querying: RDBMS uses SQL queries for structured data; Hadoop uses NoSQL-style
querying (e.g., Hive, HBase) for diverse data types.
● Performance: RDBMS is optimized for smaller datasets and transactional operations;
Hadoop is optimized for large-scale batch processing of Big Data.
In summary, RDBMS is best suited for structured data and transactional workloads, whereas
Hadoop-based systems are optimized for handling Big Data, offering scalability, flexibility, and
cost-efficiency for distributed data processing.
(b) Explain the characteristics of the Big Data Ecosystem.
The Big Data Ecosystem refers to the collection of technologies and tools that work together to
store, process, analyze, and manage Big Data. It encompasses a variety of components that
interact to meet the needs of large-scale data processing. Key characteristics of the Big Data
Ecosystem include:
1. Data Storage:
○ Distributed Storage: Big Data is often stored across multiple machines to
ensure scalability and fault tolerance. Hadoop's HDFS (Hadoop Distributed File
System) is a common storage solution.
○ Data Lakes: A central repository that stores raw data in any format, which can be
processed later (structured, unstructured, or semi-structured).
○ NoSQL Databases: Databases like HBase, Cassandra, and MongoDB are part
of the Big Data ecosystem for storing large amounts of semi-structured or
unstructured data.
2. Data Processing:
○ Batch Processing: Tools like MapReduce and Apache Spark are used for
processing large datasets in batches.
○ Real-time Processing: Tools like Apache Kafka, Apache Flink, and Apache
Storm are used for processing streaming data in real-time.
3. Data Analysis:
○ Data Mining and Machine Learning: Big Data analysis often involves using
algorithms to detect patterns, correlations, and predictive insights. Tools like
MLlib in Spark or TensorFlow for deep learning are widely used.
○ Business Intelligence Tools: Tools like Hive, Impala, and Presto allow for
SQL-like querying of data stored in Hadoop.
4. Data Governance:
○ Data Quality: Ensuring the accuracy, consistency, and reliability of data through
various data cleansing and validation tools.
○ Data Security and Privacy: Ensuring data is protected, especially in regulated
industries, through encryption, access control, and compliance measures.
5. Data Integration:
○ Tools like Apache Nifi and Apache Sqoop help integrate data from multiple
sources into a cohesive, accessible format.
6. Scalability and Fault Tolerance:
○ Big Data tools are designed to scale horizontally by adding more nodes and
ensuring that the system remains operational even when individual components
fail.
7. Tools and Frameworks:
○ The ecosystem includes several tools, such as Apache Hive, Pig, HBase,
Kafka, Apache Flink, and others that provide a range of functionalities from data
storage to analytics.
Overall, the Big Data ecosystem is highly dynamic and continues to evolve with new tools and
frameworks that address emerging challenges and use cases.
(c) Explain the concepts of Mapper and Reducer in Map-Reduce techniques with
a suitable example.
Mapper:
● The Mapper is responsible for taking input data and processing it into intermediate
key-value pairs. It performs the initial data transformation step.
● It reads input data, applies a transformation function, and emits key-value pairs.
Reducer:
● The Reducer takes the intermediate key-value pairs produced by the Mappers and
processes them further. It aggregates or combines the values for each key and
generates the final output.
Example: Consider the following input, where each line contains one fruit name:
apple
banana
apple
banana
orange
We want to count the occurrences of each fruit (i.e., a word count program).
Mapper Function: The Mapper reads each line of the input data and emits a key-value pair
where the key is the fruit name, and the value is 1. For the input above, the Mapper would
produce the following intermediate key-value pairs:
("apple", 1)
("banana", 1)
("apple", 1)
("banana", 1)
("orange", 1)
1. The Mapper simply processes each record and outputs intermediate key-value pairs,
where the key is the fruit and the value is the number 1.
Shuffling and Sorting (Intermediate Step): After the Mapper emits its key-value pairs, the
shuffle and sort phase groups all the values by key. This results in:
("apple", [1, 1])
("banana", [1, 1])
("orange", [1])
2. The shuffle and sort phase groups all intermediate pairs by key and passes each key, with
its list of values, to the Reducer.
Reducer Function: The Reducer receives these grouped key-value pairs. It sums the values
for each key to get the total count for each fruit:
("apple", 2)
("banana", 2)
("orange", 1)
3. Finally, the output is the total count for each fruit, which is the result of the MapReduce
job.
Summary of Functions:
● Mapper: Breaks down the input into key-value pairs and processes them.
● Reducer: Aggregates the values for each key and produces the final output.
This paradigm is highly parallelizable, meaning that multiple Mappers can run on different
nodes, processing chunks of data independently. The Reducer then aggregates results from all
Mappers, making MapReduce highly efficient for large-scale data processing.
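A minimal pure-Python sketch of this word count, imitating the map, shuffle, and reduce phases
(an illustrative assumption, not actual Hadoop MapReduce code):

from collections import defaultdict

def mapper(line):
    # Emit (fruit, 1) for each record, as the Mapper above does
    fruit = line.strip()
    if fruit:
        yield (fruit, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the shuffle-and-sort phase does
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Sum the counts for each fruit
    return (key, sum(values))

lines = ["apple", "banana", "apple", "banana", "orange"]
intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(intermediate)
result = [reducer(k, v) for k, v in grouped.items()]
print(result)   # [('apple', 2), ('banana', 2), ('orange', 1)]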
Conclusion:
● Mapper handles the splitting and processing of data into key-value pairs.
● Reducer handles the aggregation of the intermediate results to produce the final output.
● MapReduce is a core component of the Hadoop ecosystem for processing Big Data in
parallel across distributed systems.
Q4. (a) Briefly explain the batch and operational data processing in
big data scenarios with examples.
(b) How effectively can benefits be achieved via parallelization in big
data processing?
(c) Explain about Distributed Hash Table in light of Key-Value Store
databases. How can it handle Put and Get functions with proper fault
tolerance? Give an example.
[4+3+5]
Answer:
(a) Briefly explain the batch and operational data processing in Big Data
scenarios with examples.
Batch Data Processing:
Batch data processing refers to the processing of large volumes of data in chunks or "batches,"
usually on a scheduled or periodic basis, rather than in real time. It is suited for situations where
immediate processing is not required, and data can be accumulated over time before being
processed.
Key Characteristics:
● Latency: Processing is done in intervals (e.g., hourly, daily) and involves large datasets.
● Efficiency: Batch processing is optimized for large-scale data operations that do not
need immediate results.
● Complexity: Typically involves complex data transformations, aggregations, and
business logic.
Example: An example of batch data processing is a retail company collecting transaction data
throughout the day and running a batch job at midnight to update the inventory database,
calculate daily sales totals, and generate reports. Tools like Hadoop MapReduce or Apache
Spark are often used for such tasks in a Big Data environment.
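As a toy illustration of such a batch aggregation (plain Python standing in for a Spark or
MapReduce job, with made-up sample data):

import csv, io
from collections import defaultdict

# A day's accumulated transactions (in practice a large file on HDFS or object storage)
raw = io.StringIO("product,amount\nLaptop,1200\nShampoo,5.99\nLaptop,1150\n")

daily_totals = defaultdict(float)
for row in csv.DictReader(raw):
    daily_totals[row["product"]] += float(row["amount"])

# The nightly batch run writes aggregated totals used for reports and inventory updates
print(dict(daily_totals))   # {'Laptop': 2350.0, 'Shampoo': 5.99}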
Operational (Real-Time) Data Processing:
Operational data processing (or real-time data processing) refers to the immediate processing
of data as it is generated. It focuses on providing real-time insights and actions, which is
essential for applications that require up-to-the-minute or live data updates.
Key Characteristics:
● Low latency: Data is processed within seconds or milliseconds of arrival.
● Continuous operation: The pipeline runs constantly on incoming streams rather than on
scheduled batches.
● Event-driven actions: Results immediately trigger decisions, alerts, or updates.
Example: A payment provider analyzing each credit-card transaction as it happens to flag
possible fraud, typically using streaming tools such as Apache Kafka, Flink, or Storm.
(b) How effectively can benefits be achieved via parallelization in Big Data
processing?
Parallelization is a key concept in Big Data processing: it leverages multiple computing
resources to perform tasks simultaneously, making processing far more efficient and scalable.
Big Data systems such as Hadoop and Spark rely on parallel processing to handle the large
volume, velocity, and variety of data. The main benefits are faster processing, because
independent chunks of data are handled concurrently; horizontal scalability, since more
machines can be added to increase throughput; fault tolerance, because failed tasks can be
re-executed on other nodes; and better utilization of cluster resources.
Example: Consider a log analysis task, where a company needs to process terabytes of log
data to detect system anomalies. Using parallelization, the log files are split into smaller chunks,
each of which is processed independently by different machines in a cluster. This allows for fast,
efficient processing of the entire dataset in a fraction of the time compared to sequential
processing.
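A small Python sketch of this idea using the standard multiprocessing module (an illustrative
stand-in for a real cluster, with made-up log lines):

from multiprocessing import Pool

def count_errors(chunk):
    # Each worker scans its own chunk of log lines independently
    return sum(1 for line in chunk if "ERROR" in line)

if __name__ == "__main__":
    log_lines = ["INFO ok", "ERROR disk full", "INFO ok", "ERROR timeout"] * 1000
    chunks = [log_lines[i::4] for i in range(4)]         # split the log into 4 chunks
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_errors, chunks)  # chunks processed in parallel
    print(sum(partial_counts))                           # aggregate, like a reduce step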
(c) Explain about Distributed Hash Table in light of Key-Value Store
databases. How can it handle Put and Get functions with proper fault
tolerance? Give an example.
A Distributed Hash Table (DHT) is a decentralized, distributed system used to store and
retrieve key-value pairs. It is often used in systems like Key-Value Stores (e.g., Cassandra,
Riak, DynamoDB) to enable fast lookups, even in large-scale distributed systems. DHTs allow
data to be stored across many nodes (servers), with the distribution of data determined by the
hash of the key.
How a DHT handles PUT and GET operations:
1. Hash Function: A hash function is used to map a key to a specific location in a large
distributed system. The key is hashed into a numeric value, which determines where the
corresponding data is stored.
2. Partitioning: In a DHT, the hash space is divided among multiple nodes. Each node is
responsible for a range of the hash values. When a key is hashed, it is mapped to the
appropriate node responsible for storing that key-value pair.
3. Lookup and Retrieval: When performing a GET operation, the key is hashed, and the
system knows exactly where to find the corresponding value, which is retrieved from the
appropriate node.
4. Fault Tolerance: In DHT-based systems, fault tolerance is achieved through replication.
Each piece of data is replicated to multiple nodes, ensuring that if one node fails, the
data can still be retrieved from other replicas. This redundancy ensures high availability
and reliability of the data.
DHTs are designed to handle node failures gracefully. Here’s how fault tolerance is achieved:
● Replication: Data is replicated across multiple nodes in the system. If one node goes
down, another replica of the data on a different node can be used to serve the request.
● Consistency and Quorum: Many DHT-based systems use a quorum-based approach,
where a majority of nodes must agree on a data operation (either PUT or GET). This
ensures that the data is consistent and available, even in the case of failures.
● Dynamic Node Joining and Leaving: In DHTs, nodes can join or leave the system
dynamically. The system rehashes and redistributes the data among the new set of
nodes, ensuring the system remains balanced and fault-tolerant.
Example: Amazon DynamoDB
DynamoDB is a NoSQL key-value store that uses principles of DHT to ensure scalability and
availability. When a key-value pair is inserted (PUT), DynamoDB stores the data across multiple
nodes, replicating it to ensure fault tolerance. When a GET request is made for a particular key,
the system hashes the key and directs the request to the node responsible for that key,
retrieving the associated value. If the node is unavailable, DynamoDB can retrieve the data from
another replica.
Example in Action:
1. PUT Operation:
○ Key: user123
○ Value: { "name": "John Doe", "email": "[email protected]"
}
○ Hash the key user123 using the hash function.
○ Store the key-value pair on the node determined by the hash value.
○ Replicate the data to two other nodes for fault tolerance.
2. GET Operation:
○ Key: user123
○ Hash the key user123 and determine which node is responsible.
○ Retrieve the data from the node (or any replica if the node is down).
Fault Tolerance:
If one of the nodes goes down, the system can still access the data from the replicas stored on
other nodes, ensuring that the PUT and GET operations remain available.
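A simplified Python sketch of hash-based partitioning with replication (purely illustrative; the
class, node names, and replication scheme are assumptions, not any real database's API):

import hashlib

class SimpleDHT:
    """Toy hash-partitioned key-value store with replication (illustrative only)."""

    def __init__(self, nodes, replicas=3):
        self.nodes = nodes                      # e.g. ["node0", "node1", "node2", "node3"]
        self.replicas = replicas
        self.storage = {n: {} for n in nodes}   # per-node in-memory storage

    def _preference_list(self, key):
        # Hash the key, then pick `replicas` consecutive nodes to hold copies
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        start = h % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.replicas)]

    def put(self, key, value):
        for node in self._preference_list(key):
            self.storage[node][key] = value     # replicate the pair to several nodes

    def get(self, key, down=()):
        for node in self._preference_list(key):
            if node not in down and key in self.storage[node]:
                return self.storage[node][key]  # first live replica answers
        raise KeyError(key)

dht = SimpleDHT(["node0", "node1", "node2", "node3"])
dht.put("user123", {"name": "John Doe", "email": "[email protected]"})
print(dht.get("user123", down=("node1",)))      # still served by another replica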
Conclusion:
● Distributed Hash Tables (DHTs) are an efficient and fault-tolerant way to handle
key-value store databases in distributed systems. They allow for efficient data lookup
and retrieval with built-in mechanisms for fault tolerance, such as replication and
dynamic partitioning.
● Fault tolerance in DHTs ensures that the system can recover from node failures without
losing data, making it highly reliable for large-scale distributed systems.
Q5. (a) Write a short note on (1) CAP theory (2) BASE theory, with
suitable examples.
(b) Compare critically the row store and column store in the context of
data storage. Explain the Join Indexing system and Compression in
column-based databases with suitable examples.
Answer:
(a) (1) CAP Theorem:
The CAP Theorem (also known as Brewer's Theorem), proposed by Eric Brewer in 2000,
states that a distributed data store can only guarantee two out of the following three properties
at any given time:
1. Consistency (C): All nodes in the system see the same data at the same time. Every
read operation will return the most recent write, ensuring that no outdated information is
returned.
2. Availability (A): Every request (read or write) will receive a response, even if some
nodes are down. The system remains operational for all requests, even in the face of
failures.
3. Partition Tolerance (P): The system will continue to function, even if there are network
partitions (communication failures between nodes). In other words, the system can still
process requests even when parts of the network are inaccessible.
According to CAP, a distributed system must choose which two of the three properties to
prioritize, making trade-offs based on its requirements. For example, Cassandra is usually
configured to favour availability and partition tolerance (AP), whereas HBase favours
consistency and partition tolerance (CP).
(2) BASE Theory:
BASE is an acronym that stands for Basically Available, Soft state, and Eventually
consistent, which is a set of principles used in the design of highly available and fault-tolerant
distributed databases. BASE is often used as an alternative to the ACID (Atomicity, Consistency,
Isolation, Durability) properties of traditional relational databases, especially in distributed and
NoSQL databases.
1. Basically Available (BA): The system guarantees availability, meaning the system will
always respond to requests, even if the response may not be the most up-to-date data.
2. Soft State (S): The system's state is not guaranteed to be consistent at all times. The
state of the system can change over time, even without new inputs, allowing for eventual
consistency.
3. Eventually Consistent (E): The system ensures that data will eventually be consistent
across all nodes, but there is no guarantee of consistency at any given moment.
Updates will propagate to all nodes eventually, but not immediately.
Example of BASE:
● Amazon DynamoDB follows the BASE model. When a request is made, it ensures
availability by returning a response, even if the data might not yet be consistent across
all replicas. Eventually, all replicas will synchronize and become consistent, but in the
interim, the system may return stale data.
Difference between BASE and ACID: ACID guarantees consistency and correctness at the
moment of a transaction (ideal for relational databases), while BASE sacrifices immediate
consistency in favor of scalability and availability in distributed systems (ideal for NoSQL
databases like Cassandra, MongoDB, etc.).
(b) Compare critically the row store and column store in the context of data storage. Explain
Join Indexing and Compression in column-based databases with suitable examples.
Row Store and Column Store are two primary ways to organize and store data in databases.
The choice between row-based and column-based storage depends on the type of operations
that need to be performed on the data.
● Row Store:
○ Storage: In a row-oriented database, data is stored row by row. Each row
contains all the values for a record, and all columns for a given row are stored
together.
○ Use Case: Row stores are better suited for transactional applications (OLTP),
where entire records are frequently read, updated, or written.
○ Examples: MySQL, PostgreSQL, and Oracle databases use row-based
storage.
● Advantages:
○ Efficient for read and write operations that involve entire records.
○ Better for OLTP workloads where individual records are retrieved or modified.
● Disadvantages:
○ Less efficient for analytical queries that only need to read specific columns of a
large dataset.
● Column Store:
○ Storage: In a columnar database, data is stored column by column. All values for
a given column are stored together, rather than being stored row by row.
○ Use Case: Column stores are optimized for read-heavy analytical workloads
(OLAP), where queries often need to aggregate data from specific columns.
○ Examples: Apache HBase, Google Bigtable, and Amazon Redshift use
column-based storage.
● Advantages:
○ High performance for queries that need to read large datasets but only a few
columns.
○ Excellent compression because data in a column is often similar (e.g., all values
in a column are of the same type).
● Disadvantages:
○ Less efficient for transactional workloads, especially when entire records need to
be updated or read.
○ Complex to implement and manage in some cases.
Comparison of row store and column store:
● Data Storage: Row store keeps data row by row (one record per row); column store keeps
data column by column (all values of one column stored together).
● Best for: Row store suits OLTP systems (e.g., banking, order processing); column store
suits OLAP systems (e.g., data analytics, reporting).
● Performance: Row store is efficient for CRUD operations on individual rows; column store
is efficient for reading large datasets with only a few columns.
● Compression: Row store compresses less efficiently (heterogeneous data in a row); column
store compresses highly efficiently (similar values in a column).
● Example DBs: Row stores include MySQL, PostgreSQL, SQL Server; column stores include
Apache HBase, Google Bigtable, Amazon Redshift.
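Because a column stores many similar values together, simple schemes such as run-length
encoding compress it well; a minimal Python sketch (illustrative only, with a made-up column):

def rle_encode(column):
    """Run-length encode a column of values into (value, run_length) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

# A sorted "country" column in a column store holds long runs of identical values
country_column = ["US", "US", "US", "US", "IN", "IN", "UK", "UK", "UK"]
print(rle_encode(country_column))   # [('US', 4), ('IN', 2), ('UK', 3)]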
Join Indexing:
Join Indexing helps speed up queries that involve joining tables. Instead of performing a full
scan of both tables, the system creates an index that directly stores the relationship between
rows in the two tables. This index reduces the need for expensive joins by keeping track of
matching keys between tables.
Example: In a typical SQL JOIN operation, an index on the foreign key in the child table (e.g.,
Orders.CustomerID in the Orders table) can speed up the join with the Customers table:
SELECT Customers.Name, Orders.OrderID
FROM Customers
JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
● If an index exists on CustomerID in both tables, the database can use this index to
quickly find the relevant matching rows rather than scanning both tables entirely.
Q6. (a) Compare Document data model and Relational data model with
suitable examples.
(b) Describe the difference between Embedded Document and
Referenced Document in the context of Document-Oriented Databases.
Give examples.
(c) Explain about the Quorum Consensus mechanism to provide strong
consistency in Key-Value store database systems. Give an example.
Answer:
(a) Compare Document Data Model and Relational Data Model with Suitable
Example
The Relational Data Model is the foundation of traditional relational databases like MySQL,
PostgreSQL, and Oracle. It organizes data into tables (also called relations) with rows and
columns, where each row represents a record, and each column represents an attribute of the
record.
● Data Structure: Data is stored in tables, and each table consists of rows (tuples) and
columns (attributes).
● Schema: The schema is predefined, meaning that the structure of the data (the tables
and columns) is fixed before data insertion. The schema defines how data should be
organized.
● Relationships: Tables are linked together using foreign keys, which represent
relationships between data stored in different tables. Relationships can be one-to-one,
one-to-many, or many-to-many.
● Example: Consider a simple relational database with two tables: Customers and
Orders.
Customers Table: columns CustomerID, Name, Address (e.g., CustomerID 1, "Alice",
"123 Main St.").
Orders Table: columns OrderID, CustomerID, Product, Amount (e.g., OrderID 101,
CustomerID 1, "Laptop", 1200).
The CustomerID in the Orders table is a foreign key linking the Orders table to the
Customers table.
● Advantages:
○ Structured and normalized data for efficient querying.
○ ACID (Atomicity, Consistency, Isolation, Durability) compliance ensures data
integrity.
● Disadvantages:
○ Difficult to scale horizontally.
○ Fixed schema can be inflexible for unstructured data.
The Document Data Model is the foundation of NoSQL databases like MongoDB, CouchDB,
and RavenDB. It stores data in a document-like format, typically JSON, BSON, or XML, which
allows for a more flexible, schema-less structure compared to relational models.
● Data Structure: Data is stored in documents, which are collections of key-value pairs,
and may also contain nested structures (arrays, objects). Documents are grouped into
collections.
● Schema: The schema is flexible, meaning each document in a collection can have a
different structure. This allows for easy modification or addition of new fields without
affecting other documents.
● Relationships: Relationships between documents can be established through
embedding or referencing. Embedding is where related documents are stored within a
single document, while referencing involves using an ID to link documents across
collections.
Example: Consider a document database with collections for Customers and Orders.
Customer Document (in Customers collection):
{
"CustomerID": 1,
"Name": "Alice",
"Address": "123 Main St.",
"Orders": [
{"OrderID": 101, "Product": "Laptop", "Amount": 1200},
{"OrderID": 102, "Product": "Smartphone", "Amount": 800}
]
}
Order Document (in Orders collection):
{
"OrderID": 101,
"CustomerID": 1,
"Product": "Laptop",
"Amount": 1200
}
● Advantages:
○ Flexible schema allows for dynamic changes in structure without schema
migrations.
○ Suitable for hierarchical or nested data, reducing the need for complex joins.
● Disadvantages:
○ May not enforce data consistency across documents (depends on the database).
○ Potential for data duplication if embedding is used excessively.
(b) Difference Between Embedded Document and Referenced Document in
Document-Oriented Database
In document-oriented databases, there are two main ways to represent relationships between
documents: Embedded Documents and Referenced Documents.
Embedded Document:
An Embedded Document is a way of storing related data inside a single document. In this
approach, one document (child) is included within another (parent) document. This method is
useful when the related data is frequently accessed together.
● Usage: Ideal when the data is often read together and does not require updates to be
propagated across documents.
● Advantages:
○ Fast read performance, as all related data is stored together.
○ Simplifies the structure, especially for one-to-one or one-to-many relationships.
● Disadvantages:
○ Data duplication: If the same data is embedded in multiple documents, updates
to one part of the data must be replicated across all instances.
○ Potentially large documents: Embedding too much data can make documents
unwieldy and difficult to manage.
Example:
{
"CustomerID": 1,
"Name": "Alice",
"Address": "123 Main St.",
"Orders": [
{"OrderID": 101, "Product": "Laptop", "Amount": 1200},
{"OrderID": 102, "Product": "Smartphone", "Amount": 800}
]
}
Referenced Document:
A Referenced Document stores only the reference (ID) of the related document, rather than
embedding the entire document. This method is used when data is shared among multiple
documents, or when the relationship is many-to-many.
● Usage: Ideal when the related data is large, infrequently accessed, or shared across
multiple documents.
● Advantages:
○ Avoids data duplication by keeping related documents separate.
○ Makes it easier to update or modify related documents since there is only one
copy of the data.
● Disadvantages:
○ Requires additional queries (or joins) to fetch the related data, which can impact
performance.
○ Increases complexity in managing references.
Example:
{
"CustomerID": 1,
"Name": "Alice",
"Address": "123 Main St.",
"Orders": [
{"OrderID": 101},
{"OrderID": 102}
]
}
In this case, the customer document stores only the OrderIDs; the full order details live as
separate documents in the Orders collection, and each OrderID references one of those
documents.
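A hedged Python sketch of the referenced design, assuming the pymongo driver and a MongoDB
instance running locally (names and data are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Referenced design: orders live in their own collection...
db.orders.insert_many([
    {"OrderID": 101, "Product": "Laptop", "Amount": 1200},
    {"OrderID": 102, "Product": "Smartphone", "Amount": 800},
])
# ...and the customer document stores only the order references
db.customers.insert_one({
    "CustomerID": 1, "Name": "Alice", "Address": "123 Main St.",
    "Orders": [{"OrderID": 101}, {"OrderID": 102}],
})

# Resolving the references needs a second query (the extra cost mentioned above)
customer = db.customers.find_one({"CustomerID": 1})
order_ids = [o["OrderID"] for o in customer["Orders"]]
print(list(db.orders.find({"OrderID": {"$in": order_ids}})))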
(c) Explain about the Quorum Consensus mechanism to provide strong consistency in
Key-Value store database systems. Give an example.
In quorum-based replication, every data item is stored on several replicas, and each read or
write must involve a minimum number of them (a quorum):
● Quorum Write: A write operation is considered successful only when a majority of the
replicas have successfully written the data. This ensures that any subsequent reads
reflect the most recent write.
● Quorum Read: A read operation is considered successful only when a majority of the
replicas participate in the read operation, ensuring that the returned value is consistent
with the most recent write.
Example: Cassandra
Consider a Cassandra cluster with three replicas (nodes A, B, and C). When a write operation is
performed with a quorum write (2 out of 3 nodes), the data is written to two nodes, say A and
B. When a read operation is performed with a quorum read (2 out of 3 nodes), Cassandra will
query nodes A and B to ensure that the data returned is consistent with the latest write.
● If one of the nodes (say, C) is down, the system will still function as long as a quorum of
nodes (A and B) are available. However, if less than a quorum is available, the operation
may fail.
● Quorum-based approaches help prevent split-brain scenarios where different nodes
might have conflicting data.
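A toy Python sketch of the overlap rule W + R > N (illustrative logic only, not Cassandra's
actual implementation; N, W, R and the replica list are assumptions):

N = 3          # replicas per key
W = 2          # write quorum
R = 2          # read quorum
assert W + R > N, "quorums must overlap for strong consistency"

replicas = [{"value": None, "version": 0} for _ in range(N)]

def quorum_write(value, version, alive):
    acks = 0
    for i in alive:                       # only reachable replicas respond
        replicas[i] = {"value": value, "version": version}
        acks += 1
    return acks >= W                      # success only if the write quorum is met

def quorum_read(alive):
    replies = [replicas[i] for i in alive][:R]
    if len(replies) < R:
        raise RuntimeError("read quorum not met")
    return max(replies, key=lambda r: r["version"])["value"]  # newest version wins

assert quorum_write("v1", 1, alive=[0, 1])     # node 2 is down, write still succeeds
print(quorum_read(alive=[1, 2]))               # overlap at node 1 returns "v1"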
Advantages:
● Ensures strong consistency and avoids stale reads, even in the face of network
partitions or node failures.
● Balances consistency and availability based on the configuration of the quorum.
Disadvantages:
● Performance may be affected due to the need for multiple nodes to participate in each
operation.
● It requires careful configuration to ensure that the quorum size is appropriately set for the
desired trade-off between consistency, availability, and partition tolerance.
Q7. (a) What do you mean by the degree of a vertex in a directed graph?
Explain the handshaking theorem of graphs.
(b) List out the characteristics of Graph databases with examples. Point out
the advantages of Graph databases over Relational databases.
(c) Write a short note on ANY ONE: (i) N-ary Storage Model, (ii)
Decomposition Storage Model.
Answer:
(a) What do you mean by the degree of a vertex in a directed graph? Explain the
handshaking theorem of graphs.
In a directed graph (digraph), the degree of a vertex is the number of edges incident to it.
The degree is divided into two parts:
1. In-degree: The number of edges directed towards the vertex. It counts how many edges
end at that vertex.
2. Out-degree: The number of edges directed away from the vertex. It counts how many
edges start from that vertex.
Example: Consider a directed graph with edges A → B, B → C, B → E, and E → D.
● Vertex A has an out-degree of 1 (an edge to B), and an in-degree of 0 (no edge is
coming to A).
● Vertex B has an in-degree of 1 (an edge from A) and an out-degree of 2 (edges to C and
E).
● Vertex C has an in-degree of 1 (an edge from B) and an out-degree of 0 (no edge starts
from C).
Handshaking Theorem:
● In an undirected graph, the sum of the degrees of all the vertices is twice the number of
edges, because each edge contributes to the degree of both of its endpoints.
● In a directed graph, the sum of the in-degrees equals the sum of the out-degrees, and both
equal the number of edges, because every directed edge contributes exactly one to the
in-degree of its target vertex and one to the out-degree of its source vertex.
Example: Consider a directed graph with edges D → A, A → B, B → C, B → E, and E → B.
● In-degrees:
○ Vertex A: 1 (edge from D)
○ Vertex B: 2 (edges from A and E)
○ Vertex C: 1 (edge from B)
○ Vertex D: 0 (no edge points to D)
○ Vertex E: 1 (edge from B)
● Out-degrees:
○ Vertex A: 1 (edge to B)
○ Vertex B: 2 (edges to C and E)
○ Vertex C: 0 (no edge starts from C)
○ Vertex D: 1 (edge to A)
○ Vertex E: 1 (edge to B)
Sum of in-degrees = 1 + 2 + 1 + 0 + 1 = 5
Sum of out-degrees = 1 + 2 + 0 + 1 + 1 = 5
Both sums equal the number of edges (5), which verifies the handshaking theorem for directed
graphs.
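A small Python sketch that recomputes these degrees from the edge list above and checks the
handshaking property:

from collections import Counter

# Directed edges from the example above
edges = [("D", "A"), ("A", "B"), ("B", "C"), ("B", "E"), ("E", "B")]

out_degree = Counter(u for u, _ in edges)
in_degree = Counter(v for _, v in edges)

for v in sorted({"A", "B", "C", "D", "E"}):
    print(v, "in:", in_degree[v], "out:", out_degree[v])

# Handshaking check for directed graphs: both sums equal the number of edges
assert sum(in_degree.values()) == sum(out_degree.values()) == len(edges)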
(b) List out the characteristics of Graph databases with examples. Point out the advantages
of Graph databases over Relational databases.
A Graph Database is designed to store and manage graph structures, where data entities are
represented as vertices (nodes) and the relationships between them are represented as
edges. It is especially useful for scenarios where relationships between entities are crucial and
complex. Key characteristics include:
● Nodes and Edges: Graph databases represent data as nodes (entities) and edges
(relationships between entities).
● Properties: Both nodes and edges can have properties associated with them, allowing
for flexible data modeling.
● Flexible Schema: Graph databases allow a schema-less or flexible schema, meaning
the structure can evolve over time without altering existing data.
● Efficient Relationship Queries: Graph databases excel at handling complex queries
involving relationships and traversals, such as finding connections between entities.
● Traversal-Based Queries: Queries in graph databases often involve graph traversal,
where the system searches through connected nodes (vertices) and edges
(relationships).
Examples:
● Neo4j: A popular graph database used for network analysis, fraud detection,
recommendation systems, etc.
● Amazon Neptune: A fully managed graph database service by AWS that supports both
property graphs and RDF models.
Advantages of Graph databases over Relational databases:
● Relationships are stored directly with the data, so traversing connections does not require
expensive multi-table joins.
● The schema is flexible and can evolve as new node and edge types appear.
● Highly connected data (social networks, recommendations, fraud detection) is modelled
naturally, whereas the same data in a relational database needs many join tables and
complex queries.
(c) (i) N-ary Storage Model (NSM):
The N-ary Storage Model is the classic row-oriented way of laying out records on disk pages:
all the attribute values of one record (an n-ary tuple) are stored together, one record after
another, within a page.
● Representation: Each page holds complete records; a slot directory points to the start of
each record, so reading a record retrieves all of its attributes in a single access.
● Usage: Well suited to transactional (OLTP) workloads, where whole records are inserted,
read, or updated at a time; it is the default layout in row-oriented databases such as
MySQL and PostgreSQL.
Example: A Customers table with columns (CustomerID, Name, Address) stored under NSM
keeps each customer's ID, name, and address next to each other on the page, so fetching
customer 1 returns the complete record in one page access.
(ii) Decomposition Storage Model (DSM):
The Decomposition Storage Model involves breaking down a complex data structure into
smaller, more manageable parts or segments. Each part (substructure) is stored separately, and
relationships between them are maintained through pointers or references. This approach is
useful when dealing with large datasets or when certain components of the data need to be
accessed or updated independently of others.
Example: In the database storage context, DSM decomposes a relation vertically: a Customers
table with columns (CustomerID, Name, Address) is stored as separate sub-tables, one per
attribute, each paired with a record identifier, and a full record is reconstructed by joining
these sub-tables on that identifier. This column-wise decomposition underlies column-oriented
databases.
In contrast to the N-ary model, which keeps all attributes of a record together for fast
record-at-a-time access, the decomposition model favours attribute-at-a-time access, which
benefits analytical workloads and compression.