DD Sem II Answer
• A File Model defines how files are structured, accessed, and managed in an
operating system.
• It includes file attributes, operations (read, write), and access permissions.
• Definition: Data partitioning (fragmentation) refers to how data is divided and stored across multiple locations in a distributed system (a small illustration follows the list of types below).
• Types:
o Horizontal Partitioning – Divides rows of data.
o Vertical Partitioning – Divides columns of data.
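A small illustration (with made-up records) of the two partitioning styles:

```python
rows = [
    {"id": 1, "name": "Alice", "city": "Pune"},
    {"id": 2, "name": "Bob", "city": "Delhi"},
    {"id": 3, "name": "Carol", "city": "Pune"},
]

# Horizontal partitioning: split by rows (e.g., one fragment per city)
horizontal = {
    "Pune": [r for r in rows if r["city"] == "Pune"],
    "Delhi": [r for r in rows if r["city"] == "Delhi"],
}

# Vertical partitioning: split by columns (each fragment keeps the key)
vertical = {
    "names": [{"id": r["id"], "name": r["name"]} for r in rows],
    "cities": [{"id": r["id"], "city": r["city"]} for r in rows],
}

print(horizontal)
print(vertical)
```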
5. What is Parsing?
• Parsing is the process of analyzing source code or a query and converting it into a structured format (such as a parse tree) for execution.
• Example: Parsing an SQL query to execute commands.
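A minimal sketch of what parsing produces, using a deliberately simplified, hypothetical grammar for SELECT statements (a real parser builds a full parse tree and also performs semantic checks):

```python
import re

def parse_select(query):
    """Turn a basic SELECT statement into a structured dictionary."""
    pattern = (r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
               r"(?:\s+WHERE\s+(?P<where>.+))?$")
    match = re.match(pattern, query.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError("Syntax error: not a valid SELECT statement")
    return {
        "columns": [c.strip() for c in match.group("cols").split(",")],
        "table": match.group("table"),
        "where": match.group("where"),
    }

# Structured output that an optimizer/executor could work with
print(parse_select("SELECT name, age FROM users WHERE age > 30"))
# {'columns': ['name', 'age'], 'table': 'users', 'where': 'age > 30'}
```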
8. Features of SparkSQL
• Lets SQL queries run over structured data and mix freely with DataFrame/Dataset operations.
• Reads many data sources (Hive tables, JSON, Parquet, JDBC) and uses the Catalyst optimizer to plan queries.
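A minimal PySpark sketch of these features; it assumes a local Spark installation, and the table, column names, and data are made up for illustration:

```python
from pyspark.sql import SparkSession

# Entry point for Spark SQL
spark = (SparkSession.builder
         .appName("sparksql-demo")
         .master("local[*]")
         .getOrCreate())

# Create a DataFrame and expose it as a temporary SQL view
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# Run a standard SQL query; Catalyst plans and optimizes it
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```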
9. Define Clock Synchronization.
• A process of ensuring that all nodes in a distributed system agree on a common time.
• Example: Network Time Protocol (NTP).
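A minimal sketch of the round-trip offset estimate that NTP-style synchronization is built on; `request_server_time` is a hypothetical stand-in for a real exchange with a time server:

```python
import time

def request_server_time():
    # Hypothetical stand-in: a real client would query an NTP server here.
    return time.time() + 2.5   # pretend the server clock is 2.5 s ahead

t0 = time.time()                # client clock when the request is sent
server_time = request_server_time()
t1 = time.time()                # client clock when the reply arrives

# Cristian-style estimate: assume the reply took half the round trip
estimated_offset = server_time + (t1 - t0) / 2 - t1
print(f"Adjust local clock by {estimated_offset:+.3f} seconds")
```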
10. Define Data Scoping.
Types of distributed databases:

Type            Definition
Homogeneous     Uses the same database software on all nodes.
Heterogeneous   Uses different database software on different nodes.
• Centralized Query Optimization: A technique where the query execution plan is generated and optimized at a single central node rather than across multiple nodes.
16 Marks
Below are detailed, structured answers for each question. These responses are designed
to be comprehensive enough for a 16‐mark answer, covering key definitions, underlying
concepts, challenges, and examples.
File Models
File models define the logical structure and organization of data in a file. They determine
how data is conceptually arranged and interrelated, which in turn affects how information
is stored and retrieved. Major file models include:
• Hierarchical Model: Organizes data in a tree-like structure where each record has
a single parent. Useful in legacy systems and for representing one-to-many
relationships.
• Network Model: Permits a record to have multiple parent and child records,
forming a graph. This model is more flexible than the hierarchical model and better
represents complex relationships.
• Relational Model: Represents data as tables (relations) with rows and columns,
emphasizing relationships through foreign keys. It is widely used because of its
simplicity and powerful query language (SQL).
• Object-Oriented Model: Uses objects, classes, and inheritance to store data,
making it suitable for applications requiring tight integration with object-oriented
programming.
File Accessing Models
File accessing models determine how a file’s data can be accessed, read, and manipulated. The choice of model impacts the system’s performance and ease of use; a short sketch contrasting sequential and direct access follows the list below:
• Sequential Access: Data is read in a predetermined, linear order. It’s simple and
effective for processing files in full, but inefficient for random access.
• Direct (or Random) Access: Enables access to any part of a file without reading
preceding data. Ideal for applications where speed and immediate access are
necessary.
• Indexed Access: Uses indexes to locate data quickly. By maintaining an index
structure, this method provides a balance between sequential and random access,
improving search efficiency.
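A minimal sketch contrasting sequential and direct access on an ordinary file; the file name and fixed record size are illustrative assumptions:

```python
RECORD_SIZE = 16   # assume fixed-length 16-byte records

# Build a small sample file of fixed-length records
with open("records.bin", "wb") as f:
    for i in range(10):
        f.write(f"record-{i}".ljust(RECORD_SIZE).encode())

# Sequential access: read records one after another, in order
with open("records.bin", "rb") as f:
    first_two = [f.read(RECORD_SIZE) for _ in range(2)]

# Direct (random) access: seek straight to record 7 without reading 0-6
with open("records.bin", "rb") as f:
    f.seek(7 * RECORD_SIZE)
    record_seven = f.read(RECORD_SIZE)

print(first_two, record_seven)
```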
Summary:
File models focus on data structure and organization, while file accessing models
emphasize the methods by which data is read and written. Their proper design and
implementation are crucial for ensuring efficient data retrieval, storage integrity, and
system performance.
Distributed Data Storage
Distributed data storage involves spreading data across multiple physical locations or nodes rather than relying on a single storage unit. This approach addresses scalability, availability, and fault tolerance in modern computing environments; a small quorum sketch follows the list of mechanisms below.
• Data Partitioning (Sharding): Data is split into fragments or shards, each stored on
a different node. This enhances performance by parallelizing queries and reducing
load on any single node.
• Replication: Multiple copies of data are maintained across different nodes.
Replication increases fault tolerance by ensuring that if one node fails, another can
supply the data without interruption.
• Consistency and Synchronization: Distributed systems must ensure that
replicated data remains consistent. Techniques such as eventual consistency,
strong consistency, and quorum-based protocols are used to maintain integrity
across nodes.
• Scalability: Adding more nodes allows the system to handle increased load and
data volume without a significant drop in performance.
• Data Locality: Efficient distributed storage systems often aim to place data near
the user or computing resource to reduce latency and improve access times.
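A minimal sketch of the quorum intersection rule mentioned above, with N replicas, a read quorum of R nodes, and a write quorum of W nodes (the conditions only, not any particular system’s API):

```python
def quorums_guarantee_consistency(n, r, w):
    """Reads overlap the latest write when r + w > n, and two conflicting
    writes cannot both succeed on disjoint replica sets when w > n / 2."""
    return (r + w > n) and (w > n / 2)

# Typical configuration: 3 replicas, read from 2, write to 2
print(quorums_guarantee_consistency(n=3, r=2, w=2))   # True
# Read-one/write-one is fast but allows stale reads
print(quorums_guarantee_consistency(n=3, r=1, w=1))   # False
```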
Challenges:
• Network latency and partitions can delay or interrupt access to remote fragments and replicas.
• Keeping replicas consistent requires coordination protocols, which add overhead to reads and writes.
• Failure detection, recovery, and rebalancing of data across nodes add operational complexity.
Summary:
Distributed data storage combines partitioning, replication, and consistency mechanisms to deliver scalable, highly available storage, at the cost of extra coordination and management complexity.
Overview of HDFS
Hadoop Distributed File System (HDFS) is a key component of the Hadoop ecosystem
designed for high-throughput access to large datasets. It follows a master-slave
architecture to achieve reliability, scalability, and fault tolerance.
Core Components:
• NameNode (Master):
o Maintains the file system namespace, metadata, and directory structure.
o Manages file permissions and the mapping of file blocks to DataNodes.
o Acts as a single point of contact for clients during file operations.
• DataNodes (Slaves):
o Store the actual data blocks.
o Handle read/write requests from clients.
o Periodically send heartbeats and block reports to the NameNode to confirm
their status and data integrity.
• Block Storage: Files are split into large blocks (commonly 128 MB or 256 MB) that are distributed across multiple DataNodes, facilitating parallel data processing (a short block-count sketch follows this list).
• Replication: HDFS replicates data blocks (typically three copies by default) across
different DataNodes to ensure fault tolerance. If a node fails, the system can still
access data from another node.
• Fault Tolerance: The system continuously monitors DataNodes. If a node fails, the
NameNode reallocates the lost blocks and ensures that the desired replication
factor is maintained.
• High Throughput: HDFS is optimized for streaming large data sets rather than
supporting low-latency access. It is designed to deliver high aggregate throughput
for batch processing workloads.
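A small back-of-the-envelope sketch of how block size and replication factor translate into physical storage, using the default-style numbers mentioned above:

```python
import math

file_size_mb = 1024        # a 1 GB file
block_size_mb = 128        # common HDFS block size
replication_factor = 3     # HDFS default

blocks = math.ceil(file_size_mb / block_size_mb)
block_replicas = blocks * replication_factor
raw_storage_mb = file_size_mb * replication_factor

print(f"{blocks} blocks, {block_replicas} block replicas, "
      f"~{raw_storage_mb} MB of raw disk used")
# -> 8 blocks, 24 block replicas, ~3072 MB of raw disk used
```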
Summary:
HDFS pairs a single NameNode that manages metadata with many DataNodes that store replicated blocks, giving fault-tolerant, high-throughput storage for large, batch-oriented datasets.
Fundamental Concepts:
• Processes in a distributed system communicate only by exchanging messages over the network; there is no shared memory or global clock.
• Communication may be one-to-one (unicast), one-to-many (multicast), or one-to-all (broadcast).
Message Ordering:
• FIFO ordering: messages from the same sender are delivered in the order they were sent.
• Causal ordering: delivery respects cause-and-effect (happened-before) relationships between messages.
• Total ordering: every process delivers all messages in the same global order.
Group Communication:
• Sends a message to an entire group of processes as a single operation, with guarantees on reliability and ordering (e.g., reliable or atomic multicast).
Summary:
Message ordering and group communication primitives provide the delivery guarantees on which higher-level protocols such as replication and consensus are built.
Overview:
A Distributed Database Management System (DDBMS) manages a database that is stored
across multiple sites or nodes. It offers transparency and efficiency similar to a centralized
database while leveraging the benefits of distribution.
Key Characteristics:
• Distribution Transparency: Users and applications see a single logical database; fragmentation, replication, and location details are hidden.
• Fragmentation and Replication: Data is partitioned and/or copied across sites to improve locality and availability.
• Local Autonomy: Each site manages its own data while still participating in global queries and transactions.
Advantages:
• Fault Tolerance: With data replicated across nodes, failure of one site does not
render the entire database inoperative.
• Improved Performance: Parallel processing of queries across multiple nodes
reduces query response times.
Challenges:
• Complex Query Optimization: Must consider data location and network costs.
• Concurrency Control: Coordinating transactions across nodes can be complex
due to potential conflicts and latency.
Summary:
A DDBMS gives location-transparent access to data spread over multiple sites, trading extra coordination complexity for improved availability, scalability, and performance.
Homogeneous Distributed Databases
• Definition: All participating sites use the same DBMS software, data models, and query languages.
• Advantages:
o Simplified integration and maintenance due to uniform technology.
o Easier to optimize queries and enforce consistency as all sites follow the
same rules.
• Example: A network of branches all using the same version of Oracle or MySQL.
Heterogeneous Distributed Databases
• Definition: Different sites may use different DBMS products, data models, or query languages.
• Advantages:
o Flexibility to incorporate legacy systems or specialized databases optimized
for particular tasks.
o Can integrate best-of-breed systems from different vendors.
• Challenges:
o Integration requires middleware or translation layers to reconcile differences
in data representation, schema, and query processing.
o Query optimization and data consistency become more complex because of
the underlying heterogeneity.
Key Differences:
• Uniformity: Homogeneous systems provide uniform behavior across nodes,
whereas heterogeneous systems involve diverse environments.
• Complexity: Heterogeneous databases require additional layers (e.g., data
translation, schema mapping) to facilitate communication and integration.
• Maintenance: Homogeneous systems tend to be easier to maintain and upgrade,
while heterogeneous environments may incur higher overhead in terms of
integration and consistency enforcement.
Summary:
Homogeneous systems favor uniformity and ease of administration, while heterogeneous systems offer flexibility at the cost of extra integration and consistency overhead.
Query processing in a DBMS (and especially in distributed systems) involves several layers
that transform a user’s SQL query into an efficient execution plan. These layers include:
1. Query Parsing:
a. Function: Converts the SQL statement into an internal representation (parse
tree) and checks for syntactical and semantic correctness.
b. Output: A validated query tree.
2. Query Optimization:
a. Function: Transforms the parse tree into various equivalent query plans.
b. Techniques: Cost-based optimization, heuristic-based transformations,
and rewriting rules are used to select the most efficient plan considering
data distribution and indexes.
c. Output: An optimized query execution plan.
3. Query Execution:
a. Function: The execution engine carries out the optimized plan by performing
operations such as scans, joins, and aggregations.
b. Distributed Context: The query may be decomposed into sub-queries
executed in parallel on different nodes, with the results aggregated at a
central point.
4. Result Integration:
a. Function: Combines outputs from various nodes, handles sorting, and
presents the final result set to the user.
Diagram:
User Query
│
▼
[Parser Layer]
│
▼
[Optimization Layer]
│
▼
[Execution Layer]
│
▼
[Result Integration]
│
▼
Final Output
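A minimal sketch of the same layered flow expressed as plain functions; the parsing and optimization steps are toy placeholders, not a real optimizer:

```python
def parse(sql):
    # Parser layer: validate and build a structured representation
    tokens = sql.strip().rstrip(";").split()
    assert tokens[0].upper() == "SELECT", "syntax error"
    return {"op": "select", "tokens": tokens[1:]}

def optimize(tree):
    # Optimization layer: choose a plan (toy rule: always use an index scan)
    return {"plan": "index_scan", "tree": tree}

def execute(plan, nodes=("node1", "node2")):
    # Execution layer: fan the plan out to each node in parallel;
    # result integration then merges these partial results
    partial_results = [f"rows from {node}" for node in nodes]
    return partial_results

print(execute(optimize(parse("SELECT name FROM users;"))))
```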
Summary:
Each layer in query processing plays a crucial role—from validating and translating the
query to optimizing and executing it efficiently across distributed nodes. The layered
approach ensures modularity and allows for specialized techniques at each stage.
Concept Overview:
SQL databases use a relational model with structured schemas and tables, whereas
MongoDB is a NoSQL database that stores data in flexible, JSON-like documents. Mapping
SQL to MongoDB involves translating relational constructs into document-oriented
structures.
Mapping Elements:
• Schema Mapping:
o Tables to Collections: Each SQL table is typically mapped to a MongoDB
collection.
o Rows to Documents: Individual records (rows) in a table become
documents in the collection.
• Data Relationships:
o Joins: Relational joins are often replaced by embedding related data within a
document (denormalization) or by using references that require application-
level joins.
o Normalization vs. Denormalization: While SQL relies on normalized data to
reduce redundancy, MongoDB encourages denormalization to improve read
performance.
• Query Translation:
o SQL Queries: Standard SQL operations (SELECT, INSERT, UPDATE, DELETE) must be reinterpreted using MongoDB’s query language (see the sketch after this list).
o Aggregation Framework: Complex SQL queries involving group-by and joins
are often implemented using MongoDB’s aggregation pipeline.
• Indexing and Performance:
o Indexes: Both systems support indexing, though MongoDB’s indexing is
applied to document fields rather than table columns.
o Performance Considerations: Decisions regarding embedding versus
referencing, and handling of transactions, must be adapted for MongoDB’s
eventual consistency model if used.
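A minimal pymongo sketch of the mapping; it assumes a MongoDB server on localhost, and the database, collection, and field names are illustrative:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]          # database   ~ SQL schema
orders = db["orders"]        # collection ~ SQL table

# SQL: INSERT INTO orders (customer, amount) VALUES ('Alice', 120)
orders.insert_one({"customer": "Alice", "amount": 120})

# SQL: SELECT customer, amount FROM orders WHERE amount > 100
for doc in orders.find({"amount": {"$gt": 100}}, {"customer": 1, "amount": 1}):
    print(doc)

# SQL: SELECT customer, SUM(amount) FROM orders GROUP BY customer
pipeline = [{"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}]
print(list(orders.aggregate(pipeline)))
```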
Summary:
Mapping SQL to MongoDB replaces tables, rows, and joins with collections, documents, and embedding or referencing, so schemas and queries must be redesigned around document-oriented access patterns.
Overview:
Distributed algorithms are essential for coordinating tasks, managing resources, and
ensuring consistency across distributed systems. They are designed to handle the inherent
challenges of network delays, node failures, and concurrent operations.
Key Types:
• Consensus Algorithms:
o Purpose: Enable a group of nodes to agree on a single data value or system
state despite failures.
o Examples: Paxos and Raft. These algorithms ensure that even in the
presence of node or network failures, the system reaches a consistent
decision.
• Leader Election Algorithms:
o Purpose: Designate one node as the coordinator or leader to streamline
decision-making processes.
o Examples: Bully Algorithm and Ring Algorithm. They help in organizing nodes so that one node handles coordination tasks (a minimal election sketch follows this list).
• Mutual Exclusion Algorithms:
o Purpose: Ensure that multiple nodes do not access a shared resource
simultaneously, avoiding conflicts.
o Examples: Token Ring and Ricart-Agrawala algorithms. These algorithms are
critical for managing critical sections in a distributed environment.
• Broadcast and Multicast Algorithms:
o Purpose: Ensure that messages sent from one node are received reliably by
all (broadcast) or a specified subset (multicast) of nodes.
o Characteristics: They address issues like message ordering, reliability, and
fault tolerance.
• Distributed Snapshot Algorithms:
o Purpose: Capture a consistent global state of the system for debugging,
checkpointing, or recovery purposes.
o Examples: Chandy-Lamport algorithm, which records the state of each
node and the communication channels between them.
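A minimal, bully-style election sketch; node liveness is simulated in-process here, whereas a real implementation exchanges election messages and handles timeouts:

```python
def elect_leader(node_ids, alive):
    """Bully-style rule: the highest-numbered live node becomes coordinator."""
    candidates = [n for n in node_ids if alive.get(n, False)]
    if not candidates:
        raise RuntimeError("no live nodes to elect")
    return max(candidates)

nodes = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: True, 4: True, 5: False}  # node 5 has crashed

leader = elect_leader(nodes, alive)
print(f"Node {leader} becomes the coordinator")   # Node 4 becomes the coordinator
```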
Summary:
Different distributed algorithms are tailored to solve specific coordination and consistency
problems in distributed systems. Their selection and implementation depend on factors
such as network reliability, failure models, and the particular application requirements,
ensuring robust and fault-tolerant system operations.
Each answer above is designed to provide clear, in-depth explanations with definitions,
mechanisms, examples, and challenges that are crucial for a high-scoring response in
exam settings.