Acmp 351
SCHOOL OF SCIENCE
COMPUTER SCIENCE DEPARTMENT
COURSE OUTLINE
Course code: ACMP 351: FILE STRUCTURES AND DATABASE SYSTEMS
Credit Hours: 3.0 (45 Lecture Hours)
Lecturer: Dr Masese Nelson
Cell phone: 0737070029
Course Objectives
1. Students should understand general file structures and database systems
2. Students should know the emerging database technologies in database systems
3. Students should be able to describe file structures and indexing
4. Students should be familiar with concurrency control and serializability
5. Students should know query processing algorithms
Course Content:
Weeks Detailed course content
Week 1 Implementation of database systems
Week 2 Introduction to emerging database technologies in database systems
Week 3 Information retrieval and spatial databases
Week 4 File structures and indexing
Week 5 Analytical details of file structures; efficiency in query evaluation
Week 6 CAT 1 & presentations
Week 7 Concept of transaction; concurrency control and serializability
Week 8 Backup and recovery for databases; principles of database organization and querying
Week 9 Query optimization
Week 10 Implementation and retrieval; query processing algorithms
Week 11 Serializability in database systems
Week 12 CAT 2 & presentations
Week 13 Locking and timestamping for achieving concurrency control
Week 14 Concurrency control in DBMS
Week 15 Exams
Teaching Methodologies
Lectures, Class discussion, and Case studies
Instruction Materials/Equipment
Handouts, Chalkboard, equipped Computer lab and videotapes
Assessment Strategy:
Class Test 1 10%
Class Test 2 10%
Assignments 10%
End-of-semester examination 70%
Total 100%
Method of Re-assessment:
Students failing this module will be re-assessed by supplementary examination.
References
1) Mary E. S. Loomis, Data Management and File Structures, 2nd ed., 2019.
2) Advanced Database Systems, 2nd ed., Prentice Hall.
Implementing a database system involves several key steps and considerations to ensure that the
system is efficient, reliable, and scalable. Here's a general approach to implementing a database
system:
1. Requirements Analysis
Identify Data Needs: Understand the type of data the system will store (e.g., user
information, transactions, logs).
Determine Functional Requirements: Define how the system will interact with the data
(CRUD (create, read, update, delete) operations, querying, reporting).
Scalability Needs: Consider the expected growth in data volume, read/write operations,
and system performance over time.
2. Database Design
Logical Design:
o Entity-Relationship Diagram (ERD): Create an ERD to represent entities,
attributes, and relationships.
Physical Design:
o Table Structure: Design tables with appropriate data types, primary keys, foreign
keys, indexes, and constraints.
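The physical design step above can be sketched with SQLite; the departments/employees tables, their columns, and the idx_employees_dept index are illustrative assumptions, not part of the course material:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

# Parent table with a primary key and a uniqueness constraint.
conn.execute("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    )
""")
# Child table with a foreign key enforcing referential integrity.
conn.execute("""
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES departments(id)
    )
""")
# Indexing the foreign key speeds up joins and per-department lookups.
conn.execute("CREATE INDEX idx_employees_dept ON employees(dept_id)")

conn.execute("INSERT INTO departments (id, name) VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employees (name, dept_id) VALUES ('Alice', 1)")

# The constraint rejects a row referencing a department that does not exist.
fk_violation = False
try:
    conn.execute("INSERT INTO employees (name, dept_id) VALUES ('Bob', 99)")
except sqlite3.IntegrityError:
    fk_violation = True
print("FK violation caught:", fk_violation)
```

The same DDL ideas (keys, constraints, indexes) carry over to MySQL, PostgreSQL, or Oracle with minor syntax differences.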
3. Database Selection
Relational Databases: Choose systems like MySQL, PostgreSQL, or Oracle for
structured data with defined relationships.
NoSQL Databases: Choose databases like MongoDB, Cassandra, or DynamoDB for
unstructured or semi-structured data.
Hybrid Solutions: Use systems that offer both SQL and NoSQL capabilities, such as
Azure Cosmos DB or Couchbase.
4. Data Modeling
Views and Stored Procedures: Define reusable database objects like views, stored
procedures, and triggers to encapsulate logic and improve maintainability.
5. Database Implementation
Insert Data: Populate the database with initial data using SQL scripts or ETL tools.
Security Configuration: Set up access controls, encryption (both in transit and at rest),
and roles to ensure data protection.
Backup and Recovery: Implement backup strategies, periodic data snapshots, and
recovery plans to avoid data loss.
6. Query Optimization
Analyze Queries: Use EXPLAIN plans or profiling tools to monitor and optimize query
performance.
Indexing Strategy: Adjust indexing based on query patterns to speed up reads and
reduce disk I/O.
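As a minimal illustration of plan analysis, the sketch below uses SQLite's EXPLAIN QUERY PLAN (the employees table and idx_dept index are invented for the example); MySQL and PostgreSQL expose the same idea through their EXPLAIN statements:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)")
conn.executemany("INSERT INTO employees (name, department) VALUES (?, ?)",
                 [("e%d" % i, "Engineering" if i % 2 else "Sales") for i in range(100)])

query = "SELECT name FROM employees WHERE department = ?"

# Without an index, the planner must scan the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query, ("Engineering",)).fetchall()
plan_before = before[0][-1]

# After adding an index on the filtered column, the planner switches to an index search.
conn.execute("CREATE INDEX idx_dept ON employees(department)")
after = conn.execute("EXPLAIN QUERY PLAN " + query, ("Engineering",)).fetchall()
plan_after = after[0][-1]

print(plan_before)
print(plan_after)
```

Comparing the two plan strings shows the full scan being replaced by a search using idx_dept, which is exactly the adjustment the indexing strategy above describes.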
7. Security Considerations
Access Control: Apply role-based access control (RBAC) and least privilege principles.
Backup and Recovery Tools: AWS RDS, Google Cloud SQL, and other cloud-based or
on-prem solutions.
DBMS EXECUTION AND PARSING
The execution and parsing process in a Database Management System (DBMS) is a critical
part of how queries (typically SQL) are interpreted, optimized, and executed. This process
involves several stages, from parsing a query to executing it and returning the results. Let's break
down the key components:
1. Query Parsing
When a user submits a query to the DBMS (e.g., an SQL query), the first step is parsing.
a. Lexical Analysis:
The DBMS scans the query for keywords, identifiers (like table names and column
names), and symbols (like =, >, etc.).
This process identifies different tokens in the query to ensure the query structure is
recognizable.
b. Syntax Analysis:
The DBMS checks whether the query follows the SQL language's syntax rules.
For instance, it ensures that the clauses (SELECT, FROM, WHERE, etc.) are in the
correct order, and the query is properly structured.
c. Semantic Analysis:
In this stage, the DBMS ensures that the query makes logical sense.
It checks if the tables and columns referenced actually exist in the database and if the data
types of the columns match the operations being performed.
The semantic analysis also ensures that the user has the necessary permissions to execute
the query.
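The lexical-analysis stage can be sketched as a toy tokenizer; the keyword list and token categories below are simplified assumptions, not a full SQL lexer:

```python
import re

# Minimal keyword set for the sketch; a real lexer knows the full SQL grammar.
KEYWORDS = {"SELECT", "FROM", "WHERE", "ORDER", "BY"}
TOKEN_RE = re.compile(r"\s*(?:(?P<lit>'[^']*')|(?P<word>\w+)|(?P<sym>[=<>,;*]))")

def tokenize(sql):
    """Split a query string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise ValueError("unrecognized input at position %d" % pos)
        pos = m.end()
        if m.group("lit"):
            tokens.append(("literal", m.group("lit")))
        elif m.group("word"):
            word = m.group("word")
            kind = "keyword" if word.upper() in KEYWORDS else "identifier"
            tokens.append((kind, word))
        else:
            tokens.append(("symbol", m.group("sym")))
    return tokens

print(tokenize("SELECT name FROM employees WHERE department = 'Engineering'"))
```

Syntax analysis would then check that this token stream matches the SQL grammar, and semantic analysis would look the identifiers up in the catalog.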
2. Query Optimization
Once a query is parsed and validated, the DBMS moves into the query optimization phase.
a. Logical Optimization:
The DBMS rewrites the query into equivalent logical forms (for example, pushing
selections and projections closer to the data) and uses a cost model to estimate the
resource cost (CPU, memory, I/O) of the candidate execution plans.
b. Physical Optimization:
The DBMS decides the best physical execution plan based on factors like available
indexes, disk layout, and the statistical distribution of the data.
Indexes, caching, and partitioning are considered to reduce the execution time of the
query.
The optimizer generates an execution plan, which is a series of steps that describe how
the DBMS will execute the query.
This plan includes details like the order of operations (e.g., which joins to perform first),
whether to use an index or full table scan, and how to sort or group results.
3. Query Execution
After optimization, the query is ready for execution.
a. Access Methods:
The DBMS uses different access methods to retrieve data, such as sequential scans,
index scans, or bitmap scans, depending on the execution plan.
For example, if an index exists for a column used in a WHERE clause, the DBMS may
use an index scan to quickly locate the relevant rows.
b. Join Processing:
For queries that involve joins between tables, the DBMS chooses the most efficient join
algorithm (e.g., nested-loop join, hash join, or merge join).
The choice depends on the size of the tables, the availability of indexes, and the data
distribution.
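As an illustration of one of these algorithms, here is a minimal hash join: build a hash table on one input, then probe it with each row of the other (the table contents are invented sample data):

```python
def hash_join(left, right, key_left, key_right):
    """Join two lists of dict rows on the given key columns."""
    # Build phase: index the left (ideally smaller) input by its join key.
    table = {}
    for row in left:
        table.setdefault(row[key_left], []).append(row)
    # Probe phase: for each right row, emit a merged row per match.
    result = []
    for row in right:
        for match in table.get(row[key_right], []):
            result.append({**match, **row})
    return result

departments = [{"dept_id": 1, "dept": "Engineering"}, {"dept_id": 2, "dept": "Sales"}]
employees = [{"name": "Alice", "dept_id": 1}, {"name": "Bob", "dept_id": 2},
             {"name": "Carol", "dept_id": 1}]
joined = hash_join(departments, employees, "dept_id", "dept_id")
print(joined)
```

A nested-loop join would instead compare every pair of rows; the optimizer prefers a hash join when one input fits in memory and there is no useful index.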
c. Sorting and Aggregation:
If the query involves sorting (ORDER BY), grouping (GROUP BY), or aggregating data,
the DBMS processes these operations in memory if possible or uses disk-based sorting if
necessary.
d. Fetching Data:
The DBMS retrieves the data based on the execution plan and performs any final
operations, such as filtering or projecting specific columns.
The DBMS may apply caching mechanisms to store frequently accessed data, reducing
future query execution times.
4. Returning Results
After the execution plan is completed, the result set is formatted and returned to the client
application or user.
If the query was a SELECT statement, the result would be a table of rows. For queries
like INSERT, UPDATE, or DELETE, the result would be an acknowledgment of how
many rows were affected.
Example: consider the following query.

SELECT name FROM employees WHERE department = 'Engineering' ORDER BY name;
1. Parsing:
o Lexical Analysis: The query is broken into tokens: SELECT, name, FROM,
employees, etc.
o Syntax Analysis: The DBMS checks if the syntax is correct, ensuring clauses are
in the proper order.
o Semantic Analysis: The DBMS checks if the employees table and the name and
department columns exist.
2. Optimization:
o It evaluates the cost of a full table scan vs. an index scan to filter rows where
department = 'Engineering'.
o It generates an execution plan that uses an index scan (if the index exists), then
sorts the resulting rows by the name column.
3. Execution:
o The DBMS performs the index scan to retrieve rows where department =
'Engineering'.
o It fetches the name column for the selected rows.
o The DBMS sorts the rows by name and prepares the result.
4. Returning Results:
o The sorted list of employee names is returned to the client.
Key components involved in this process:
Parser: Validates and converts the SQL query into an internal representation.
Optimizer: Finds the most efficient way to execute the query.
Execution Engine: Carries out the plan generated by the optimizer, retrieves the
necessary data, and performs the required operations.
Storage Manager: Manages the physical storage of the data, handling file systems,
memory buffers, and disk I/O.
Statistics: Information about table sizes and value distributions that the optimizer
consults when estimating the cost of candidate plans.
EMERGING DATABASE TECHNOLOGIES
1. Graph databases
Graph databases are a type of NoSQL database that store data as nodes and edges, representing
entities and relationships. Graph databases are ideal for modeling complex, interconnected, and
dynamic data, such as social networks, recommendation systems, fraud detection, and
knowledge graphs. Graph databases offer high performance, scalability, and flexibility for
querying and analyzing data using graph algorithms and languages, such as Cypher and Gremlin.
2. Time-series databases
Time-series databases are a type of database that store and process data that are indexed by time,
such as sensor readings, stock prices, web analytics, and IoT events. Time-series databases are
designed to handle high ingestion rates, large volumes, and temporal queries and analysis, such
as aggregation, interpolation, anomaly detection, and forecasting. Time-series databases can also
integrate with other data sources and tools, such as streaming platforms, machine learning
frameworks, and visualization dashboards.
3. Cloud-native databases
Cloud-native databases are a type of database that are built for and run on cloud platforms,
leveraging the cloud services and features, such as elasticity, availability, security, and
scalability. Cloud-native databases can support various data models, workloads, and
architectures, such as relational, NoSQL, hybrid, serverless, and distributed. Cloud-native
databases can also benefit from the cloud ecosystem, such as integration, automation,
monitoring, and backup.
4. Multi-model databases
Multi-model databases are a type of database that support multiple data models, such as
relational, document, graph, key-value, and columnar, within a single database system.
Multi-model databases enable you to store and access different types of data without compromising on
performance, consistency, or functionality. Multi-model databases can also simplify the data
management and development process, as you do not need to use multiple databases or tools for
different data needs.
5. NewSQL databases
NewSQL databases are a type of database that combine the features and benefits of both
relational and NoSQL databases, such as ACID transactions, SQL compatibility, horizontal
scalability, and high availability. NewSQL databases aim to address the limitations and
trade-offs of traditional and NoSQL databases, especially for OLTP (online transaction processing)
workloads that require fast and reliable data processing. NewSQL databases can also leverage
the cloud infrastructure and modern architectures, such as microservices and containers.
6. Quantum databases
Quantum databases are a type of database that use quantum computing and quantum mechanics
to store and manipulate data. Quantum databases are still in the research and development stage,
but they have the potential to offer unprecedented speed, security, and efficiency for data
operations, such as encryption, search, and optimization. Quantum databases can also handle
complex and large-scale data problems, such as machine learning, cryptography, and simulation.
A SPATIAL DATABASE
A spatial database is a type of database that is optimized for storing, querying, and managing
spatial or geographical data. These databases handle data related to objects in a
two-dimensional or three-dimensional space, such as points, lines, and polygons. Spatial databases
are widely used in applications involving geographical information systems (GIS),
location-based services, urban planning, navigation, and environmental modeling.
3. Spatial Indexing:
o Spatial indexes are used to speed up the retrieval of spatial data by efficiently
locating objects in space.
4. Spatial Queries: Spatial databases extend standard SQL with spatial functions, allowing
queries to interact with spatial objects.
Proximity Search: Find all objects within a certain distance from a given
point (e.g., find all restaurants within 5 km of a user’s location).
Intersection: Identify objects that intersect with a given shape (e.g., find
all parks that overlap a specific neighborhood).
5. Geospatial Functions:
o ST_Buffer(): Creates a buffer area around a geometry (e.g., finding a zone within
a certain radius).
6. Data Formats: Spatial databases often interact with spatial data in formats such as:
o Shapefile: A popular file format for geospatial vector data used in GIS software.
Common spatial database systems:
1. PostGIS (PostgreSQL):
o It supports a wide range of geospatial data types, spatial indexing (R-tree, GiST
(Generalized Search Tree)), and functions for querying and analyzing geographic
data.
2. MySQL Spatial:
o MySQL offers spatial data support through its spatial extensions, allowing the use
of spatial data types like POINT, LINESTRING, and POLYGON.
o It includes spatial indexing (R-tree for InnoDB) and spatial functions for querying
and manipulation.
3. Oracle Spatial:
o Oracle Spatial provides advanced support for storing, managing, and querying
spatial data.
o It offers spatial indexing and a variety of geospatial functions for operations like
intersection, distance calculation, and transformations.
Applications of spatial databases:
1. GIS Applications:
o GIS applications use spatial queries to perform proximity searches, mapping,
and spatial data analysis.
2. Location-Based Services:
o Mobile apps and web services that provide location-based recommendations (e.g.,
food delivery, navigation, or ridesharing) use spatial databases to manage
geographical data and perform location-based searches.
3. Urban Planning:
o Spatial databases are used by city planners to manage land use, zoning,
infrastructure planning, and environmental monitoring.
o For example, a spatial query can be used to analyze the distribution of green
spaces within a city.
4. Environmental Monitoring:
o Spatial databases store data from remote sensing, tracking environmental changes
such as deforestation, air quality, or climate models.
5. Navigation Systems:
o Navigation and routing applications rely on spatial data to represent road
networks and points of interest and to compute routes.
6. Real Estate:
o Real estate platforms use spatial databases to track properties, their locations, and
their proximity to amenities like schools, parks, and transport hubs.
o Spatial queries allow users to search for properties within certain geographic
boundaries.
Aspect         | Spatial Data                                | Non-Spatial Data
Representation | Points, lines, polygons in 2D/3D space      | Numbers, strings, dates, etc.
Storage        | Specialized spatial storage and indexing    | Standard database tables and indexing
Queries        | Spatial queries (e.g., distance, proximity) | Attribute-based queries (e.g., filter, sort)
Both spatial and non-spatial data are critical depending on the context, with spatial data being
essential for geographical analysis, while non-spatial data serves in descriptive, statistical, or
transactional roles.
Imagine a city planning office that needs to analyze the city's zoning areas, locations of parks,
and residential areas. The database contains spatial data in the form of polygons (zoning districts,
parks, residential areas) and points (buildings, fire hydrants, etc.).
-- Create a table to store zones with a spatial geometry column
CREATE TABLE city_zones (
id SERIAL PRIMARY KEY,
name VARCHAR(255),
zone_type VARCHAR(50),
geom GEOMETRY(Polygon, 4326) -- polygon geometry, spatial reference system EPSG:4326
);
-- Inserting a zoning district as a polygon
INSERT INTO city_zones (name, zone_type, geom)
VALUES ('Downtown', 'Commercial', ST_GeomFromText('POLYGON((-71.064544 42.28787,
-71.060174 42.290673, -71.057568 42.28794, -71.064544 42.28787))', 4326));
Find the distance between two points: If you have two buildings or locations stored as
points, you can compute the distance between them.
SELECT ST_Distance(
ST_GeomFromText('POINT(-71.0589 42.285)', 4326),
ST_GeomFromText('POINT(-71.0621 42.292)', 4326)
) AS distance;
Find parks within a specific zoning area: If you want to find all the parks within the
"Downtown" zoning area, you can use spatial relationships.
SELECT p.name
FROM city_parks p, city_zones z
WHERE z.name = 'Downtown'
AND ST_Contains(z.geom, p.geom);
Buffer and proximity search: If you want to find all the buildings within a 500-meter
radius of a park.
SELECT b.name
FROM city_buildings b, city_parks p
WHERE ST_DWithin(b.geom, p.geom, 500); -- distance in the geometry's SRS units (cast to geography for meters)
Check if two polygons overlap: If you want to know if a park boundary overlaps with a
residential area.
SELECT z.name, p.name
FROM city_zones z, city_parks p
WHERE z.zone_type = 'Residential'
AND ST_Overlaps(z.geom, p.geom);
Spatial databases use special indexes, such as R-trees, to improve the performance of spatial
queries.
-- Create a spatial index on the geometry columns
CREATE INDEX idx_zones_geom ON city_zones USING GIST (geom);
CREATE INDEX idx_parks_geom ON city_parks USING GIST (geom);
These indexes allow efficient querying of spatial data, particularly for large datasets.
INFORMATION RETRIEVAL
In Database Management Systems (DBMS), Information Retrieval (IR) refers to the process
of obtaining relevant information from a collection of data based on a user's query. The goal of
IR in the context of DBMS is to help users find data stored in a database that matches their
informational needs.
Data Retrieval focuses on retrieving exact matches for a structured query (like using
SQL to retrieve rows from a table).
Information Retrieval focuses on retrieving documents or records that are relevant to a
user's query, often from unstructured or semi-structured data (e.g., full-text search in
documents).
Information Retrieval (IR) refers to the process of obtaining relevant information from large
collections of unstructured or semi-structured data, such as text documents, images, or
multimedia files, based on user queries. It is widely used in systems like search engines, digital
libraries, and databases where the goal is to retrieve the most relevant information in response to
a user's need.
1. Key Concepts
Document: The unit of data in the collection that is being searched. It could be a web
page, PDF, email, multimedia file, etc.
Query: The user’s expression of their information need. Queries can be natural language
text or structured searches based on keywords.
2. Retrieval Models
Retrieval models are the mathematical frameworks used to compute the relevance between a
query and documents in the collection.
Boolean Model: A simple model where documents are retrieved if they exactly match
the boolean expression in the query (using AND, OR, NOT).
Vector Space Model (VSM): Documents and queries are represented as vectors in a
multidimensional space. The relevance is calculated based on the cosine similarity
between the document and query vectors.
Probabilistic Models: These models estimate the probability that a document is relevant
to a given query. The most well-known probabilistic model is the BM25 algorithm.
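The vector space model can be sketched with raw term-frequency vectors and cosine similarity (real systems typically add tf-idf weighting; the documents below are made up):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": "database systems store structured data",
    "d2": "file structures organize data on disk",
    "d3": "cooking recipes for pasta",
}
query = "database file structures"

# Represent documents and the query as term-frequency vectors, then rank.
qvec = Counter(query.split())
ranked = sorted(docs, key=lambda d: cosine(qvec, Counter(docs[d].split())), reverse=True)
print(ranked)
```

The document sharing the most query terms ranks first, while the off-topic one scores zero, which is exactly the behavior the model promises.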
3. Indexing
To efficiently retrieve information, data is preprocessed into an index, which enables fast
querying. Common types of indexes include:
Inverted Index: A structure that maps terms (words) to the documents they appear in,
allowing for efficient search.
Positional Index: An inverted index that also stores the position of each word in a
document, enabling phrase searches.
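A minimal inverted index and an AND query over its posting lists can be sketched as follows (the sample documents are invented):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def and_query(index, *terms):
    """Boolean AND: intersect the posting lists of all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "file structures and indexing", 2: "database indexing basics",
        3: "file systems overview"}
index = build_inverted_index(docs)
print(and_query(index, "file", "indexing"))
```

Because each term's posting list is precomputed, the query only touches the documents containing its terms instead of scanning the whole collection.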
4. Ranking
Ranking is the process of ordering retrieved documents based on their relevance to the query.
The goal is to present the most useful results at the top of the search results. Some ranking
techniques include:
PageRank: A link-based ranking algorithm (used by Google) that ranks documents (web
pages) based on how many other documents link to them.
5. Evaluation Metrics
Precision: The proportion of retrieved documents that are relevant to the query.
Recall: The proportion of relevant documents that are successfully retrieved by the
system.
F1-Score: The harmonic mean of precision and recall, used to balance the two metrics.
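These three metrics can be computed directly; the retrieved and relevant document ids below are hypothetical:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The system returned 4 documents, 3 of which are among the 6 relevant ones.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"],
                               ["d1", "d2", "d3", "d5", "d6", "d7"])
print(p, r, f1)  # precision 0.75, recall 0.5, F1 0.6
```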
6. Relevance Feedback
This is a technique where the system improves future retrieval results based on feedback from
users. Users can mark certain documents as relevant or not relevant, and the system refines its
rankings based on this input.
7. Latent Semantic Analysis (LSA)
A technique to capture the underlying meaning in documents and queries by reducing the
dimensionality of the term-document matrix using Singular Value Decomposition (SVD). It
helps in addressing the issue of synonyms and polysemy (words with multiple meanings).
8. Query Expansion
To improve retrieval results, IR systems can automatically expand queries by adding synonyms
or related terms to the original query.
9. Natural Language Processing (NLP) in IR
NLP techniques are increasingly used in IR to handle unstructured text and understand user
intent. This involves:
Named Entity Recognition (NER): Identifying entities such as people, places, dates,
etc., within a text.
Applications of IR:
Search Engines: The most common example where IR principles are applied at scale.
Digital Libraries: Systems for academic papers, books, and research documents.
Challenges in IR:
Synonyms and Ambiguity: Dealing with words that have multiple meanings or different
words that mean the same thing.
Evaluation: Measuring the effectiveness of the retrieval process using real-world data.
In summary, Information Retrieval is about finding the right data from large datasets, often
unstructured, and ranking it effectively to match a user’s query. Techniques like indexing,
ranking algorithms, and feedback mechanisms ensure that systems can provide relevant and
accurate results to the users.
FILE STRUCTURES AND INDEXING
File Structures and Indexing are fundamental concepts in database management and
information retrieval systems. These techniques ensure efficient data storage, access, and
retrieval, optimizing the performance of systems when handling large datasets.
1. File Structures
File structures refer to the ways data is physically stored and organized on storage devices such
as hard drives or SSDs. The design of file structures affects the performance of data retrieval
operations.
1. Sequential File Structure: Data is stored in a linear sequence. This structure is simple
and works well for batch processing. However, sequential access can be slow if the
dataset is large, and random access is inefficient.
o Pros: Simple to implement and manage.
2. Heap (Unordered) File Structure: Records are placed wherever there is available space.
This is the simplest file structure. Insertions are fast, but searching for a record requires a
full scan.
o Pros: Efficient for inserts.
o Use Case: Suitable when insertions are frequent, but searches are less common.
3. Hashed File Structure: A hash function is used to compute the storage location of
records based on key values. This provides constant time complexity for search,
insertion, and deletion operations, as long as there are no hash collisions.
o Pros: Fast access to specific records based on a key.
o Cons: Hash collisions can degrade performance; not suitable for range queries.
o Use Case: Ideal for situations where exact record lookups are frequent.
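A sketch of a hashed file with chained buckets (the bucket count and record values are arbitrary choices for illustration):

```python
NUM_BUCKETS = 8

def bucket_of(key):
    """Hash function mapping a key to one of the fixed buckets."""
    return hash(key) % NUM_BUCKETS

class HashedFile:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key, record):
        # Colliding keys are simply chained within the same bucket.
        self.buckets[bucket_of(key)].append((key, record))

    def lookup(self, key):
        # Only one bucket is scanned, so lookup cost is independent of
        # file size as long as buckets stay short.
        for k, record in self.buckets[bucket_of(key)]:
            if k == key:
                return record
        return None

f = HashedFile()
f.insert(1001, "Alice")
f.insert(1002, "Bob")
print(f.lookup(1002))
```

Note that a range query such as "all keys between 1001 and 1005" gains nothing here, since hashing scatters nearby keys across buckets.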
4. Indexed File Structure: This structure maintains an index to accelerate data retrieval.
Records are stored sequentially, and an index file is used to quickly locate records.
A simple example index mapping each key to its block:
Key  | Block
1001 | 1
1002 | 2
1003 | 3
1004 | 4
1005 | 5
Direct access: The index allows the system to directly access the file location where the
record is stored, bypassing the need for a sequential search.
5. Clustered File Structure: Records that are often accessed together (like related records
from different tables) are stored near each other. This minimizes disk I/O operations
when retrieving related records.
o Pros: Faster access to related records.
o Use Case: Database systems that need to support frequent join operations.
2. Indexing
Indexing is a technique used to speed up the retrieval of records from a database by creating
additional data structures (indices) that allow quick access to records. The goal of an index is
similar to a book index—it enables direct access to specific data without scanning the entire file.
Types of Indexing:
1) Primary Index: Created on the primary key of a table. Each record has a unique value,
and the index is usually based on this key. The primary index is often a dense index,
where every record in the database has a corresponding index entry.
o Pros: Efficient lookup of unique records.
2) Secondary Index: Built on non-key fields that may contain duplicate values,
providing an additional access path to the data.
o Use Case: Queries that involve searching by non-primary fields like names or
dates.
3) Dense Index: Contains an index entry for every record in the file. Every key in the index
points directly to a record.
4) Sparse Index: Only a subset of the records are indexed, with the index pointing to blocks
of records. The actual record is then found by sequentially searching within the block.
o Cons: Slightly slower than dense indexing since sequential searching is needed
within blocks.
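A sketch of a sparse index: one entry per block (the block's first key), a binary search in the index, then a sequential scan inside the chosen block (the block contents are sample data):

```python
import bisect

# A sorted file split into blocks of (key, value) records.
blocks = [
    [(1001, "Alice"), (1002, "Bob"), (1003, "Carol")],
    [(1004, "Dan"), (1005, "Eve"), (1006, "Frank")],
    [(1007, "Grace"), (1008, "Heidi")],
]
# Sparse index: only the first key of each block is indexed.
index_keys = [block[0][0] for block in blocks]

def lookup(key):
    # Binary search in the sparse index to pick the right block...
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    # ...then scan sequentially within that one block.
    for k, value in blocks[i]:
        if k == key:
            return value
    return None

print(lookup(1005))
```

With three index entries covering eight records, the index stays small at the cost of a short in-block scan, which is exactly the dense/sparse trade-off described above.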
5) Multilevel Index: An index where large indices are divided into smaller, more
manageable parts. For example, the first level might index the second level, and the
second level indexes the actual records.
6) Clustered Index: Determines the physical order of the rows, so records with nearby
key values are stored together on disk.
o Use Case: When queries involve range scans (e.g., date ranges).
7) Non-Clustered Index: This index doesn't affect the physical order of the rows. Instead,
it maintains a logical order of records and contains pointers to the actual data.
Common indexing structures:
1) B-Trees and B+ Trees: Balanced tree data structures commonly used in databases
and file systems to store and retrieve data efficiently, particularly when dealing with
large datasets that do not fit entirely in memory. Both structures keep data sorted and
support efficient insertion, deletion, and search operations. B-trees index both keys
and records, while B+ trees store all records in leaf nodes, with only keys in the
internal nodes.
2) Hash Index: Uses a hash function on the key value to locate records, giving
constant-time exact-match lookups (see the hashed file structure above).
3) Bitmap Index: A compact, bitwise index used for columns with a limited range of values
(e.g., gender, binary fields). It uses a bitmap for each value of the column to indicate
which rows contain that value.
o Pros: Very efficient for columns with a small number of distinct values.
o Cons: Not suitable for columns with high cardinality (many distinct values).
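A bitmap index can be sketched with Python integers used as bit sets; the gender column values below are sample data:

```python
# A low-cardinality column: one value per row.
rows = ["F", "M", "F", "F", "M", "F"]

# Build one bitmap per distinct value: bit i is set when row i holds that value.
bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_rows(bitmap):
    """Decode a bitmap back into the list of matching row positions."""
    return [i for i in range(len(rows)) if bitmap >> i & 1]

print(matching_rows(bitmaps["F"]))
```

Combining predicates becomes a single bitwise operation (e.g., `bitmaps["F"] | bitmaps["M"]` for an OR), which is why bitmap indexes are so effective on low-cardinality columns.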
The choice of file structures and indexing strategies depends on various factors:
Query patterns: Whether queries involve exact match lookups, range queries, or both.
Data modification: Whether the dataset is mostly static (read-heavy) or frequently
updated (write-heavy).
Summary
File Structures determine how data is physically organized in storage, affecting access
efficiency. Common file structures include sequential, heap, hashed, indexed, and
clustered file structures.
Indexing provides a mechanism to quickly locate and access data. Various types of
indexes (primary, secondary, dense, sparse, etc.) optimize different query types and data
access patterns.
Common indexing structures like B-trees, hashing, and bitmap indexes provide efficient
access, depending on the dataset and query requirements.
Choosing the right file structure and indexing technique is crucial for optimizing database
performance, especially for large datasets.