
Relational Database Principles

12 Key Principles of Database Design

#1: Avoid Redundancy

Redundant information in a database schema can cause several


problems. It can lead to inconsistencies: if the same data is stored in
multiple tables, there is a risk that it will be updated in one table and
not in the others – generating discrepancies in the results.

Redundancy also unnecessarily increases the storage space required by


the database and can negatively affect query performance and data
update operations. The normalization principle (see below) is used to
eliminate redundancy in database models.

As an example of unwanted redundancy in a data model, suppose the customer_name column (dependent on customer_id) appears in both the customer table and the order table. This creates the risk that it contains different information in each table.
An exception to the principle of non-redundancy occurs in dimensional
schemas, which are used for analytical processing and data
warehousing. In dimensional schemas, a certain degree of data
redundancy can be used to reduce complexity in combined queries.
Otherwise, such queries would be more resource intensive.

As an example of a dimensional schema, the columns CustomerName and ProductDescription can be repeated in the SalesFact table to avoid the need to join tables when querying these columns.

#2: Primary Keys and Unique Identifiers


Every table must have a PRIMARY KEY. The primary key is essential
because it guarantees the uniqueness of each row within the table. In
addition, the primary key is used to establish relationships with other
tables in the database.

Without a primary key, a table would have no reliable way to identify


individual records. This can lead to data integrity problems, issues with
query accuracy, and difficulties in updating that table. If you leave a
table without a primary key in your schema, you run the risk of that
table containing duplicate rows that will cause incorrect query results
and application errors.

On the other hand, primary keys make it easier to interpret your data
model. By seeing the primary keys of every table in an ENTITY-
RELATIONSHIP DIAGRAM (ERD), the programmer writing a query will
know how to access each table and how to join it with others.
Having a primary key in each table ensures that relationships can be
maintained between tables.
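
As a minimal sketch of this principle (table and column names are hypothetical), a primary key is declared when the table is created:

CREATE TABLE customer (
    customer_id   INT          NOT NULL,   -- unique identifier for each row
    customer_name VARCHAR(100) NOT NULL,
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

Any attempt to insert a second row with the same customer_id will be rejected by the database engine, which is exactly the uniqueness guarantee described above.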

#3: Null Values

In relational databases, null values indicate unknown, missing, or non-


applicable data. When defining each column in a table, you must
establish whether it supports null values. You should only allow the
column to support null values if you’re certain that, at some point, the
value of that column may be unknown, missing, or not applicable.

It is also important to differentiate null values from “empty” values,


such as the number zero or a zero-length string. Read these tips
on HOW TO MAKE GOOD USE OF NULLABLE COLUMNS for more
information.
Null values have certain peculiarities:

 Primary keys can never store null values.

 A null value can be applied to columns of any data type (as long as
the column supports null values).

 Null values are ignored in unique and foreign key constraints.

 In SQL, a null value is different from any value – even from another null value.

 Any SQL operation involving a null value will result in another null value. The exception is aggregate functions such as SUM(), which simply ignore null values rather than treating them as data (see the sketch below).
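
A brief illustration of these rules (the orders table and its discount column are hypothetical; exact syntax varies slightly by database):

SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'unknown' END;  -- returns 'unknown': a null is not equal even to another null
SELECT 10 + NULL;                                              -- returns NULL: operations involving a null yield null
SELECT SUM(discount) FROM orders;                              -- rows with a NULL discount are ignored by the aggregate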

#4: Referential Integrity

Referential integrity guarantees that columns involved in the


relationship between two tables (e.g. primary and foreign keys) will
have shared values. This means that the values of the columns in the
secondary (child or dependent) table must exist in the corresponding
columns of the primary (parent) table.

Furthermore, referential integrity requires that such column values are


unique in the primary table – ideally, they should constitute the
primary key of the primary table. In the secondary table, the columns
involved in the relationship must constitute a foreign key. Read WHAT
IS A FOREIGN KEY? for more information.

By establishing relationships between tables using foreign keys and


primary keys, we make use of the database engine’s resources to
ensure data integrity. This also improves the performance of queries
and other database operations. Foreign and primary keys facilitate the
creation of indexes to speed up table lookups; we’ll discuss this more in
the indexing section of this article.

In a booking model, for example, referential integrity ensures that each booking is associated with a passenger and a room. It also ensures a room type is specified for each room.
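
A partial sketch of that example (room and room type only; table names and data types are assumptions):

CREATE TABLE room_type (
    room_type_id INT PRIMARY KEY,
    type_name    VARCHAR(50) NOT NULL
);

CREATE TABLE room (
    room_id      INT PRIMARY KEY,
    room_type_id INT NOT NULL,
    -- every room must reference an existing room type
    CONSTRAINT fk_room_type FOREIGN KEY (room_type_id) REFERENCES room_type (room_type_id)
);

CREATE TABLE booking (
    booking_id INT PRIMARY KEY,
    room_id    INT NOT NULL,
    -- every booking must reference an existing room
    CONSTRAINT fk_booking_room FOREIGN KEY (room_id) REFERENCES room (room_id)
);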

#5: Atomicity

Sometimes, you may feel tempted to have a single column for complex
or compound data. For example, the table below stores complete
addresses and full names in single fields:
customer_no customer_name customer_address

1001 Kaelyn Hodge 5331 Rexford Court, Montgomery AL 36116

2050 Brayden Huang 4001 Anderson Road, Nashville TN 37217

4105 Louis Kelly 2325 Eastridge Circle, Moore OK 73160

3412 Jamarion Dawson 4016 Doane Street, Fremont CA 94538

In this example, the customer_address column clearly violates the


atomicity principle, since it stores composite data that could be divided
into smaller pieces. The customer_name field also violates this
principle, as it stores first and last names in the same value.

If you combine more than one piece of information in one column, it


will be difficult to access individual data later. For example, what if you
wanted to address the customer by their first name or see which
customers live in the state of Oklahoma? The query would get very
complicated!

Try to divide the information into logical parts; in our example, you
could create separate fields for first name and last name and for
address, city, state, and postal code. This principle is made explicit in
the first normal form, which we’ll discuss next.
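
A sketch of the customer table above redesigned with atomic columns (hypothetical column names and sizes):

CREATE TABLE customer (
    customer_no INT PRIMARY KEY,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    street      VARCHAR(100),
    city        VARCHAR(50),
    state       CHAR(2),
    postal_code VARCHAR(10)
);

-- Atomic columns make questions like "which customers live in Oklahoma?" trivial:
SELECT first_name, last_name FROM customer WHERE state = 'OK';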

#6: Normalization

In any data model for transactional applications or processes – such as


an online banking or e-commerce site – it is crucial to avoid anomalies
in the processes of inserting, updating, or deleting data.
This is where normalization techniques are applied; they seek to
eliminate clutter, disorganization, and inconsistencies in data models.
Normalization is a formal method for correcting data models that you
can intuitively sense, just by looking at them, have something wrong
with them. A typical example is a single non-normalized Sales table that
mixes order, customer, and product data: any experienced database
designer will quickly observe that such a table has many defects, and
normalization would resolve them by splitting it into a set of smaller,
well-formed tables.

In practice, normalizing a model is a matter of bringing it to the third


normal form (3NF). Bringing a data model to 3NF maintains data
integrity, reduces redundancy, and optimizes the storage space
occupied by the database. The lower normal forms (1NF and 2NF) are
intermediate steps towards 3NF, while the higher normal forms (4NF
and 5NF) are rarely used. Read our article on NORMALIZATION AND
THE THREE NORMAL FORMS for more information.

#7: Data Typing

When designing a schema, you must choose the appropriate data type
for each column of each table. You’ll choose a data type according to
the nature and format of the information expected to be stored in that
column.

If, for example, you are creating a column where telephone numbers
will be stored, you could associate a numeric data type to it (such as
INT) if it will only store numbers. But, if it must also store other
characters – such as parentheses, hyphens, or spaces – the data type
should be VARCHAR.

On the other hand, if you create a column to store dates, common


database modeling tips suggest that the data type should be DATE,
DATETIME, or TIMESTAMP. If you define this column as VARCHAR,
dates can be stored in very different formats (e.g. '18-Jan-2023', '2023-
01-18', '01/18/23'); it will be impossible to use this column as a search
criterion or as a filter for queries.
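
A short sketch (hypothetical contact table) showing data types chosen to match the nature of each column:

CREATE TABLE contact (
    contact_id   INT PRIMARY KEY,
    phone_number VARCHAR(20),   -- allows digits plus '(', ')', '-' and spaces
    birth_date   DATE           -- stored as a real date, so it can be compared and filtered reliably
);

-- A date-typed column can be used directly as a search criterion:
SELECT contact_id FROM contact WHERE birth_date >= '2000-01-01';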

#8: Indexing

Indexes are data structures that make it easy for the database engine to
locate the row(s) that meet a given search criterion. Each index is
associated with a table and contains the values of one or more columns
of that table. Read our article WHAT IS A DATABASE INDEX? for more
information.
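
For example, a query that filters on customer_name can be sped up with an index on that column (a sketch; the index and table names are hypothetical):

-- Creates a secondary index on the customer_name column of the customer table
CREATE INDEX idx_customer_name ON customer (customer_name);

-- Queries filtering on customer_name can now use the index instead of scanning the whole table:
SELECT customer_id FROM customer WHERE customer_name = 'Kaelyn Hodge';
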
#9: Schema Partitioning

Large schemas are difficult to read and manage when the totality of
their tables exceeds the dimensions of a medium-sized poster or a
couple of screens. At this point, partitioning the schema becomes
necessary so that the schema can be visualized in sections.

The criterion used to partition a large schema is up to the designer and
the consumers of that schema (developers, DBAs, database architects,
etc.). The idea is to choose the partitioning criterion that is most useful.

#10: Authentication and Access Control

Controlling access to a system through user authentication is one of the


most basic principles to prevent data misuse and promote information
security.

User authentication is very familiar to all of us; it’s how a system,


database, program, etc. ensures that 1) a user is who they say they are,
and 2) the user only accesses the information or parts of the system
they are entitled to see.

A schema intended to provide authentication and access control must


allow for registering new users, supporting different authentication
factors, and providing options for recovering passwords. It must also
protect authentication data from unauthorized users and define user
permissions by roles and levels.
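
As a simple sketch of role-based permissions, using PostgreSQL-style syntax (role and user names are hypothetical; other systems use slightly different statements):

-- Create a role, give it limited rights, and assign it to a user
CREATE ROLE sales_reader;
GRANT SELECT ON customer TO sales_reader;
GRANT sales_reader TO report_user;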

#11: Conceal Sensitive Information

Every database must be prepared to resist hacking and attempts to


access data by unauthorized users. Even if security mechanisms such as
firewalls and password policies are in place, database design is the last
line of defense to protect information when all other methods fail.

There are several things that you, as a designer, can do to minimize the
risks of unauthorized access to information. One of them is to provide
columns that support encrypted or hashed data. String encryption and
hashing techniques alter the length of character strings and the set of
characters that can be allowed. When you’re defining VARCHAR
columns to store data that can be encrypted or hashed, you must take
into account both the maximum length and the range of characters
they can have.

#12: Don't Store Authentication Keys

Good design practices for security and user authentication include not
storing authentication keys (passwords) in recoverable form, even
encrypted ones. All encrypted data carries the risk of being decrypted.
For this reason, one-way (non-invertible) hash functions are used to
protect keys: there is no way to use a hash function result to obtain the
original data. Instead of storing the encrypted key, only the hash of
that key is stored.

A hashed key, even if it does not allow finding the original key, serves as
an authentication mechanism: if the hash of a password entered during
a login session matches the hash stored for the user trying to log in,
then there is no doubt that the password entered is the correct one.

It is important to restrict write permissions for the table and column
where a hashed password is stored. This helps prevent potential
attackers from altering the stored hash to one that corresponds to a
known password.
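
A minimal sketch (hypothetical table; the hash itself is computed in the application, for example with bcrypt or SHA-256, and the plain password is never stored):

CREATE TABLE app_user (
    user_id       INT PRIMARY KEY,
    user_name     VARCHAR(50) NOT NULL UNIQUE,
    password_hash CHAR(64) NOT NULL    -- e.g. a hex-encoded SHA-256 digest
);

-- At login, the application looks up the stored hash for the user and compares it
-- with the hash of the password just entered:
SELECT password_hash FROM app_user WHERE user_name = 'alice';
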
Functional Dependencies

A functional dependency (FD) is a relationship between two attributes,


typically between the PK and other non-key attributes within a table.
For any relation R, attribute Y is functionally dependent on attribute X
(usually the PK), if for every valid instance of X, that value of X uniquely
determines the value of Y

In the first example, below, SIN determines Name, Address and


Birthdate. Given SIN, we can determine any of the other attributes
within the table.

SIN —> Name, Address, Birthdate

For the second example, SIN and Course determine the date completed
(DateCompleted). This must also work for a composite PK.

SIN, Course —> DateCompleted

The third example indicates that ISBN determines Title.

ISBN —> Title

Rules of Functional Dependencies

Inference Rules

Armstrong's axioms are a set of inference rules used to infer all the functional dependencies on a
relational database. They were developed by William W. Armstrong. In the notation used below,
X —> Y means that attribute set X determines attribute set Y.

Axiom of reflexivity

This axiom says that if Y is a subset of X, then X determines Y (see Figure 11.1).

Figure 11.1. Equation for the axiom of reflexivity: if Y ⊆ X, then X —> Y.

Partial dependency

A partial dependency exists when a non-key attribute is dependent on
only part of a composite PK rather than fully dependent on the PK. In
the example shown below, StudentName, Address, City, Prov, and PC
(postal code) are dependent only on StudentNo, not on the full
composite PK (StudentNo, Course), while Grade and DateCompleted
depend on both.

StudentNo, Course —> StudentName, Address, City, Prov, PC, Grade,
DateCompleted

This situation is not desirable because every non-key attribute should
be fully dependent on the PK. Here, the student information depends
on only part of the PK (StudentNo).

To fix this problem, we need to break the original table down into two
as follows:

 Table 1: StudentNo, Course, Grade, DateCompleted

 Table 2: StudentNo, StudentName, Address, City, Prov, PC

Axiom of transitivity

The axiom of transitivity says if X determines Y, and Y determines Z,
then X must also determine Z (see Figure 11.3).

Figure 11.3. Equation for the axiom of transitivity: if X —> Y and Y —> Z, then X —> Z.

The table below has information not directly related to the student; for
instance, ProgramID and ProgramName should have a table of its own.
ProgramName is not dependent on StudentNo; it’s dependent on
ProgramID.

StudentNo —> StudentName, Address, City, Prov, PC, ProgramID,


ProgramName

This situation is not desirable because a non-key attribute


(ProgramName) depends on another non-key attribute (ProgramID).

To fix this problem, we need to break this table into two: one to hold
information about the student and the other to hold information about
the program.

 Table 1: StudentNo —> StudentName, Address, City, Prov, PC,


ProgramID

 Table 2: ProgramID —> ProgramName

However, we still need to leave an FK (ProgramID) in the student table
so that we can identify which program the student is enrolled in.
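
Expressed as table definitions, the fix might look like this (data types are assumptions):

CREATE TABLE program (
    ProgramID   INT PRIMARY KEY,
    ProgramName VARCHAR(100)
);

CREATE TABLE student (
    StudentNo   INT PRIMARY KEY,
    StudentName VARCHAR(100),
    Address     VARCHAR(100),
    City        VARCHAR(50),
    Prov        CHAR(2),
    PC          VARCHAR(10),
    ProgramID   INT,
    -- the FK kept in the student table identifies the student's program
    CONSTRAINT fk_student_program FOREIGN KEY (ProgramID) REFERENCES program (ProgramID)
);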

Union

This rule suggests that if two tables are separate, and the PK is the
same, you may want to consider putting them together. It states that if
X determines Y and X determines Z, then X must also determine Y and Z
(see Figure 11.4).

Figure 11.4. Equation for the union rule: if X —> Y and X —> Z, then X —> YZ.
Decomposition

Decomposition is the reverse of the Union rule. If you have a table that
appears to contain two entities that are determined by the same PK,
consider breaking them up into two tables. This rule states that if X
determines Y and Z, then X determines Y and X determines Z separately
(see Figure 11.5).

Figure 11.5. Equation for the decomposition rule: if X —> YZ, then X —> Y and X —> Z.

Dependency Diagram

A dependency diagram, shown in Figure 11.6, illustrates the various


dependencies that might exist in a non-normalized table. A non-
normalized table is one that has data redundancy in it.

Normalization (of Data Values)

The term normalization is also used in data analysis to mean scaling:
the goal is to bring every data point onto the same scale so that each
feature is equally important. For example, house data can be rescaled
using min-max normalization.

Min-Max Normalization

Min-max normalization is one of the most common ways to normalize


data. For every feature, the minimum value of that feature gets
transformed into a 0, the maximum value gets transformed into a 1,
and every other value gets transformed into a decimal between 0 and
1.
For example, if the minimum value of a feature was 20, and the
maximum value was 40, then 30 would be transformed to about
0.5 since it is halfway between 20 and 40.

Z-Score Normalization

Min-max normalization is sensitive to outliers, because a single extreme
value stretches the 0-1 range. Z-score normalization avoids this issue by
transforming each value into the number of standard deviations it lies
from the mean: subtract the mean of the feature and divide by its
standard deviation.
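
Both transformations can be written directly in SQL with window functions (a sketch assuming a hypothetical house table with a price column; the standard deviation function is named STDDEV in PostgreSQL and MySQL but STDEV in SQL Server):

SELECT
    price,
    -- min-max: (x - min) / (max - min), scaled into the range 0..1
    (price - MIN(price) OVER ()) * 1.0 / (MAX(price) OVER () - MIN(price) OVER ()) AS price_minmax,
    -- z-score: (x - mean) / standard deviation
    (price - AVG(price) OVER ()) / STDDEV(price) OVER () AS price_zscore
FROM house;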

What is Database Normalization?

In database management systems (DBMS), normal forms are a series of


guidelines that help to ensure that the design of a database is efficient,
organized, and free from data anomalies. There are several levels of
normalization, each with its own set of guidelines, known as normal
forms.

Important Points Regarding Normal Forms in DBMS

 First Normal Form (1NF): This is the most basic level of


normalization. In 1NF, each table cell should contain only a single
value, and each column should have a unique name. The first
normal form helps to eliminate duplicate data and simplify
queries.

 Second Normal Form (2NF): 2NF eliminates redundant data by


requiring that each non-key attribute be dependent on the
primary key. This means that each column should be directly
related to the primary key, and not to other columns.

 Third Normal Form (3NF): 3NF builds on 2NF by requiring that all
non-key attributes are independent of each other. This means
that each column should be directly related to the primary key,
and not to any other columns in the same table.

 Boyce-Codd Normal Form (BCNF): BCNF is a stricter form of 3NF


that ensures that each determinant in a table is a candidate key.
In other words, BCNF ensures that each non-key attribute is
dependent only on the candidate key.

 Fourth Normal Form (4NF): 4NF is a further refinement of BCNF


that ensures that a table does not contain any multi-valued
dependencies.

 Fifth Normal Form (5NF): 5NF is the highest level of normalization


and involves decomposing a table into smaller tables to remove
data redundancy and improve data integrity.

Normal forms help to reduce data redundancy, increase data


consistency, and improve database performance. However, higher
levels of normalization can lead to more complex database designs and
queries. It is important to strike a balance between normalization and
practicality when designing a database

Advantages of Normal Form

 Reduced data redundancy: Normalization helps to eliminate


duplicate data in tables, reducing the amount of storage space
needed and improving database efficiency.

 Improved data consistency: Normalization ensures that data is


stored in a consistent and organized manner, reducing the risk of
data inconsistencies and errors.
 Simplified database design: Normalization provides guidelines for
organizing tables and data relationships, making it easier to design
and maintain a database.

 Improved query performance: Normalized tables are typically


easier to search and retrieve data from, resulting in faster query
performance.

 Easier database maintenance: Normalization reduces the


complexity of a database by breaking it down into smaller, more
manageable tables, making it easier to add, modify, and delete
data.

Applications of Normal Forms in DBMS

 Data consistency: Normal forms ensure that data is consistent


and does not contain any redundant information. This helps to
prevent inconsistencies and errors in the database.

 Data redundancy: Normal forms minimize data redundancy by


organizing data into tables that contain only unique data. This
reduces the amount of storage space required for the database
and makes it easier to manage.

 Query performance: Normal forms can improve query


performance by reducing the number of joins required to retrieve
data. This helps to speed up query processing and improve overall
system performance.

 Database maintenance: Normal forms make it easier to maintain


the database by reducing the amount of redundant data that
needs to be updated, deleted, or modified. This helps to improve
database management and reduce the risk of errors or
inconsistencies.

 Database design: Normal forms provide guidelines for designing


databases that are efficient, flexible, and scalable. This helps to
ensure that the database can be easily modified, updated, or
expanded as needed.

Integrity Constraints

o Integrity constraints are a set of rules used to maintain the quality of information.

o Integrity constraints ensure that data insertion, updating, and other processes are
performed in such a way that data integrity is not affected.

o Thus, integrity constraints are used to guard against accidental damage to the database.

Types of Integrity Constraint

1. Domain constraints
o Domain constraints can be defined as the definition of a valid set
of values for an attribute.

o The data type of domain includes string, character, integer, time,


date, currency, etc. The value of the attribute must be available in
the corresponding domain.

Example:
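
(A sketch of a domain constraint with hypothetical names: the column's data type and a CHECK clause define the set of valid values.)

CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(100) NOT NULL,
    semester INT CHECK (semester BETWEEN 1 AND 8)   -- only values 1 to 8 belong to this attribute's domain
);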

3. Referential Integrity Constraints

o A referential integrity constraint is specified between two tables.

o In the Referential integrity constraints, if a foreign key in Table 1


refers to the Primary Key of Table 2, then every value of the
Foreign Key in Table 1 must be null or be available in Table 2.

Example:
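
(A sketch with hypothetical names: the dept_id foreign key in the staff table must either be null or match an existing department.)

CREATE TABLE department (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50)
);

CREATE TABLE staff (
    staff_id INT PRIMARY KEY,
    dept_id  INT,   -- may be NULL; otherwise it must exist in department
    CONSTRAINT fk_staff_dept FOREIGN KEY (dept_id) REFERENCES department (dept_id)
);
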
4. Key constraints

o A key is an attribute (or set of attributes) that is used to uniquely
identify an entity within its entity set.

o An entity set can have multiple keys, but one of them is chosen as
the primary key. A primary key can contain only unique values and
can never be null in the relational table.

Example:
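
(A sketch with hypothetical names: student_no is the primary key, and email is an alternate key enforced with UNIQUE.)

CREATE TABLE student_record (
    student_no INT PRIMARY KEY,      -- primary key: unique and never null
    email      VARCHAR(100) UNIQUE,  -- alternate key: unique, but may be null
    full_name  VARCHAR(100)
);
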
What is SQL?
Structured query language (SQL) is a programming language for storing
and processing information in a relational database. A relational
database stores information in tabular form, with rows and columns
representing different data attributes and the various relationships
between the data values. You can use SQL statements to store, update,
remove, search, and retrieve information from the database. You can
also use SQL to maintain and optimize database performance.

Why is SQL important?

Structured query language (SQL) is a popular query language that is


frequently used in all types of applications. Data analysts and
developers learn and use SQL because it integrates well with different
programming languages. For example, they can embed SQL queries
with the Java programming language to build high-performing data
processing applications with major SQL database systems such as
Oracle or MS SQL Server. SQL is also fairly easy to learn as it uses
common English keywords in its statements

What are the components of a SQL system?

Relational database management systems use structured query


language (SQL) to store and manage data. The system stores multiple
database tables that relate to each other. MS SQL Server, MySQL, or MS
Access are examples of relational database management systems. The
following are the components of such a system.

SQL table
A SQL table is the basic element of a relational database. The SQL
database table consists of rows and columns. Database engineers
create relationships between multiple database tables to optimize data
storage space.

For example, the database engineer creates a SQL table for products in
a store:

Product ID Product Name Color ID

0001 Mattress Color 1

0002 Pillow Color 2

Then the database engineer links the product table to the color table
with the Color ID:

Color ID Color Name

Color 1 Blue

Color 2 Red

SQL statements

SQL statements, or SQL queries, are valid instructions that relational


database management systems understand. Software developers build
SQL statements by using different SQL language elements. SQL
language elements are components such as identifiers, variables, and
search conditions that form a correct SQL statement.

For example, the following SQL statement uses a SQL INSERT command
to store Mattress Brand A, priced $499, into a table
named Mattress_table, with column names brand_name and cost:

INSERT INTO Mattress_table (brand_name, cost)
VALUES ('A', 499);

Stored procedures

Stored procedures are a collection of one or more SQL statements


stored in the relational database. Software developers use stored
procedures to improve efficiency and performance. For example, they
can create a stored procedure for updating sales tables instead of
writing the same SQL statement in different applications.
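
A minimal sketch of a stored procedure in T-SQL-style syntax, since the surrounding text mentions MS SQL Server (procedure, table, and column names are hypothetical):

CREATE PROCEDURE update_sales_total
    @product_id INT,
    @amount     DECIMAL(10,2)
AS
BEGIN
    -- Add the new amount to the running total for the given product
    UPDATE sales_totals
    SET total = total + @amount
    WHERE product_id = @product_id;
END;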

How does SQL work?

Structured query language (SQL) implementation involves a server


machine that processes the database queries and returns the results.
The SQL process goes through several software components, including
the following.

Parser

The parser starts by tokenizing, or replacing, some of the words in the


SQL statement with special symbols. It then checks the statement for
the following:

Correctness

The parser verifies that the SQL statement conforms to SQL semantics,
or rules, that ensure the correctness of the query statement. For
example, the parser checks if the SQL command ends with a semi-
colon. If the semi-colon is missing, the parser returns an error.

Authorization
The parser also validates that the user running the query has the
necessary authorization to manipulate the respective data. For
example, only admin users might have the right to delete data.

Relational engine

The relational engine, or query processor, creates a plan for retrieving,


writing, or updating the corresponding data in the most effective
manner. For example, it checks for similar queries, reuses previous data
manipulation methods, or creates a new one. It writes the plan in an
intermediate-level representation of the SQL statement called byte
code. Relational databases use byte code to efficiently perform
database searches and modifications.

Storage engine

The storage engine, or database engine, is the software component


that processes the byte code and runs the intended SQL statement. It
reads and stores the data in the database files on physical disk storage.
Upon completion, the storage engine returns the result to the
requesting application.

What are SQL commands?

Structured query language (SQL) commands are specific keywords or


SQL statements that developers use to manipulate the data stored in a
relational database. You can categorize SQL commands as follows.

Data definition language

Data definition language (DDL) refers to SQL commands that design the
database structure. Database engineers use DDL to create and modify
database objects based on the business requirements. For example, the
database engineer uses the CREATE command to create database
objects such as tables, views, and indexes.

Data query language

Data query language (DQL) consists of instructions for retrieving data


stored in relational databases. Software applications use the SELECT
command to filter and return specific results from a SQL table.

Data manipulation language

Data manipulation language (DML) statements write new information


or modify existing records in a relational database. For example, an
application uses the INSERT command to store a new record in the
database.

Data control language

Database administrators use data control language (DCL) to manage or


authorize database access for other users. For example, they can use
the GRANT command to permit certain applications to manipulate one
or more tables.

Transaction control language

The relational engine uses transaction control language (TCL) to
manage transactions, grouping changes so they can be committed or
undone as a unit. For example, the database uses the ROLLBACK
command to undo an erroneous transaction.
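
One example statement per category (table, column, and user names are hypothetical):

-- DDL: define structure
CREATE TABLE product (product_id INT PRIMARY KEY, product_name VARCHAR(100));

-- DML: write data
INSERT INTO product (product_id, product_name) VALUES (1, 'Mattress');

-- DQL: read data
SELECT product_name FROM product WHERE product_id = 1;

-- DCL: control access
GRANT SELECT ON product TO report_user;

-- TCL: control transactions
BEGIN TRANSACTION;
UPDATE product SET product_name = 'Pillow' WHERE product_id = 1;
ROLLBACK;   -- undo the update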

What is SQL injection?

SQL injection is a cyberattack that involves tricking the database with


SQL queries. Hackers use SQL injection to retrieve, modify, or corrupt
data in a SQL database. For example, they might fill in a SQL query
instead of a person's name in a submission form to carry out a SQL
injection attack.
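
As an illustration, suppose an application builds a query by pasting a form value directly into the SQL text (the app_user table is hypothetical):

-- Query the application intends to run when the user types "alice":
SELECT * FROM app_user WHERE user_name = 'alice';

-- If the attacker types   ' OR '1'='1   into the name field instead,
-- the concatenated text becomes a query that returns every row:
SELECT * FROM app_user WHERE user_name = '' OR '1'='1';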

What is MySQL?

MySQL is an open-source relational database management system


offered by Oracle. Developers can download and use MySQL without
paying a licensing fee. They can install MySQL on different operating
systems or cloud servers. MySQL is a popular database system for web
applications.

SQL vs. MySQL

Structured query language (SQL) is a standard language for database


creation and manipulation. MySQL is a relational database program
that uses SQL queries. While SQL commands are defined by
international standards, the MySQL software undergoes continual
upgrades and improvements.

What is NoSQL?

NoSQL refers to non-relational databases that don't use tables to store


data. Developers store information in different types of NoSQL
databases, including graphs, documents, and key-values. NoSQL
databases are popular for modern applications because they are
horizontally scalable. Horizontal scaling means increasing the
processing power by adding more computers that run NoSQL software.

SQL vs. NoSQL

Structured query language (SQL) provides a uniform data manipulation


language, but NoSQL implementation is dependent on different
technologies. Developers use SQL for transactional and analytical
applications, whereas NoSQL is suitable for responsive, heavy-usage
applications.

What is a SQL server?

SQL Server is the official name of Microsoft's relational database
management system, which manipulates data with SQL. MS SQL Server
has several editions, each designed for specific workloads and
requirements.

DATABASE ADMINISTRATION
Physical implementation of the data
The physical model describes the database in a specific working
environment that includes a specific database product, a specific
hardware and network configuration, and a specific level of data
update and retrieval activity. The physical implementation makes this
specification real. The implemented database contains objects (e.g.,
tables, views, indexes) that correspond to the objects in your physical
model.

We generally organize the physical implementation into the following
seven steps, which you can use as a checklist to make sure you've
performed all the tasks necessary to implement the physical design.

Step 1. Select a server. The Data Use Analysis and Data Volume Analysis
models define the guidelines for choosing a server with adequate CPU
power and enough hard disk capacity to see you through the first few
years of operation. The Data Use Analysis model lets you visually map
the important processes that run on a database. You then calculate
average and maximum read and write operations. From this analysis,
you can see the kind of processing power you need, then translate that
into the CPU model and the RAM you need for prime performance.

Step 2. Create a database. You can use the Data Volume Analysis
model again to guide you in sizing the user data and transaction log file.
This model gives you a rough idea of space requirements, which you
can translate into initial database file sizes.

Step 3. Create the database objects. You have two options for creating
the tables, indexes, views, constraints, stored procedures, and triggers
that make up an operational database. First, you can use the physical
data model to guide you in writing SQL scripts that you can later
execute, or you can create the objects directly by using Enterprise
Manager's graphical and programming interface. Second, if you've used
CASE software such as Visio 2000 to help with the modeling, you can let
the CASE software generate the scripts for you

Step 4. Load the data. How should you approach loading data into the
database? The answer depends on where the data is coming from (the
source) and how much data you need to load. If you don't have any
data to load when you first create the database (a highly unusual
situation), you need to concentrate only on the data-capture schemes
you plan to implement, such as data entry forms or automated capture
programs like those used in monitored environments such as
manufacturing sites and hospital intensive care units. Most likely, you'll
have to import data from comma-delimited flat files or transfer data
from other systems into your database. If you plan to import delimited
files, the bulk copy program (bcp) might be your best option. Bcp
creates minimal overhead and can quickly load data because it doesn't
generate a transaction log, so you don't have to worry about
transaction rollbacks, index updating, or constraint checking. But if you
need to import or transform (reorganize, restructure) data from other
database or nondatabase systems, you should use SQL Server's Data
Transformation Services

Step 5. Create and run security scripts. Creating security scripts is,
unfortunately, a task that you have to perform manually. You can use
the security matrix from "The Security Matrix," March 2000, as a guide
to building your SQL Server security scripts. You can also set up security
through Enterprise Manager, then have Enterprise Manager generate
the scripts.

Step 6. Create backup and other utility tasks. Now that you've created
your database and loaded some data, you need to implement your
disaster-avoidance plan. You can use the SQL Server 7.0 Database
Maintenance Plan Wizard to help set up scheduled backups for all user
and system databases. You can also use the Maintenance Plan wizard
to help set up tasks to reorganize data and index pages, update
statistics, recover unused space from database files, check for database
integrity, and generate reports about utility job activity.

Step 7. Perform miscellaneous tasks. If you aren't implementing


replication, you can take the evening off. However, if you're going to
use replication, you need to decide which type of replication to
implement: snapshot, transactional, or merge.

Structure of the file and index


An index access structure is usually constructed on a single field of a
record in a file, called an indexing field. Such an index typically stores
each value of the indexing field, along with a list of pointers to all disk
blocks that contain records with that field value. The values in the index
are usually sorted (ordered) so that we can perform an efficient binary
search on the index.


To give you an intuitive understanding of an index, we look at library


indexes again. For example, an author index in a library will have
entries for all authors whose books are stored in the library. AUTHOR is
the indexing field and all the names are sorted according to
alphabetical order. For a particular author, the index entry will contain
locations (i.e. pointers) of all the books this author has written. If you
know the name of the author, you will be able to use this index to find
his/her books quickly. What happens if you do not have an index to
use? This is similar to using a heap file and linear search. You will have
to browse through the whole library looking for the book.

An index file is much smaller than the data file, and therefore searching
the index using a binary search can be carried out quickly. Multilevel
indexing goes one step further in the sense that it eliminates the need
for a binary search, by building indexes to the index itself. We will be
discussing these techniques later on in the chapter.

Organising files and records on disk


In this section, we will briefly define the concepts of records, record
types and files. Then we will discuss various techniques for organising
file records on disk.

Record and record type

A record is the unit in which data is usually stored. Each record is a
collection of related data items, where each item is formed of one or
more bytes and corresponds to a particular field of the record. Records
usually describe entities and their attributes. A collection of field (item)
names and their corresponding data types constitutes a record type. In
short, we may say that a record type corresponds to an entity type, and
a record of a specific type represents an instance of the corresponding
entity type.

The following is an example of a record type and its record:

Figure 11.1 shows the STUDENT record type definition (figure not reproduced).

A specific record of the STUDENT type:

STUDENT(9901536, "James Bond", "1 Bond Street, London", "Intelligent
Services", 9)

Fixed-length and variable-length records in files

A file basically contains a sequence of records. Usually all records in a


file are of the same record type. If every record in the file has the same
size in bytes, the records are called fixed-length records. If records in
the file have different sizes, they are called variable-length records.

Variable-length records may exist in a file for the following reasons:

 Although they may be of the same type, one or more of the fields
may be of varying length. For instance, students' names are of
different lengths.

 The records are of the same type, but one or more of the fields
may be a repeating field with multiple values.

 If one or more fields are optional, not all records (of the same
type) will have values for them.

 A file may contain records of different record types. In this case,


records in the file are likely to have different sizes.

A repeating field needs one separator character to separate the repeating
values of the field, and another separator character to indicate
termination of the field. In short, we need to find out the exact size of a
variable-length record before allocating it to a block or blocks. It is also
apparent that programs that process files of variable-length records will
be more complex than those for fixed-length records, where the
starting position and size of each field are known and fixed.

We have seen that fixed-length records have advantages over variable-


length records with respect to storage and retrieving a field value
within the record. In some cases, therefore, it is possible and may also
be advantageous to use a fixed-length record structure to represent a
record that may logically be of variable length.
For example, we can use a fixed-length record structure that is large
enough to accommodate the largest variable-length record anticipated
in the file. For a repeating field, we could allocate as many spaces in
each record as the maximum number of values that the field can take.
In the case of optional fields, we may have every field included in every
file record. If an optional field is not applicable to a certain record, a
special null value is stored in the field. By adopting such an approach,
however, it is likely that a large amount of space will be wasted in
exchange for easier storage and retrieval.

Allocating records to blocks

The records of a file must be allocated to disk blocks because a block is


the unit of data transfer between disk and main memory. When the
record size is smaller than the block size, a block can accommodate
many such records. If a record has too large a size to be fitted in one
block, two or more blocks will have to be used.

In order to enable further discussion, suppose the size of a block is B
bytes, and a file contains fixed-length records of size R bytes. If B ≥ R,
then we can allocate bfr = ⌊B/R⌋ records into one block, where ⌊x⌋
is the floor function, which rounds the value x down to the nearest
integer. The value bfr is defined as the blocking factor for the file. For
example, if B = 4096 bytes and R = 300 bytes, then bfr = ⌊4096/300⌋ = 13
records per block.

File headers

A file normally contains a file header or file descriptor providing


information which is needed by programs that access the file records.
The contents of a header contain information that can be used to
determine the disk addresses of the file blocks, as well as to record
format descriptions, which may include field lengths and order of fields
within a record for fixed-length unspanned records, separator
characters, and record type codes for variable-length records.

To search for a record on disk, one or more blocks are transferred into
main memory buffers. Programs then search for the desired record or
records within the buffers, using the header information.

If the address of the block that contains the desired record is not
known, the programs have to carry out a linear search through the
blocks. Each block is loaded into a buffer and checked until either the
record is found or all the blocks have been searched unsuccessfully
(which means the required record is not in the file). This can be very
time-consuming for a large file. The goal of a good file organisation is to
locate the block that contains a desired record with a minimum number
of block transfers.

Operations on files

Operations on files can usually be grouped into retrieval operations and


update operations. The former do not change anything in the file, but
only locate certain records for further processing. The latter change the
file by inserting or deleting or modifying some records.

Typically, a DBMS can issue requests to carry out the following


operations (with assistance from the operating-system file/disk
managers):

 Find (or Locate): Searches for the first record satisfying a search
condition (a condition specifying the criteria that the desired
records must satisfy). Transfers the block containing that record
into a buffer (if it is not already in main memory). The record is
located in the buffer and becomes the current record (ready to be
processed).

 Read (or Get): Copies the current record from the buffer to a
program variable. This command may also advance the current
record pointer to the next record in the file.

 FindNext: Searches for the next record in the file that satisfies the
search condition. Transfers the block containing that record into a
buffer, and the record becomes the current record.

 Delete: Deletes the current record and updates the file on disk to
reflect the change requested.

 Modify: Modifies some field values for the current record and
updates the file on disk to reflect the modification.

 Insert: Inserts a new record in the file by locating the block where
the record is to be inserted, transferring that block into a buffer,
writing the (new) record into the buffer, and writing the buffer to
the disk file to reflect the insertion.

 FindAll: Locates all the records in the file that satisfy a search
condition.

 FindOrdered: Retrieves all the records in the file in a specified


order.

 Reorganise: Rearranges records in a file according to certain


criteria. An example is the 'sort' operation, which organises
records according to the values of specified field(s).
 Open: Prepares a file for access by retrieving the file header and
preparing buffers for subsequent file operations.

 Close: Signals the end of using a file.

Before we move on, two concepts must be clarified:

 File organisation: This concept generally refers to the


organisation of data into records, blocks and access structures. It
includes the way in which records and blocks are placed on disk
and interlinked. Access structures are particularly important. They
determine how records in a file are interlinked logically as well as
physically, and therefore dictate what access methods may be
used.

 Access method: This consists of a set of programs that allow


operations to be performed on a file. Some access methods can
only be applied to files organised in certain ways. For example,
indexed access methods can only be used in indexed files.
Understanding concurrency control
Concurrency control refers to the various techniques that are
used to preserve the integrity of the database when multiple
users are updating rows at the same time. Incorrect concurrency
can lead to problems such as dirty reads, phantom reads, and
non-repeatable reads. The Microsoft JDBC Driver for SQL Server
provides interfaces to all the concurrency techniques used by SQL
Server to resolve these issues.

Concurrency Control Protocols

Different concurrency control protocols offer different trade-offs between
the amount of concurrency they allow and the amount of overhead
that they impose. Following are the concurrency control techniques in
DBMS:

 Lock-Based Protocols
 Two Phase Locking Protocol
 Timestamp-Based Protocols
 Validation-Based Protocols

Lock-based Protocols
Lock Based Protocols in DBMS is a mechanism in which a
transaction cannot Read or Write the data until it acquires an
appropriate lock. Lock based protocols help to eliminate the
concurrency problem in DBMS for simultaneous transactions by
locking or isolating a particular transaction to a single user.
A lock is a data variable which is associated with a data item. The
lock signifies which operations can be performed on the data item.
Locks in DBMS help synchronize access to the database items by
concurrent transactions.

All lock requests are made to the concurrency-control manager.


Transactions proceed only once the lock request is granted.

Binary Locks: A binary lock on a data item can be in either a locked or
an unlocked state.

Shared/exclusive: This type of locking mechanism separates the


locks in DBMS based on their uses. If a lock is acquired on a data
item to perform a write operation, it is called an exclusive lock.

1. Shared Lock (S):

A shared lock is also called a read-only lock. With a shared lock,
the data item can be shared between transactions, because a
transaction holding only a shared lock cannot update the data item.

For example, consider a case where two transactions are reading the
account balance of a person. The database will let them both read by
placing a shared lock. However, if another transaction wants to update
that account's balance, the shared lock prevents it until the reading
process is over.

2. Exclusive Lock (X):

With the Exclusive Lock, a data item can be read as well as written.
This is exclusive and can’t be held concurrently on the same data
item. X-lock is requested using lock-x instruction. Transactions may
unlock the data item after finishing the ‘write’ operation.

For example, when a transaction needs to update the account balance
of a person, you can allow this transaction by placing an X lock on the
data item. Therefore, when a second transaction wants to read or write
it, the exclusive lock prevents that operation (see the sketch below).
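
A sketch of the two lock modes using PostgreSQL/MySQL 8-style row-locking syntax (the accounts table is hypothetical; older MySQL versions use LOCK IN SHARE MODE instead of FOR SHARE):

-- Session 1: shared (read) lock on the row
BEGIN;
SELECT balance FROM accounts WHERE account_id = 1 FOR SHARE;

-- Session 2: exclusive (write) lock on the same row; it must wait until session 1 commits
BEGIN;
SELECT balance FROM accounts WHERE account_id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
COMMIT;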

3. Simplistic Lock Protocol

This type of lock-based protocol allows transactions to obtain a lock
on every object before beginning operation. Transactions may unlock
the data item after finishing the 'write' operation.

4. Pre-claiming Locking

The pre-claiming lock protocol evaluates operations and creates a
list of the data items that are needed to initiate an execution
process. If all of the requested locks are granted, the transaction
executes; after all of its operations are over, all the locks are
released.

Starvation

Starvation is the situation when a transaction needs to wait for an


indefinite period to acquire a lock.

Following are the reasons for Starvation:

 When the waiting scheme for locked items is not properly managed

 In the case of a resource leak

 When the same transaction is repeatedly selected as a victim

Two Phase Locking Protocol


Two Phase Locking Protocol, also known as the 2PL protocol, is a
method of concurrency control in DBMS that ensures serializability by
applying locks to a transaction's data, which blocks other transactions
from accessing the same data simultaneously. The Two Phase Locking
protocol helps to eliminate the concurrency problem in DBMS.
This locking protocol divides the execution of a transaction into
three parts.

 In the first (growing) phase, when the transaction begins to execute,
it acquires the locks it needs; no lock is released during this phase.
 The second part is the lock point, where the transaction has obtained
all its locks. When the transaction releases its first lock, the third
phase starts.
 In this third (shrinking) phase, the transaction cannot demand any new
locks. Instead, it only releases the acquired locks.

What is database security?

Database security is the processes, tools, and controls that secure and
protect databases against accidental and intentional threats. The
objective of database security is to secure sensitive data and maintain
the confidentiality, availability, and integrity of the database. In
addition to protecting the data within the database, database security
protects the database management system and associated
applications, systems, physical and virtual servers, and network
infrastructure.

To answer the question "what is database security," it's important to


acknowledge that there are several types of security risks. Database
security must guard against human error, excessive employee database
privileges, hacker and insider attacks, malware, backup storage media
exposure, physical damage to database servers, and vulnerable
databases such as unpatched databases or those with too much data in
buffers.

Types of database security

To achieve the highest degree of database security, organizations need


multiple layers of data protection. To that end, a defense in depth (DiD)
security strategy places multiple controls across the IT system. If one
layer of protection fails, then another is in place to immediately prevent
the attack, as illustrated below.

Network security

 Firewalls serve as the first line of defense in DiD database


security. Logically, a firewall is a separator or restrictor of network
traffic, which can be configured to enforce your organization's
data security policy. If you use a firewall, you will increase security
at the operating system level by providing a chokepoint where
your security measures can be focused

Access management

 Authentication is the process of proving the user is who he or she


claims to be by entering the correct user ID and password. Some
security solutions allow administrators to centrally manage the
identities and permissions of database users in one central
location. This includes the minimization of password storage and
enables centralized password rotation policies.
 Authorization allows each user to access certain data objects and
perform certain database operations like read but not modify
data, modify but not delete data, or delete data.

 Access control is managed by the system administrator who


assigns permissions to a user within a database. Permissions are
ideally managed by adding user accounts to database roles and
assigning database-level permissions to those roles. For
example, row-level security (RLS) allows database administrators
to restrict read and write access to rows of data based on a user's
identity, role memberships, or query execution context. RLS
centralizes the access logic within the database itself, which
simplifies the application code and reduces the risk of accidental
data disclosure

Threat protection

 Auditing tracks database activities and helps maintain compliance


with security standards by recording database events to an audit
log. This allows you to monitor ongoing database activities, as well
as analyze and investigate historical activity to identify potential
threats or suspected abuse and security violations.

 Threat detection uncovers anomalous database activities that


indicate a potential security threat to the database and can
surface information about suspicious events directly to the
administrator.

Information protection

 Data encryption secures sensitive data by converting it into an


alternative format so only the intended parties can decipher it
back to its original form and access it. Although encryption
doesn't solve access control problems, it enhances security by
limiting data loss when access controls are bypassed. For
example, if the database host computer is misconfigured and a
malicious user obtains sensitive data, such as credit card numbers,
that stolen information might be useless if it’s encrypted.

 Database backup and recovery is critical to protecting
information. This process involves making backup copies of the
database and log files on a regular basis and storing the copies in
a secure location. The backup copies and files are available to
restore the database in the event of a security breach or failure.

 Physical security strictly limits access to the physical server and


hardware components. Many organizations with on-premises
databases use locked rooms with restricted access for the
database server hardware and networking devices. It's also
important to limit access to backup media by storing it at a secure
offsite location.

Set database configuration parameters

Set database configuration parameters in Planning Analytics Workspace

Procedure

1. On the Databases page, click the Databases tab.

2. In the list of databases, click the database for which you want to
set configuration parameters.

3. Click the Configuration tab.


4. Locate the parameter that you want to modify, then set a new
parameter value.

5. Click Apply.

Database configuration parameters are identified on the Configuration


tab as either dynamic or static. Dynamic parameter values can be
changed without requiring a database restart. Changes to static
parameter values require a database restart.

You can click Restore to original values at any time to restore all
of the listed parameters to their default values.

Set database configuration parameters in Planning Analytics Workspace.

Database configuration parameters are identified on the Database


configuration page as either dynamic or static. Dynamic parameter
values can be changed without requiring a database restart. Changes to
static parameter values require a database restart.

If you modify a static parameter, you receive notification that the


database must be restarted before the new parameter value can be
applied. You can restart the database from the Planning Analytics
Monitoring dashboard.

You can click Restore at any time to restore all of the listed
parameters to their default values.

Optional parameters follow the restore command and positional


parameters.

The following are detailed descriptions of each of the optional


parameters:
/BACKUPDESTination=TSM|LOCAL

Use the /BACKUPDESTination parameter to specify the location from


where the backup is to be restored. The default is the value (if present)
specified in the Data Protection for SQL Server preferences file. If no
value is present, the backup is restored from IBM Spectrum Protect
server storage.

You can specify:

What is a distributed database?

A distributed database is a database that runs and stores data across


multiple computers, as opposed to doing everything on a single
machine.

Typically, distributed databases operate on two or more interconnected


servers on a computer network. Each location where a version of the
database is running is often called an instance or a node.

A distributed database, for example, might have instances running in


New York, Ohio, and California. Or it might have instances running on
three separate machines in New York. A traditional single-instance
database, in contrast, only runs in a single location on a single machine.

What is a distributed database used for?

There are different types of distributed databases and


different distributed database configuration options, but in general
distributed databases offer several advantages over traditional, single-
instance databases:
Distributing the database increases resilience and reduces risk. If a
single-instance database goes offline (due to a power outage, machine
failure, scheduled maintenance, or anything else) all of the application
services that rely on it will go offline, too. Distributed databases, in
contrast, are typically configured with replicas of the same data across
multiple instances, so if one instance goes offline, other instances can
pick up the slack, allowing the application to continue operating.

Different distributed database types and configurations handle outages


differently, but in general almost any distributed database should be
able to handle outages better than a single-instance database.

Distributed databases are generally easier to scale. As an application grows to serve more users, the storage and computing requirements for the database will increase over time, and not always at a predictable rate.

Trying to keep up with this growth when using a single-instance database is difficult: you either have to pay for more than you need so that your database has "room to grow" in terms of storage and computing power, or you have to navigate regular hardware upgrades and migrations to ensure the database instance is always running on a machine that's capable of handling the current load.

Distributed databases, in contrast, can scale horizontally simply by adding an additional instance or node. In some cases this process is manual (although it can be scripted), and in the case of serverless databases it is entirely automated. In almost all cases, the process of scaling a distributed database up and down is more straightforward than trying to do the same with a single-instance database.

Distributing the database can improve performance. Depending on how it is configured, a distributed database may be able to operate more efficiently than a single-instance database because it can spread the computing workload between multiple instances rather than being bottlenecked by having to perform all reads and writes on the same machine.

Geographically distributing the database can reduce latency. Although not all distributed databases support multi-region deployments, those that do can also improve application performance for users by reducing latency. When data can be located on a database instance that is geographically close to the user who is requesting it, that user will likely have a lower-latency application experience than a user whose application needs to pull data from a database instance that's (for example) on the other side of the globe.

Depending on the specific type, configuration, and deployment choices an organization makes, there may be additional benefits to using a distributed database. Let's look at some of the options that are available when it comes to distributed databases.


Humans have been storing data in various formats for millennia, of course, but the modern era of computerized databases really began with Edgar F. Codd and the invention of the relational (SQL) database. Relational databases store data in tables and enforce rules, called schemas, about what types of data can be stored where and how the data relate to each other.

Relational databases and SQL, the programming language used to configure and query them, caught on in the 1970s and quickly became the default database type for virtually all computerized data storage. Transactional applications, in particular, quickly came to rely on relational databases for their ability to support ACID transactional guarantees: in essence, to ensure that transactions are processed correctly, can't interfere with each other, and remain true once they're committed, even if the database subsequently goes offline.

After the explosion of the internet, though, it became clear that there were limitations to the traditional relational database. In particular, it wasn't easy to scale, it wasn't built to function well in cloud environments, and distributing it across multiple instances required complex, manual work called sharding.

Distributed database configurations: active-passive vs. active-active vs. multi-active

One of the main goals of a distributed database is high availability: making sure the database and all of the data it contains are available at all times. But when a database is distributed, its data is replicated across multiple physical instances, and there are several different ways to approach configuring those replicas.

Active-passive

The first, and simplest, is an active-passive configuration. In an active-passive configuration, all traffic is routed to a single "active" replica, and then copied to the other replicas for backup.

In a three-node deployment, for example, all data might be written to an active replica on node 1 and then subsequently copied to passive replicas on nodes 2 and 3.

This approach is straightforward, but it does introduce potential problems. In addition to the performance bottleneck that routing all reads and writes to a specific replica can present, problems can also arise depending on how new data is written to the passive "follower" replicas:

 If the data is replicated synchronously (immediately) and writing to one of the "follower" replicas fails, then you must either sacrifice availability (the database will become unavailable unless all three replicas are online) or consistency (the database may have replicas with conflicting data, as an update can be written to the active replica but fail to write to one of the passive follower replicas).

 If the data is replicated asynchronously, there's no way to guarantee that data makes it to the passive follower replicas (one could be online when the data is written to the active replica but go offline when the data is subsequently replicated to the passive followers). This introduces the possibility of inconsistencies and even potentially data loss.

In summary, active-passive systems offer one of the most straightforward configuration options, particularly if you're trying to manually adapt a traditional relational database for a distributed deployment. But they also introduce risks and trade-offs that can impact database availability and consistency.

Active-active

In active-active configurations, there are multiple active replicas, and traffic is routed to all of them. This reduces the potential impact of a replica being offline, since other replicas will handle the traffic automatically.

However, active-active setups are much more difficult to configure for most workloads, and it is still possible for consistency issues to arise if an outage happens at the wrong time.

Multi-active

Multi-active is the availability model used by CockroachDB, which attempts to offer a better alternative to active-passive and active-active configurations.

Like active-active configurations, all replicas can handle both reads and writes in a multi-active system. But unlike active-active, multi-active systems eliminate the possibility of inconsistencies by using a consensus replication system, where writes are only committed when a majority of replicas confirm they've received the write.
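To make the consensus idea concrete, here is a minimal Python sketch of majority-quorum commit logic: a write counts as committed only when more than half of the replicas acknowledge it. This illustrates the general principle only, not CockroachDB's actual replication protocol; the replica object and its apply method are hypothetical stand-ins.

def quorum_write(replicas, key, value):
    """Attempt the write on every replica; commit only on a majority of acknowledgements."""
    acks = 0
    for replica in replicas:
        try:
            replica.apply(key, value)   # hypothetical per-replica write call
            acks += 1
        except ConnectionError:
            pass                        # an unreachable replica simply does not acknowledge
    majority = len(replicas) // 2 + 1
    return acks >= majority             # True = committed, False = rejected

With three replicas, the write succeeds as long as at least two of them acknowledge it, which is why a single offline replica does not block the system.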

Pros and cons of distributed databases

We've already discussed the pros of distributed databases earlier in this article, but to quickly review, the reasons to use a distributed database are generally:

 High availability (data is replicated so that the database remains online even if a machine goes down).

 High scalability (they can be easily scaled horizontally by adding instances/nodes).

 Improved performance (depending on type, configuration, and workload).

 Reduced latency (for distributed databases that support multi-region deployments).

Beyond those, specific distributed databases may offer additional appealing features. CockroachDB, for example, allows applications to treat the database as though it were a single-instance deployment, making it simpler to work with from a developer perspective. It also offers CDC changefeeds to facilitate its use within event-driven applications.

The cons of distributed databases also vary based on the specifics of the database's type, configuration, and the workloads it'll be handling. In general, though, potential downsides to a distributed database may include:

 Increased operational complexity. Deploying, configuring, managing, and optimizing a distributed database can be more complex than working with a single-instance DB. However, many distributed databases offer managed DBaaS deployment options that handle the operational work for you.

 Increased learning curve. Distributed databases work differently, and it typically takes some time for teams to adapt to the new set of best practices. In the case of NoSQL databases, there may also be a learning curve for developers who aren't familiar with the language, as some popular NoSQL databases use proprietary query languages. (Distributed SQL databases, on the other hand, use a language that most developers are already familiar with: SQL.)

Database Audit and Optimization

Database servers often hold some of your organisation's most sensitive and valuable information, so database auditing helps to answer questions such as: "Who accessed or changed data?", "When was it actually changed?" and "What was the old content prior to the change?" Your ability to answer such questions can make or break a compliance audit.

By optimising the database you can protect it from external or internal unauthorized access and from user errors.

Following your approval, Tuna consulting can perform tests on your databases to check their security. Our experts will conduct all the necessary assessments to find vulnerabilities and fix them before someone else takes advantage of them.

During database auditing we will:

 Check the database for any vulnerabilities, incorrectly installed applications and scripts.

 Investigate suspicious activity.

 Detect problems with authorization or access control implementation.

 Monitor and gather data about specific activities.

What Is a Database Audit and Why Is It Important?

A database audit requires analysis of your database, including users, their permissions, and access to data, to ensure compliance with GDPR, HIPAA, PCI, and SOX; system integrity; penetration evaluation; and other security vetting.

It's essential to perform a database audit to ensure that your organization complies with laws and regulations, to check data integrity and database performance, and to protect against cyber threats. In addition, a database audit performed regularly can help your organization achieve business continuity and overall data reliability.
The Different Types of Database Audits

There are multiple types of database audits, including, but not limited to, the following:

 Security Auditing – Security auditing verifies that robust passwords are in place, ensures that sensitive data is protected through encryption, and confirms that only those with proper clearance can access the information.

 Compliance Auditing – Ensures compliance with industry regulations and legal requirements such as GDPR, HIPAA, PCI, and SOX. It involves reviewing the database to confirm that proper measures are in place to protect data and that the organization is adhering to relevant laws and regulations about data management.

 Data Auditing – A data audit monitors and logs data access and modifications. It allows you to trace who accessed the data and what changes were made, including identifying individuals responsible for adding, modifying, or deleting data. It also enables tracking of when these changes are made.

 Configuration Auditing – Configuration auditing involves monitoring and tracking the actions taken by users and database administrators, including creating and modifying database objects, managing user accounts, and making changes to the database's configuration. In addition, it covers system-level changes such as database software updates, operating system modifications, and hardware changes.

The Benefits of a Database Audit

The benefits of a database audit include security, compliance, and data integrity. A database audit can help you ensure your organization is not vulnerable to potential threats, remain compliant with relevant laws and regulations such as GDPR, HIPAA, PCI, and SOX, and ensure data is accurate, complete, and consistent.

A database audit can also help with business continuity by making sure the database is available and accessible at all times. In addition, should an issue occur where a database becomes corrupt or is attacked, a database audit can ensure that a disaster recovery plan is in place.

With proper auditing and tracking, which includes detailed records of all activities that have taken place in a database, you can quickly discover common issues during a database audit. By resolving these errors, you can increase the performance of your database, which would otherwise be slow due to slow queries, blocked processes, and other bottlenecks.

How to Perform a Database Audit

Performing a database audit depends on the needs and requirements of your organization. Below are four key areas you should focus on when performing a database audit.

Audit Access and Authentication

Analyze data on user login attempts and review access control settings, including authentication methods.

Audit User and Administrator Activity

Collect and analyze data on user and administrator actions to review what they've done, including whether they have created and modified database objects, user accounts, and any other configuration changes.

Monitor Security Activity

Collect and analyze data on security-related events, including firewall logs, intrusion detection system alerts, and system-wide changes, to monitor for unusual or suspicious activity.

How to Correct the Issues Discovered during a Database Audit

Correcting issues discovered during a database audit depends on the type of database audited. The first thing a database auditor needs to do is review the audit report and understand the identified issues, then create a plan of action to address them. Before making any corrections to a database, it is recommended that you make backups in case you need to revert the database to its original state.

Resolve issues by applying patches and updates and ensuring the database is running the latest version. Additionally, more advanced changes can be made in the database's configuration settings for security, such as authentication and access control. Database administrators can also reorganize database objects such as tables and indexes to improve performance, as this can also resolve issues.

Once your changes and updates are made, monitor the database carefully to ensure no additional issues are discovered. As a best practice, performing a database audit after making changes and updates ensures the database is running properly.

Database Audit Vulnerability and Threat Detection

Identify and assess vulnerabilities and threats to the database, such as unpatched software, weak passwords, and access by unauthorized users.

Database administrators use tools such as Lynis, a security auditing tool for Linux, to help with database audits. It is free, open source, and can be modified or extended based on your preferences. To identify and prioritize security issues and vulnerabilities, Lynis provides detailed reports of your database's security-related configuration settings. As a result, database administrators can schedule regular security assessments and identify potential issues early.

Advanced Data Structures

Advanced data structures are one of the most important disciplines of data science, since they are used for storing, organizing, and managing data and information to make it more efficient and easier to access and modify. They are the foundation for designing and developing efficient and effective software and algorithms. Knowing how to choose and construct a good data structure is essential for being a skilled programmer. With the rise of new information technology and working practices, the scope of data structures is likewise expanding.

The efficiency and performance of many algorithms depend directly upon the data that the particular algorithm is using for calculations and other data operations. Data structures perform the basic operation of storing data during the run-time or execution of the program in main memory, so memory management is also an important factor, as it directly affects the space complexity of a program or algorithm. Choosing an appropriate data structure therefore plays an important role in program efficiency and performance.
What is Searching in Data Structure?

The process of finding the desired information from the set of items
stored in the form of elements in the computer memory is referred to
as ‘searching in data structure’. These sets of items are in various
forms, such as an array, tree, graph, or linked list. Another way of
defining searching in the data structure is by locating the desired
element of specific characteristics in a collection of items.


Searching Methods in Data Structures

Searching in a data structure can be done by implementing searching algorithms to check for or retrieve an element from any form of stored data structure. These algorithms are categorised based on their type of search operation, such as:

 Sequential search

The array or list of elements is traversed sequentially while checking every component of the set.

For example: Linear Search.

 Interval Search

Algorithms designed explicitly for searching in sorted data structures are included in the interval search category. The efficiency of these algorithms is far better than that of linear search algorithms.

For example: Binary Search, Logarithmic Search.


These methods are examined based on the time taken by an algorithm to find an element matching the search item in the data collection, and are given by:


 The best possible time

 The average time

 The worst-case time

The primary concern is the worst-case time, which leads to guaranteed predictions of the algorithm's performance and is also easier to calculate than average times.

There are numerous searching algorithms for data structures, such as linear search, binary search, interpolation search, jump search, exponential search, Fibonacci search, sublist search, the ubiquitous binary search, unbounded binary search, recursive substring search, and recursive linear search of a given array. This article is restricted to the linear and binary search algorithms and their working principles.

Let's get detailed insight into linear search and binary search in data structures.

Linear Search

The linear search algorithm searches all elements in the array sequentially. Its best execution time is 1 comparison, whereas the worst execution time is n comparisons, where n is the total number of items in the search array.

It is the simplest search algorithm in data structures: it checks each item in the collection of elements until it either matches the search element or reaches the end of the data collection. A linear search algorithm is preferred when the data is unsorted.
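As a concrete illustration of the description above, here is a minimal linear search sketch in Python; the function name and sample list are illustrative.

def linear_search(items, target):
    """Return the index of target in items, or -1 if it is not present."""
    for index, value in enumerate(items):
        if value == target:
            return index      # best case: the target is the first item (1 comparison)
    return -1                 # worst case: n comparisons, target is absent

For example, linear_search([7, 3, 9, 4], 9) returns 2, while linear_search([7, 3, 9, 4], 5) returns -1 after checking every element.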

Binary Search

This algorithm finds a specific item by comparing it against the middlemost item of the data collection. When a match occurs, it returns the index of the item. When the middle item is greater than the search item, it searches the middle item of the left sub-array; in contrast, if the middle item is smaller than the search item, it searches the middle item of the right sub-array. It continues searching for the item until it finds it or until the sub-array's size becomes zero.

Binary search needs the items to be in sorted order. It is faster than a linear search algorithm and works on the divide-and-conquer principle.

Run-time complexity = O(log n)

The binary search algorithm has the complexities given below:

 Worst-case complexity = O(log n)

 Average complexity = O(log n)

 Best-case complexity = O(1)
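The following is a minimal iterative binary search sketch in Python that mirrors the description above; the function name and sample data are illustrative, and the input list must already be sorted in ascending order.

def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is not present."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:                        # stop when the sub-array size becomes zero
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid                        # match: return the index
        if sorted_items[mid] < target:
            low = mid + 1                     # continue in the right sub-array
        else:
            high = mid - 1                    # continue in the left sub-array
    return -1

For example, binary_search([2, 5, 8, 12, 16, 23], 16) returns 4 after at most O(log n) comparisons.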

Interpolation Search

It is an improved variant of the binary search algorithm that probes the position of the search element based on its value. Like binary search, it works efficiently only on sorted data collections.

Worst execution time = O(n)

Interpolation search is useful when the target element's position can be estimated from its value, for example when the data is uniformly distributed. To find a number in a telephone directory, if one wants to search for Monica's telephone number, instead of using linear or binary search, one can probe directly towards the part of the directory where names start with 'M'.
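Below is a minimal Python sketch of interpolation search over a sorted list of numbers, illustrating the proportional probing described above; the function name and data are illustrative.

def interpolation_search(sorted_items, target):
    """Return the index of target in a sorted list of numbers, or -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high and sorted_items[low] <= target <= sorted_items[high]:
        if sorted_items[low] == sorted_items[high]:
            return low if sorted_items[low] == target else -1
        # Probe a position proportional to where the target's value falls in the range
        pos = low + (target - sorted_items[low]) * (high - low) // (sorted_items[high] - sorted_items[low])
        if sorted_items[pos] == target:
            return pos
        if sorted_items[pos] < target:
            low = pos + 1                 # target lies in the right part
        else:
            high = pos - 1                # target lies in the left part
    return -1

On uniformly distributed data this probing lands close to the target quickly; on skewed data it can degrade to the O(n) worst case noted above.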

Hashing

One of the most widely used searching techniques in data structures, hashing transforms how we access and retrieve data. Fundamental to hashing are hash functions, which convert input data into fixed-size values called hashes. Hashing allows near constant-time access, providing a direct path to the element of interest.

To explain this style of searching, let's examine the intricacies of creating hash functions and overcoming potential obstacles like collisions as we go into the principles of hashing.

Understanding Hashing

Hashing is essentially a secret code for data. An input (or key) is passed to a hash function, which converts it into a fixed-length string of characters, typically a combination of integers and letters. The generated hash is then used as an index or address into a data structure, usually an array, to find the corresponding data.

Compromises in Hashing

Hashing has trade-offs, despite the appeal of constant-time access. The quality of the hash function determines how efficient hashing is; poorly constructed hash functions can increase collisions and reduce performance. Furthermore, overly complicated hash functions can introduce computational costs of their own.
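To make the hash-as-index idea and collision handling concrete, here is a minimal Python sketch of a hash table that resolves collisions with separate chaining. It relies on Python's built-in hash() and is purely illustrative, not a production design; the class and method names are assumptions for the example.

class ChainedHashTable:
    """A minimal hash table that resolves collisions with separate chaining."""

    def __init__(self, bucket_count=16):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # the hash value becomes an array index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)        # update an existing key in place
                return
        bucket.append((key, value))             # collision or new key: extend the chain

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

With a good hash function the chains stay short and lookups stay close to constant time; a poor hash function pushes many keys into the same bucket and lookups degrade towards a linear scan, which is exactly the trade-off described above.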

Depth-First Search (DFS)

When we move from searching in linear data structures to the more complex domain of trees, Depth-First Search (DFS) becomes a key method for exploring tree branches.

The structural diversity of trees and graphs is easily accommodated by DFS. The implementation is elegant because of its recursive nature, which mimics the innate recursive structure of trees. The traversal's depth-first design is advantageous when the focus is on following one path all the way to its end before examining other options.

Let's examine the versatility and effectiveness of DFS as a searching operation by exploring its uses in various tree-based structures, such as binary trees, and in graphs.
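Here is a minimal recursive DFS sketch in Python over a graph stored as an adjacency list (a dictionary mapping each node to its neighbours); the function name and the sample graph are illustrative.

def depth_first_search(graph, node, target, visited=None):
    """Return True if target is reachable from node, exploring one branch fully before backtracking."""
    if visited is None:
        visited = set()
    if node == target:
        return True
    visited.add(node)
    for neighbour in graph.get(node, []):
        if neighbour not in visited:
            if depth_first_search(graph, neighbour, target, visited):
                return True                 # found on this branch, stop searching
    return False                            # backtrack and try the next branch

For example, with graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}, depth_first_search(graph, "A", "D") returns True by following the path A -> B -> D before ever visiting C.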

Breadth-First Search (BFS)

Breadth-First Search (BFS) is a logical and systematic way to explore a tree's levels. In contrast to Depth-First Search (DFS), BFS takes a different approach by focusing on the shallowest levels before going deeper.

Let's examine the workings of BFS, how to use it for search in data structures, its benefits, and its applications.

How Breadth-First Search Works

BFS goes through a tree or graph level by level, methodically investigating every node at each level before going on to the next. The method ensures a thorough examination of the entire structure by gradually covering each level, starting from the root (or a selected node). BFS uses a queue data structure to keep track of the node-processing order, which promotes a systematic and well-organized traversal.
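The sketch below mirrors that description with a queue from Python's standard library; the function name and sample graph are illustrative.

from collections import deque

def breadth_first_search(graph, start, target):
    """Return True if target is reachable from start, exploring the graph level by level."""
    queue = deque([start])          # the queue holds nodes in the order they were discovered
    visited = {start}
    while queue:
        node = queue.popleft()      # process the shallowest undiscovered node first
        if node == target:
            return True
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)   # enqueue the next level for later processing
    return False

Using the same sample graph as in the DFS example, breadth_first_search(graph, "A", "D") visits A, then B and C (the first level), and only then D.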

Linear search, or sequential search, is a method for finding an element within a list. It sequentially checks each element of the list until a match is found or the whole list has been searched.[1]

A linear search runs in at worst linear time and makes at most n comparisons, where n is the length of the list. If each element is equally likely to be searched, then linear search has an average case of (n + 1)/2 comparisons, but the average case can be affected if the search probabilities for each element vary. Linear search is rarely practical because other search algorithms and schemes, such as the binary search algorithm and hash tables, allow significantly faster searching for all but short lists.[2]

A dichotomic search is a search algorithm that operates by selecting between two distinct alternatives (dichotomies) at each step. It is a specific type of divide-and-conquer algorithm. A well-known example is binary search.[1]

Abstractly, a dichotomic search can be viewed as following edges of an implicit binary tree structure until it reaches a leaf (a goal or final state). This creates a theoretical tradeoff between the number of possible states and the running time: given k comparisons, the algorithm can only reach O(2^k) possible states and/or possible goals.

Some dichotomic searches only have results at the leaves of the tree, such as the Huffman tree used in Huffman coding, or the implicit classification tree used in Twenty Questions. Other dichotomic searches also have results in at least some internal nodes of the tree, such as a dichotomic search table for Morse code. There is thus some looseness in the definition. Though there may indeed be only two paths from any node, there are thus three possibilities at each step: choose one onward path or the other, or stop at this node.

Dichotomic searches are often used in repair manuals, sometimes graphically illustrated with a flowchart similar to a fault tree.

Sorting Algorithms

Sorting Algorithms are fundamental concepts that play a vital role in data manipulation and organisation. Understanding their mechanics, use cases, and complexity is key to efficient and comprehensive handling of data arrays.

Introduction to Sorting Algorithms

Understanding Sorting Algorithms is an essential part of any exploration into the field of Computer Science. These procedures are used to organise items in a specific order, making it far easier and more efficient to access, analyse, and manipulate data.

What are Sorting Algorithms?

Sorting Algorithms in Computer Science: these are specific procedures used for organising data in a particular order (usually ascending or descending), thus allowing for more efficient data handling and manipulation.

A simple example of a Sorting Algorithm is Bubble Sort. This algorithm works by repeatedly stepping through the list of items, comparing each pair and swapping them if they are in the wrong order, until the list is sorted.

Sorting Algorithms play a critical role in many areas of Computer Science and are part of almost every application that involves data manipulation. They are categorised based on multiple factors such as:

 Computational complexity: an analysis of how the algorithm's running time and memory use grow with the size of the input.

 Stability: whether items that compare as equal keep their original relative order after sorting.

When it comes to handling large amounts of data, the need for Sorting Algorithms becomes evident. Let's dive into why Sorting Algorithms occupy such a pivotal role in Computer Science.

 Sorting Algorithms are integral to optimising the efficiency of other algorithms and of data handling in general. They help in quickly locating and accessing data in a database, improving the speed of storing and retrieving data.

 They are also essential for efficient management of resources. By organising and managing data well, resources like memory and processing power can be used more efficiently, leading to better performance.


Lastly, Sorting Algorithms play a decisive role in the field of data analysis, where being able to organise and visualise data in a structured manner can support more efficient and insightful outcomes. For instance, if data is arranged in ascending or descending order, patterns, trends, and outliers in the data can be identified more easily.

Types of Sorting Algorithm

Some of the most commonly used Sorting Algorithms are:

 Bubble Sort

 Selection Sort

 Insertion Sort

 Merge Sort

 Quick Sort
 Heap Sort

 Radix Sort

Bubble Sort

Bubble Sort, as the name suggests, repeatedly steps through the list, compares each pair of adjacent items, and swaps them if they are in the wrong order. The pass through the list is repeated until the list is sorted.
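A minimal Python sketch of Bubble Sort, matching the description above; the function name is illustrative, and the early exit when no swaps occur is a common optional optimisation.

def bubble_sort(items):
    """Sort the list in place by repeatedly swapping adjacent out-of-order pairs."""
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):                  # the last i items are already in place
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:                             # a full pass with no swaps means the list is sorted
            break
    return items

For example, bubble_sort([5, 1, 4, 2]) returns [1, 2, 4, 5].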

Selection Sort

Selection Sort is a simple in-place comparison sort. It divides the input into a sorted and an unsorted region, and repeatedly picks the smallest (or largest, if you are sorting in descending order) element from the unsorted region and moves it to the sorted region.
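The following is a minimal Python sketch of Selection Sort in ascending order; the function name is illustrative.

def selection_sort(items):
    """Repeatedly move the smallest unsorted element to the end of the sorted region."""
    for i in range(len(items) - 1):
        smallest = i
        for j in range(i + 1, len(items)):          # scan the unsorted region
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]   # grow the sorted region by one
    return items

For example, selection_sort([64, 25, 12, 22]) returns [12, 22, 25, 64].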

Insertion Sort

Another simple sorting algorithm is Insertion Sort. It builds the final sorted array one item at a time, much like the way you sort playing cards in your hands. The array is imagined to be divided into a sorted and an unsorted region. Each subsequent item from the unsorted region is inserted into the sorted region at its correct place.
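A minimal Python sketch of Insertion Sort matching that description; the function name is illustrative.

def insertion_sort(items):
    """Insert each element into its correct place within the sorted region on its left."""
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]       # shift larger items one position to the right
            j -= 1
        items[j + 1] = current            # drop the current item into its slot
    return items

For example, insertion_sort([9, 3, 7, 1]) returns [1, 3, 7, 9].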

Search Algorithms

A Search Algorithm is a procedure that takes in an array or data structure, like a list or tree, and an element you are looking for. The algorithm's purpose is to identify the location of this target element within the given structure, if it exists.

There are primarily two types of search algorithms:

 Sequential Search: applied when the items are stored in no particular order. This method examines each element from the start to find the item.

 Interval Search: suitable for ordered or sorted items. This method selectively eliminates portions of the collection to find the item.
