Itm Mod 3
Data governance deals with the policies and processes for managing the
availability, usability, integrity, and security of the data employed in an enterprise,
with special emphasis on promoting privacy, security, data quality, and compliance
with government regulations.
A large organization will also have a database design and management group that
is responsible for defining and organizing the structure and content of the database,
and maintaining the database. The functions it performs are called database
administration.
In managing data, steps must be taken to ensure that the data in organizational
databases are accurate and remain reliable. Data that are inaccurate, untimely, or
inconsistent with other sources of information lead to incorrect decisions, product
recalls, and even financial losses.
A good database design also includes efforts to maximize data quality and
eliminate error. Some data quality problems result from redundant and inconsistent
data, but most stem from errors in data input. Organizations need to identify and
correct faulty data and establish better routines for input and editing.
A data quality audit can be performed by surveying entire data files, surveying samples of data files, or surveying end users' impressions of data quality. Data cleansing (or data scrubbing) techniques can be used to correct data and enforce consistency among different sets of data.
What Is Data Management?
Data management is the development and execution of processes, architectures,
policies, practices and procedures in order to manage the information generated by
an organization.
The effective management of data within any organization has grown in importance in recent years as organizations are subject to an increasing number of compliance regulations, large increases in information storage capacity, and the sheer amount of data and documents being generated. This rate of growth is not expected to slow down, as IDC predicts the amount of information generated will increase 29-fold by 2020. These large volumes of data from ERP systems, CRM systems and general business documents are often referred to as big data.
Data Independence
Data independence is the type of data transparency that matters for a
centralized DBMS. It refers to the immunity of user applications to changes made
in the definition and organization of data.
Physical data independence deals with hiding the details of the storage structure
from user applications. The application should not be involved with these issues,
since there is no difference in the operation carried out against the data.
A database system normally contains a lot of data in addition to users' data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database. But as a DBMS expands, it needs to change over time to satisfy the requirements of its users. If all of the data were interdependent, changing any of it would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one
layer, it does not affect the data at another level. This data is independent but
mapped to each other.
Logical Data Independence
Logical data is data about the database; that is, it records how the data is managed inside the database: for example, a table (relation) stored in the database and all the constraints applied to that relation.
Logical data independence is a mechanism that insulates this logical level from the actual data stored on the disk. If we make changes to the table format, it should not change the data residing on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk.
Physical data independence is the power to change the physical data without
impacting the schema or logical data.
For example, in case we want to change or upgrade the storage system itself −
suppose we want to replace hard-disks with SSD − it should not have any impact
on the logical data or schemas.
DATA CONSISTENCY
Consistency in database systems refers to the requirement that any given database
transaction must change affected data only in allowed ways. Any data written to
the database must be valid according to all defined rules,
including constraints, cascades, triggers, and any combination thereof. This does
not guarantee correctness of the transaction in all ways the application programmer
might have wanted (that is the responsibility of application-level code) but merely
that any programming errors cannot result in the violation of any defined database
constraints.[1]
Consistency, in the context of databases, states that data cannot be written that
would violate the database’s own rules for valid data. If a certain transaction
occurs that attempts to introduce inconsistent data, the entire transaction is rolled
back and an error returned to the user.
A simple consistency rule may state that the 'Gender' column of a database may only contain the values 'Male', 'Female' or 'Unknown'. If a user attempts to enter any other value, the database consistency rule kicks in and disallows the entry of such a value.
Consistency rules can get quite elaborate. For example, a bank account number must follow a specific pattern: it must begin with a 'C' for a checking account or an 'S' for a savings account, followed by 14 digits taken from the date and time, in the format YYYYMMDDHHMISS.
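The account-number rule above can be sketched as a validation check. This is a minimal illustration, not part of the original text; the function name and the use of Python's re module are assumptions.

```python
import re
from datetime import datetime

# Pattern from the rule above: 'C' (checking) or 'S' (savings),
# followed by 14 digits in YYYYMMDDHHMISS order.
ACCOUNT_PATTERN = re.compile(r"^[CS]\d{14}$")

def is_valid_account_number(account_number: str) -> bool:
    """Return True if the account number matches the stated format."""
    if not ACCOUNT_PATTERN.match(account_number):
        return False
    # The 14 digits must also form a real date and time.
    try:
        datetime.strptime(account_number[1:], "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True

print(is_valid_account_number("C20240315143059"))  # True
print(is_valid_account_number("A20240315143059"))  # False: bad prefix
```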
Database consistency does not only occur at the single-record level. In our bank
example above, another consistency rule may state that the ‘Customer Name’ field
cannot be empty when creating a customer.
Consistency rules are vitally important when creating databases, as they are the embodiment of the business rules for which the database is being created. They also serve another important function: they make application developers' work easier, since it is usually much easier to define consistency rules at the database level than in the application that connects to the database.
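Defining such rules at the database level can be sketched with SQLite's CHECK constraints; the table and column names here are illustrative, not from the original text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A domain rule for Gender and a not-empty rule for the customer name,
# both enforced by the database itself via CHECK constraints.
conn.execute("""
    CREATE TABLE customer (
        name   TEXT NOT NULL CHECK (length(name) > 0),
        gender TEXT CHECK (gender IN ('Male', 'Female', 'Unknown'))
    )
""")

conn.execute("INSERT INTO customer VALUES ('Ramesh', 'Male')")  # accepted

try:
    # Violates the Gender domain rule: the consistency rule kicks in.
    conn.execute("INSERT INTO customer VALUES ('Khilan', 'Other')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The application never has to re-implement the rule; every client that connects to this database gets the same enforcement.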
Data Access
Definition - What does Data Access mean?
Data access refers to a user's ability to access or retrieve data stored within a
database or other repository. Users who have data access can store, retrieve, move
or manipulate stored data, which can be stored on a wide range of hard drives and
external devices.
Data can be accessed randomly or sequentially. When using random access, the data is often split into multiple parts or pieces located anywhere on a disk. Sequential files are usually faster to load and retrieve because they require fewer seek operations.
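The contrast between sequential and random access can be sketched with fixed-width records in a plain file; the record format and file name are illustrative assumptions.

```python
import os
import tempfile

# Write 5 fixed-width records (10 bytes each) so that any record can be
# located by computing its byte offset: the basis of random access.
RECORD_SIZE = 10
path = os.path.join(tempfile.mkdtemp(), "records.dat")
with open(path, "wb") as f:
    for i in range(5):
        f.write(f"rec{i:02d}".ljust(RECORD_SIZE).encode())

with open(path, "rb") as f:
    # Sequential access: read records in storage order, no seek needed.
    first = f.read(RECORD_SIZE).decode().strip()
    # Random access: jump straight to record 3 by its offset.
    f.seek(3 * RECORD_SIZE)
    third = f.read(RECORD_SIZE).decode().strip()

print(first, third)  # rec00 rec03
```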
DATA ADMINISTRATION
A database administrator (DBA) directs or performs all activities related to
maintaining a successful database environment. Responsibilities include designing,
implementing, and maintaining the database system; establishing policies and
procedures pertaining to the management, security, maintenance, and use of
the database management system; and training employees in database management
and use. A DBA is expected to stay abreast of emerging technologies and new
design approaches. Typically, a DBA has either a degree in Computer Science and
some on-the-job training with a particular database product or more extensive
experience with a range of database products. A DBA is usually expected to have experience with one or more of the major database management products, such as Microsoft SQL Server, SAP, and Oracle-based database management software.
The primary role of database administration is to ensure maximum up time for the
database so that it is always available when needed. This will typically involve
proactive periodic monitoring and troubleshooting. This in turn entails some
technical skills on the part of the DBA. In addition to in-depth knowledge of the
database in question, the DBA will also need knowledge and perhaps training in
the platform (database engine and operating system) on which the database runs.
A DBA is typically also responsible for other secondary, but still critically
important, tasks and roles. Some of these include:
• Database Security: Ensuring that only authorized users have access to the
database and fortifying it against any external, unauthorized access.
• Database Tuning: Tweaking any of several parameters to optimize
performance, such as server memory allocation, file fragmentation and disk
usage.
• Backup and Recovery: It is a DBA's role to ensure that the database has
adequate backup and recovery procedures in place to recover from any
accidental or deliberate loss of data.
• Producing Reports from Queries: DBAs are frequently called upon to
generate reports by writing queries, which are then run against the database.
It is clear from all the above that the database administration function requires
technical training and years of experience. Some companies that offer commercial
database products, such as Oracle DB and Microsoft's SQL Server, also offer
certifications for their specific products. These industry certifications, such as
Oracle Certified Professional (OCP) and Microsoft Certified Database
Administrator (MCDBA), go a long way toward assuring organizations that a DBA
is indeed thoroughly trained on the product in question. Because most relational database products today use the SQL language, knowledge of SQL commands and syntax is also a valuable asset for today's DBAs.
MANAGING CONCURRENCY
In a multiprogramming environment where multiple transactions can be executed
simultaneously, it is highly important to control the concurrency of transactions.
We have concurrency control protocols to ensure atomicity, isolation, and
serializability of concurrent transactions. Concurrency control protocols can be
broadly divided into two categories −
• Lock based protocols
• Time stamp based protocols
Lock-based Protocols
Database systems equipped with lock-based protocols use a mechanism by which
any transaction cannot read or write data until it acquires an appropriate lock on it.
Locks are of two kinds −
• Binary Locks − A lock on a data item can be in two states; it is either locked
or unlocked.
• Shared/exclusive − This type of locking mechanism differentiates the locks
based on their uses. If a lock is acquired on a data item to perform a write
operation, it is an exclusive lock. Allowing more than one transaction to write
on the same data item would lead the database into an inconsistent state. Read
locks are shared because no data value is being changed.
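The shared/exclusive mechanism above can be sketched as a small lock class; the class and method names are illustrative, and a real DBMS lock manager is far more elaborate.

```python
import threading

class SharedExclusiveLock:
    """Sketch of a shared/exclusive lock: many readers OR one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # transactions holding the shared lock
        self._writer = False   # is an exclusive lock held?

    def acquire_shared(self):
        with self._cond:
            while self._writer:          # readers wait out any writer
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            # A writer needs the data item entirely to itself.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Two reading transactions can hold the shared lock at once, but an exclusive acquire blocks until every shared holder releases, which is exactly why concurrent writes to the same item cannot interleave.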
Database Security
Definition - What does Database Security mean?
Database security refers to the collective measures used to protect and secure a
database or database management software from illegitimate use and malicious
threats and attacks.
It is a broad term that includes a multitude of processes, tools and methodologies that ensure security within a database environment. These measures include, for example:
• Physical security of the database server and backup equipment from theft
and natural disasters
Crash Recovery
DBMS is a highly complex system with hundreds of transactions being executed
every second. The durability and robustness of a DBMS depends on its complex
architecture and its underlying hardware and system software. If it fails or crashes
amid transactions, it is expected that the system would follow some sort of
algorithm or techniques to recover lost data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various
categories, as follows −
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from
where it can’t go any further. This is called transaction failure where only a few
transactions or processes are hurt.
Reasons for a transaction failure could be −
• Logical errors − Where a transaction cannot complete because it has some
code error or any internal error condition.
• System errors − Where the database system itself terminates an active
transaction because the DBMS is not able to execute it, or it has to stop
because of some system condition. For example, in case of deadlock or
resource unavailability, the system aborts an active transaction.
System Crash
There are problems, external to the system, that may cause the system to stop abruptly and crash. For example, an interruption in the power supply may cause the underlying hardware or software to fail. Operating system errors are another example.
Disk Failure
In the early days of technology evolution, it was a common problem that hard-disk drives or storage drives failed frequently.
Disk failures include the formation of bad sectors, unreachability of the disk, a disk head crash, or any other failure that destroys all or part of the disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be
divided into two categories −
• Volatile storage − As the name suggests, a volatile storage cannot survive
system crashes. Volatile storage devices are placed very close to the CPU;
normally they are embedded onto the chipset itself. For example, main
memory and cache memory are examples of volatile storage. They are fast
but can store only a small amount of information.
• Non-volatile storage − These memories are made to survive system crashes.
They are huge in data storage capacity, but slower in accessibility. Examples
may include hard-disks, magnetic tapes, flash memory, and non-volatile
(battery backed up) RAM.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various
files opened for them to modify the data items. Transactions are made of various
operations, which are atomic in nature. But according to ACID properties of
DBMS, atomicity of transactions as a whole must be maintained, that is, either all
the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
• It should check the states of all the transactions, which were being executed.
• A transaction may be in the middle of some operation; the DBMS must ensure
the atomicity of the transaction in this case.
• It should check whether the transaction can be completed now or it needs to
be rolled back.
• No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well
as maintaining the atomicity of a transaction −
• Maintaining the logs of each transaction, and writing them onto some stable
storage before actually modifying the database.
• Maintaining shadow paging, where the changes are done on a volatile
memory, and later, the actual database is updated.
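The first technique, log-based recovery, can be sketched as a toy write-ahead log. The in-memory dictionary and the log record format here are illustrative stand-ins for real database pages and a stable-storage log file.

```python
# Toy write-ahead log: every change is appended to the log before the
# database itself is modified, so recovery can undo uncommitted work.
database = {"A": 100, "B": 200}
log = []   # stands in for the stable-storage log file

def write(txn_id, key, new_value):
    # Log the old and new values first, then modify the database.
    log.append((txn_id, key, database[key], new_value))
    database[key] = new_value

def commit(txn_id):
    log.append((txn_id, "COMMIT", None, None))

def recover():
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    # Undo, in reverse order, every change by an uncommitted transaction,
    # restoring atomicity: all of a transaction's effects survive, or none.
    for txn_id, key, old, _ in reversed(log):
        if key != "COMMIT" and txn_id not in committed:
            database[key] = old

write("T1", "A", 150)
commit("T1")
write("T2", "B", 999)   # the "crash" happens before T2 commits
recover()
print(database)  # {'A': 150, 'B': 200}
```

After recovery, the committed transaction T1 keeps its effect while the in-flight T2 is rolled back, which is the guarantee the bullet list above describes.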
What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS is the basis
for SQL, and for all modern database systems like MS SQL Server, IBM DB2,
Oracle, MySQL, and Microsoft Access.
A Relational database management system (RDBMS) is a database management
system (DBMS) that is based on the relational model as introduced by E. F. Codd.
What is a table?
The data in an RDBMS is stored in database objects called tables. A table is basically a collection of related data entries and consists of numerous columns and rows.
Remember, a table is the most common and simplest form of data storage in a
relational database. The following program is an example of a CUSTOMERS table
−
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
What is a field?
Every table is broken up into smaller entities called fields. The fields in the
CUSTOMERS table consist of ID, NAME, AGE, ADDRESS and SALARY.
A field is a column in a table that is designed to maintain specific information about
every record in the table.
What is a Record or a Row?
A record, also called a row of data, is each individual entry that exists in a table. For example, there are 7 records in the above CUSTOMERS table. Following is a single row of data, or record, in the CUSTOMERS table −
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
+----+----------+-----+-----------+----------+
A record is a horizontal entity in a table.
What is a column?
A column is a vertical entity in a table that contains all information associated with
a specific field in a table.
For example, a column in the CUSTOMERS table is ADDRESS, which represents
location description and would be as shown below −
+-----------+
| ADDRESS   |
+-----------+
| Ahmedabad |
| Delhi     |
| Kota      |
| Mumbai    |
| Bhopal    |
| MP        |
| Indore    |
+-----------+
Data Integrity
The following categories of data integrity exist with each RDBMS −
• Entity Integrity − There are no duplicate rows in a table.
• Domain Integrity − Enforces valid entries for a given column by restricting
the type, the format, or the range of values.
• Referential integrity − Rows that are referenced by other records cannot be
deleted.
• User-Defined Integrity − Enforces some specific business rules that do not
fall into entity, domain or referential integrity.
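Referential integrity in particular can be sketched with SQLite's foreign keys; the table names are illustrative, and note that SQLite enforces foreign keys only when the pragma below is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ramesh')")
conn.execute("INSERT INTO orders VALUES (10, 1)")

# Referential integrity: the customer row cannot be deleted while an
# order row still references it.
try:
    conn.execute("DELETE FROM customers WHERE id = 1")
except sqlite3.IntegrityError as e:
    print("blocked:", e)
```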
Database Normalization
Database normalization is the process of efficiently organizing data in a database.
There are two reasons for this normalization process −
• Eliminating redundant data, for example, storing the same data in more than
one table.
• Ensuring data dependencies make sense.
Both of these are worthy goals, as they reduce the amount of space a database consumes and ensure that data is stored logically. Normalization consists of a series of guidelines that help you create a good database structure.
Normalization guidelines are divided into normal forms; think of a form as the
format or the way a database structure is laid out. The aim of normal forms is to
organize the database structure, so that it complies with the rules of first normal
form, then second normal form and finally the third normal form.
It is your choice to take it further and go to the fourth normal form, fifth normal
form and so on, but in general, the third normal form is more than enough.
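The redundancy-elimination goal can be sketched with two small tables; the schema and sample rows are illustrative assumptions, not from the original text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalized, a customer's city would be repeated on every order row,
# so a city change would have to be made in many places. Normalized,
# the customer data lives in its own table and each order carries only
# the customer's key.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(id),
                            amount REAL);
    INSERT INTO customers VALUES (1, 'Ramesh', 'Ahmedabad');
    INSERT INTO orders VALUES (10, 1, 2000.00), (11, 1, 350.00);
    -- One update now fixes the city for every order Ramesh placed.
    UPDATE customers SET city = 'Delhi' WHERE id = 1;
""")

row = conn.execute("""
    SELECT c.city, COUNT(o.id)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
""").fetchone()
print(row)  # ('Delhi', 2)
```

A single UPDATE keeps every order consistent, which is exactly the dependency hygiene the normal forms aim for.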
Data Warehouse
A data warehouse is a federated repository for all the data collected by an
enterprise's various operational systems, be they physical or logical. Data
warehousing emphasizes the capture of data from diverse sources for access and
analysis rather than for transaction processing.
Typically, a data warehouse is a relational database housed on an
enterprise mainframe server or, increasingly, in the cloud. Data from various
online transaction processing (OLTP) applications and other sources are selectively
extracted for business intelligence activities, decision support and to answer user
inquiries.
A data warehouse stores data that is extracted from operational data stores and external sources. The data records within the warehouse must contain enough detail to make them searchable and useful to business users. Taken together, there are three main components of data warehousing:
• data sources from operational systems, such as Excel, ERP, CRM or financial
applications;
• a data staging area where data is cleaned and ordered; and
• a presentation area where data is warehoused.
Data analysis tools, such as business intelligence software, access the data within
the warehouse. Data warehouses can also feed data marts, which are decentralized
systems in which data from the warehouse is organized and made available to
specific business groups, such as sales or inventory teams.
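The three components above can be sketched as a minimal extract-clean-present pipeline; the source records and the cleaning rules are illustrative assumptions.

```python
# Extract: raw records as an operational system might export them.
sources = [
    {"name": " Ramesh ", "sales": "2000"},
    {"name": "Khilan",   "sales": "1500"},
]

# Staging area: clean (trim names, convert types) and order the data.
staged = sorted(
    ({"name": r["name"].strip(), "sales": float(r["sales"])} for r in sources),
    key=lambda r: r["name"],
)

# Presentation area: the warehoused form that analysis tools query.
warehouse = {r["name"]: r["sales"] for r in staged}
print(warehouse)  # {'Khilan': 1500.0, 'Ramesh': 2000.0}
```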
Data Mining
Definition - What does Data Mining mean?
Data mining is the process of analyzing data from different perspectives to uncover hidden patterns and turn them into useful information. The data is collected and assembled in common areas, such as data warehouses, where data mining algorithms can analyze it efficiently, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue.
Data mining is also known as data discovery and knowledge discovery.
For example, a user can request that data be analyzed to display a spreadsheet
showing all of a company's beach ball products sold in Florida in the month of
July, compare revenue figures with those for the same products in September and
then see a comparison of other product sales in Florida in the same time period.
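The beach-ball query above can be sketched as a filter-and-sum over sale records; the records and revenue figures are made up for illustration.

```python
# Illustrative sale records; product names and figures are invented.
sales = [
    {"product": "beach ball", "state": "FL", "month": "Jul", "revenue": 1200},
    {"product": "beach ball", "state": "FL", "month": "Sep", "revenue": 700},
    {"product": "umbrella",   "state": "FL", "month": "Jul", "revenue": 400},
    {"product": "beach ball", "state": "TX", "month": "Jul", "revenue": 900},
]

def revenue(product, state, month):
    """Total revenue for one product, state and month: one 'perspective'."""
    return sum(r["revenue"] for r in sales
               if r["product"] == product
               and r["state"] == state
               and r["month"] == month)

july = revenue("beach ball", "FL", "Jul")
september = revenue("beach ball", "FL", "Sep")
print(july, september)  # 1200 700
```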
How OLAP systems work
To facilitate this kind of analysis, data is collected from multiple data sources, stored in data warehouses, and then cleansed and organized into data cubes.
Each OLAP cube contains data categorized by dimensions (such as customers, geographic sales region and time period) derived from dimension tables in the data warehouse. Dimensions are then populated by members (such as customer names, countries and months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to drastically improve query time over relational databases.
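Pre-summarization across a dimension can be sketched with a dictionary keyed by the remaining dimensions; the fact rows and dimension names are illustrative assumptions.

```python
from collections import defaultdict

# Fact rows: (customer, region, month, amount); the values are invented.
facts = [
    ("Ramesh", "West", "Jan", 100),
    ("Khilan", "West", "Jan", 250),
    ("Ramesh", "East", "Feb", 300),
]

# Pre-summarize across the customer dimension: the cube stores one total
# per (region, month) cell, so a query is a single lookup, not a scan.
cube = defaultdict(float)
for customer, region, month, amount in facts:
    cube[(region, month)] += amount

print(cube[("West", "Jan")])  # 350.0
```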
Analysts can then perform five types of OLAP analytical operations against these multidimensional databases: roll-up, drill-down, slice, dice, and pivot.
OLAP products include IBM Cognos, Oracle OLAP and Oracle Essbase. OLAP features are also included in tools such as Microsoft Excel and Microsoft SQL Server's Analysis Services. OLAP products are typically designed for multiple-user environments, with the cost of the software based on the number of users.
OLTP (online transaction processing)
OLTP (online transaction processing) is a class of software programs capable of
supporting transaction-oriented applications on the Internet.
Typically, OLTP systems are used for order entry, financial transactions, customer
relationship management (CRM) and retail sales. Such systems have a large
number of users who conduct short transactions. Database queries are usually
simple, require sub-second response times and return relatively few records.
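A typical short OLTP transaction can be sketched with SQLite; the account table and the transfer amount are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 500.0), (2, 100.0)])

# A short OLTP transaction: transfer 50 from account 1 to account 2.
# The context manager commits on success and rolls back on an exception,
# so both updates take effect or neither does.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(450.0,), (150.0,)]
```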