Database System


Database
A database is used by an organization as an electronic way to store, manage, and retrieve information. A database has the ability to organize, process, and manage information in a structured and controlled manner.

DBMS

Database Management Systems (DBMS) are software systems used to store, retrieve, and run queries on data. A DBMS serves as an interface between an end-user and a database, allowing users to create, read, update, and delete data in the database.
RDBMS

An RDBMS is a type of database management system (DBMS) that stores data in a row-based table structure which connects related data elements. An RDBMS includes functions that maintain the security, accuracy, integrity and consistency of the data.
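As a hedged illustration of the row-based table structure, a minimal SQL sketch (the table and column names here are hypothetical, not taken from this text) connects related data elements with a foreign key:

-- Two related tables; dept_id links each employee row to a department row.
CREATE TABLE department (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(50) NOT NULL,
    dept_id  INT REFERENCES department(dept_id)  -- connects related data elements
);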

DATABASE:
File-oriented approach and Database approach

File-oriented approach


File-based systems were an early attempt to computerize the manual system. This is also called the traditional approach, in which a decentralized approach was taken: each department stored and controlled its own data with the help of a data processing specialist. The main role of the data processing specialist was to create the necessary computer file structures, manage the data within those structures, and design application programs that create reports based on the file data.

[Figure: a file-oriented system with separate student, subject, and result files]
Consider an example of a student's file system. The student file will contain
information regarding the student (i.e. roll no, student name, course etc.).
Similarly, we have a subject file that contains information about the subject and
the result file which contains the information regarding the result.
Some fields are duplicated in more than one file, which leads to data redundancy. To overcome this problem, we need a centralized system, i.e. the DBMS approach.
Database Approach:
A database approach is a well-organized collection of data that are related in a
meaningful way which can be accessed by different users but stored only once
in a system. The various operations performed by the DBMS system are:
Insertion, deletion, selection, sorting etc.
Basis: Meaning
DBMS approach: DBMS is a collection of data. In DBMS, the user is not required to write the procedures.
File system approach: The file system is a collection of data. In this system, the user has to write the procedures for managing the database.

Basis: Sharing of data
DBMS approach: Due to the centralized approach, data sharing is easy.
File system approach: Data is distributed in many files, and it may be of different formats, so it isn't easy to share data.

Basis: Data abstraction
DBMS approach: DBMS gives an abstract view of data that hides the details.
File system approach: The file system exposes the details of the data representation and storage of data.

Basis: Security and protection
DBMS approach: DBMS provides a good protection mechanism.
File system approach: It isn't easy to protect a file under the file system.

Basis: Recovery mechanism
DBMS approach: DBMS provides a crash recovery mechanism, i.e., DBMS protects the user from system failure.
File system approach: The file system doesn't have a crash recovery mechanism, i.e., if the system crashes while entering some data, the content of the file will be lost.

Basis: Manipulation techniques
DBMS approach: DBMS contains a wide variety of sophisticated techniques to store and retrieve the data.
File system approach: The file system can't efficiently store and retrieve the data.

Basis: Concurrency problems
DBMS approach: DBMS takes care of concurrent access to data using some form of locking.
File system approach: Concurrent access has many problems, such as one user reading a file while another is deleting or updating information in it.

Basis: Where to use
DBMS approach: The database approach is used in large systems which interrelate many files.
File system approach: The file system approach is used in small systems with fewer, largely independent files.

Basis: Cost
DBMS approach: The database system is expensive to design.
File system approach: The file system approach is cheaper to design.

Basis: Data redundancy and inconsistency
DBMS approach: Due to the centralization of the database, the problems of data redundancy and inconsistency are controlled.
File system approach: Files and application programs are created by different programmers, so there exists a lot of duplication of data, which may lead to inconsistency.

Basis: Structure
DBMS approach: The database structure is complex to design.
File system approach: The file system approach has a simple structure.

Basis: Data independence
DBMS approach: Data independence exists, and it can be of two types: logical data independence and physical data independence.
File system approach: There exists no data independence.

Basis: Integrity constraints
DBMS approach: Integrity constraints are easy to apply.
File system approach: Integrity constraints are difficult to implement in a file system.

Basis: Data models
DBMS approach: Three types of data models exist: hierarchical, network, and relational data models.
File system approach: There is no concept of data models.

Basis: Flexibility
DBMS approach: Changes are often a necessity to the content of the data stored in any system, and such changes are made more easily with a database approach.
File system approach: The flexibility of the system is less as compared to the DBMS approach.

Basis: Examples
DBMS approach: Oracle, SQL Server, Sybase, etc.
File system approach: COBOL, C++, etc.

Characteristics of the Database Approach


1. Manages Information
A database manages all the information in the required fields (according to column definitions and IDs).
2. Easy Operation Implementation
All the operations like insert, delete, update, and search are performed by SQL queries, which is a flexible and easy way: the user only needs the column names and appropriate values. Queries make it more powerful, as sketched below.
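A hedged sketch of these basic operations in SQL (the student table and its columns are hypothetical examples, not taken from this text):

-- Insert a row, update it, search for it, and delete it.
INSERT INTO student (roll_no, name, course) VALUES (101, 'Asha', 'CS');
UPDATE student SET course = 'IT' WHERE roll_no = 101;
SELECT name, course FROM student WHERE roll_no = 101;
DELETE FROM student WHERE roll_no = 101;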
3. Multiple Views of Database
A view is a subset of the database or interface. It is defined according to user
requirements. Different users of the system may have different views of the
same system.
Every view contains only the data of interest to a user or a group of users.
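As a hedged illustration, a view can be defined in SQL so that one user group sees only the data of interest to it (the table and column names are hypothetical):

-- Students see only marks, not fees or attendance.
CREATE VIEW student_marks AS
SELECT roll_no, name, marks
FROM student_record;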

4. Data for Specific Purpose


A database is designed for a specific purpose according to the interests of a particular group of users and applications.
For example, a database of the student management systems is designed to
maintain the record of students’ marks, fees, and attendance, etc. This data has
the specific purpose of maintaining student records.
For example, the library system has three types of users, 1. official administration
of the college, 2. librarian, and 3. students.
5. Represent Real World Applications
A database represents real-world applications; if anything changes in the real world, the change is reflected in the database. Example: a railway reservation system maintains records of passengers, the waiting list, and train arrival and departure times for a given day, etc., for each train.
6. Logical Relationship Between Records and Data
A database maintains a logical relationship between records and data. So a user
can access various records according to various logical conditions by a single
query from the database.
7. Insulation Between Program and Data
In the database approach, the data structure is stored in the system catalog, not in the programs. If we want to change the structure of a file, there is no need to change the program. This feature is called program-data independence.
It is not found in file-based systems.
8. Self-describing nature of a database system
A database system is referred to as self-describing because it contains metadata. Metadata defines and describes the data and the relationships between tables in the database.
This information is used by the DBMS software or by database users if needed.
This nature is not found in file-based systems.
9. Sharing of data and multiuser system
Current database systems can be accessed by multiple users at the same time. This is made safe by concurrency control strategies, which ensure that the data accessed is always correct and consistent.
10. Control of data redundancy
In the database approach, each data item is stored in only one place in the database, so a single data item is not repeated. This feature improves system performance.
Redundancy is controlled through application programming and minimized when designing the database.
11. Enforcement of integrity constraints
The database approach provides the ability to define and enforce certain constraints to ensure that users enter valid information, maintaining data integrity.
A database constraint is a restriction or rule that dictates what can be entered or edited in a table, such as a postal code using a certain format or a valid city in the City field.
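A minimal hedged sketch of such constraints in SQL (the address table and the six-digit postal-code format are assumptions for illustration; SIMILAR TO is available in dialects such as PostgreSQL):

-- Constraints reject invalid rows at insert/update time.
CREATE TABLE address (
    addr_id     INT PRIMARY KEY,
    city        VARCHAR(40) NOT NULL,                              -- city must be present
    postal_code CHAR(6) CHECK (postal_code SIMILAR TO '[0-9]{6}')  -- digits-only format
);

-- This insert would be rejected by the CHECK constraint:
-- INSERT INTO address VALUES (1, 'Pune', 'ABC123');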
12. Restriction of unauthorized access
A database approach should provide security features to create and control different types of user accounts and restrict unauthorized access.
It provides privileges to access or use data from a database; these are read-only access (the ability to read a file but not make changes) and read-and-write privileges (both read and modify a file).
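A hedged SQL sketch of these two privilege levels (the user names and table are hypothetical):

-- Read-only access: can query, cannot change.
GRANT SELECT ON student TO reporting_user;
-- Read-and-write access: can query and modify.
GRANT SELECT, INSERT, UPDATE, DELETE ON student TO admin_user;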
13. Transaction processing
A database approach supports a concurrency control subsystem that ensures
that data remains consistent and valid during transaction processing even if
several users update the same information.
14. Backup and recovery facilities
Backup and recovery are methods used to protect your data from loss. Backup means storing a copy of the database on another drive or in another location. If a hard drive fails or becomes inaccessible, the database is recovered from the backup location.

Data Model, Schemas and Instances


Data Models
 Helps in the real-world representation of data.
 Useful for the developer to understand the relationship between
various objects in the database.
 Helps to highlight any drawbacks of the plan and correct it at the
design stage.
 Determines the logical structure of a database (describes the structure of the database, including data types, relationships, and constraints).
 Easier to understand each user’s requirement of data, nature of the
data itself, and use of data in the application area. (in which manner
data can be stored, organized, and manipulated in a
database management system.)
 Supports communication between the user and database designer.

Data models characteristics are:-

 Diagrammatic representation.
 Simplicity in design.
 Define data and their relations.
 Useful for database designers.

 Application independent.
 Shared by different applications.
 No duplicate representation of data.
 Consistency and structure validity.

Data models are divided into 3 categories –

1. Conceptual Data Models (Object-based models)
2. Physical Data Models (Low-level models)
3. Representational Data Models (Record-based data models)

1. Conceptual Data Models (Object-based Data Models)

 Design the entities, attributes, and their relationships in the real world.
 Also known as conceptual models.
 There are two types of object-based data models: the Entity-Relationship model and the Object-oriented data model.
A. Entity-Relationship Model (ER data model)

• Defines the mapping between the entities.
• Describes the state of each entity and the tasks in the database.
• Describes data at the logical and view levels.
• This model takes a top-down approach to design.

Advantages
• It is simple and easily understandable through simple diagrams.
• ER diagrams are easily converted into a record-based data model.
• Easy to understand.
Disadvantages
• There is no standard representation for ER diagrams.
• The representation is flexible and depends upon the designer.
• Suited to high-level design; it cannot be carried down to low-level detail (like coding).
B. Object-Oriented Data Models
• Represent real-world objects.
• Based on a collection of objects, their attributes, and their relationships.
• Consider each entity in the world as an object and isolate it from the others.
• Use the inheritance, encapsulation, and abstraction properties.
• Mainly used for multimedia applications as well as data with complex relationships.
Example: In an employee database we have different types of employees – Engineer, Accountant, Manager, Clerk. But all these employees belong to the Person group. A person can have different attributes like name, address, age, and phone.
All employees inherit the attributes and functionalities of Person, so we can reuse those features in Employee.
This feature of the model is called inheritance.

Advantages
• Reuse of attributes and functionality (code) through the inheritance property.
• Reduces overhead and maintenance costs by not maintaining the same data multiple times.
• No fear of misuse by other objects.
• A new class inherited from a parent class can easily be added, with new features.
• More flexible.
• Each class binds its attributes and its functionality.

Disadvantages
• Not widely developed or complete enough for use in database systems, so not widely accepted by users.
• It is an approach to solving the requirement, not a technology.

2. Physical Data Models (Low Level Models)

 It describes:
• how data are stored in computer memory,
• how they are scattered and ordered in the memory, and
• how they would be retrieved from memory.
 Represents the data at the data layer or internal layer.
 Operated by the specified user.
 Uses frame memory models.
 It represents each table, its columns and specifications, and constraints like primary key, foreign key, etc.
 Basically represents how each table is built and related to the others in the DB.

The diagram shows CLASS as the parent table with 2 child tables – STUDENT and SUBJECT. The primary key is represented at the top of each table, and the relationships between the tables are represented by interconnecting arrows from table to table.

3. Representational Data Models (Record-based Data Models)

 Based on the application and user levels of data.
 Considers the logical structure of the objects in the database.
 Defines the relationships between the data in the entities.
 For example, employee and department entities are related to each other by the department.

There are 3 types of record-based data models:

1. Hierarchical,
2. Network, and
3. Relational data models.

1. Hierarchical Data Models

 Describes many real-world relationships.
 The first and oldest database model, introduced by IBM (based on the bill of materials, in the 1960s).

Characteristics

1. Defines records as a tree-like structure.

2. Defines the hierarchy of parent and child record relationships.

3. This model has two main concepts: the record and the parent-child relationship.

4. A record is a collection of field values that provides information about an entity.

5. A parent-child relationship shows a link or connectivity between two records.

6. The data are stored as records.

7. The type of a record defines which fields the record contains.

8. This model shows that each child record has only one parent, but each parent record can have one or more child records.

9. In order to retrieve data from a hierarchical database, the whole tree needs to be traversed starting from the root node.

Operations on the hierarchical model

We can perform various types of basic operations in the hierarchical model: insertion, deletion, retrieval, and update.

Advantages

1. Simplicity: defines a simple parent-child relationship.

2. Integrity of data: each child node can be linked with only one parent, and a child node can be accessed (read) only through its parent.

3. Data security: in this model, accessing, updating, and deleting a child node is possible only with the proper information about its parent node; otherwise it is not possible.

4. Efficiency: it defines one-to-many relationships between parent and child records.

5. Handles transactions efficiently: very efficient at handling a large number of transactions using the links or relationships defined by the pointers.

Disadvantages

1. Knowledge of the physical level of the data store is required.

2. It suffers from a data redundancy problem.

3. Pointers are required to reference stored data, and technical skills are needed to handle them.

4. Complexity increases when expanding or modifying the database.

5. Not flexible enough to establish all types of relationships, mainly many-to-many.

6. Many-to-many relationships require many pointers, which creates complexity.

7. Pointer navigation is needed to access information, which is a complicated task.

8. Does not provide a query (DDL and DML) facility.

9. Modifying the data structure and creating new relations or nodes is very complicated.

10. Data manipulation operations (like deletion and updating) are very complex.

11. Not suitable for large database design and modeling.

2. Network Data Models

• The network model was standardized by CODASYL DBTG (Conference on Data Systems Languages, Database Task Group).
• In this model two main basic data structures are used: records and sets.
• Records: contain detailed information regarding the data, classified into record types.
• Sets: used to represent relationships between record types; a linked list is used for this.
• Each set type defines three basic elements: the name of the set type, an owner record type (like a parent), and a member record type (like a child).
▪ This is an improved version of the hierarchical data model (it solves the drawbacks of the hierarchical model).
▪ It uses or defines M:N relationships (many-to-many relationships).
▪ Allows a node to have more than one parent.
▪ This model is organized like a graph.
▪ Accessing the data is easier and faster in this model.

Example: A company has different projects and departments. Suppliers of the company give input for the projects. A project has multiple parents, and each department and supplier has multiple projects.
• Network models support various types of operations: 1. Insertion 2. Deletion 3. Updation 4. Selection

Advantages of network models

The network model provides various advantages in comparison to the hierarchical model:

1. Elimination of data redundancy

A record occurs only once in the database; other records refer to it using links or pointers, so there are no multiple occurrences of records.

2. Lesser storage requirement

A record occurs only once, without repetition, so less storage is required for storing the records in the database.

3. Better performance

Relationships are defined directly between the records, so the model gives better performance. It provides different types of relationships, such as one-to-one, one-to-many, and many-to-many, and it represents real-world situations that are easily defined in the database.

4. Easy access to data

Records are accessed through pointers, so it is very easy to move from one owner record to another.

5. Data integrity

Provides data integrity through the owner-member relationship.

6. Enforces standards

This model is based on the CODASYL DBTG standard.

7. Data independency

This model provides a structure of data that can be changed without modifying the application programs, giving independence of data.

Disadvantages

I. Complexity

The conceptual design of this model is simple, but the design at the hardware level is very complex because a large number of pointers are required to show the relationships between owner and member records.

II. Difficulty in querying data

In this model the programmer works with links and defines how to traverse them to get the desired information, so proper technical skills are required.

III. Lack of structural independence

There is no independence between the objects: any change to any of the objects requires changes to the whole model, which makes it difficult to manage. It is very difficult to change the database structure; otherwise all application programs need modification or they crash.

IV. Difficulty in designing relationships

It is difficult to design the relationships between the entities, since all the entities are related in some way; this requires thorough practice and knowledge of the design.

3. Relational Data Models

Properties of the relational model

• This model is designed to overcome the drawbacks of the hierarchical and network models.
• This model was introduced by Dr. Edgar Frank Codd (a mathematician working for IBM) in 1970.
• It is the most widely used database model around the world.
• It is purely based on how the records in each table are related.
• It isolates the physical structure from the logical structure.
• The logical structure is defined by records.
• Data is organized in a two-dimensional matrix, a table structure known as a relation.
• A relationship is maintained by storing a common field.
• All the information related to a particular type is stored in the rows of that table.
• It is concerned with data, not with physical storage detail.

▪ It enables the computer system to handle queries efficiently.
▪ Each column and row holds atomic values of the same kind.
▪ Columns are known as attributes; they represent the characteristics of an item or table, for example roll number or student name.
▪ It provides information about the metadata; the metadata contains information about the table structure.
▪ Tables can be joined, and the result is itself a table.
▪ Each row is unique.
▪ We can retrieve columns in any order.
▪ Each table contains rows, known as tuples, which represent a collection of information about separate items, for example customer records.
▪ Rows of the table can be retrieved in a different order.
▪ The sequence may be ascending, descending, or random.
▪ Each table has a unique name.
▪ This model is considered to represent most real-world objects and the relationships between them.
▪ This model completely separates the logical view of data from its physical view.
▪ It provides various types of operations through queries as commands, e.g. DDL, DML.
• This model is based on the mathematical concepts of set theory.

Terminology

Relation: table.

Tuple: row or record.

Attribute: column or field.

Domain: the set of legal values in a column.

Cardinality: the number of rows in a table.

Degree: the number of columns in a table.

Primary key: a unique key on which other keys depend.

Foreign key: the primary key of one table used in another table as a reference.
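A hedged SQL sketch tying these terms together (the course and student tables are hypothetical illustrations):

-- Relation "course": degree 2 (two attributes / columns).
CREATE TABLE course (
    course_id   INT PRIMARY KEY,          -- primary key
    course_name VARCHAR(50)
);

-- Relation "student": each row is a tuple; course_id is a foreign key.
CREATE TABLE student (
    roll_no   INT PRIMARY KEY,
    name      VARCHAR(50),
    course_id INT REFERENCES course(course_id)   -- foreign key reference
);

-- Cardinality = the number of rows currently in the relation.
SELECT COUNT(*) FROM student;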

Advantages

1. Simplicity
It is very easy and simple to design and implement the database at the logical level, using appropriate attributes and data values in tabular format.

2. Flexibility

We can change the database structure according to our requirements.

3. Data independency

We can change the structure of the database without affecting the applications.

4. Structural independency

This model is concerned only with data, not with structure, so it improves performance in terms of processing time and storage space.

5. Query capability

This model supports high-level query languages like SQL (Structured Query Language), which avoid complex database navigation and do not require pointers.

6. Mature technology

It represents real-world objects and relationships, which is made easy by the use of the keys concept.

Disadvantages

Relational models have various disadvantages:

1. A relational database uses a simple mapping of logical tables to physical structures via indexing and hashing techniques for accessing table data; for this it imposes certain limits in the form of constraints.

2. It does not work perfectly as an object-oriented database management system.

3. It has limited ability to deal with binary large objects such as images, spreadsheets, audio, video, etc.

4. This model increases hardware overheads, which is costly.

5. Mapping objects to a relational database can be a difficult skill to learn.

6. Ensuring data integrity is a very difficult task because no single application has control over the data.

7. It is not suitable for huge databases.

8. To separate the physical data information from the logical data, more powerful system hardware and memory are required. This makes the cost of the database high.

Data Model: Schema and Instance

o The data which is stored in the database at a particular moment of time is called an instance of the database.
o The overall design of a database is called the schema.
o A database schema is the skeleton structure of the database. It represents the logical view of the entire database.
o A schema contains schema objects like tables, foreign keys, primary keys, views, columns, data types, stored procedures, etc.
o A database schema can be represented by a visual diagram. That diagram shows the database objects and their relationships with each other.
o A database schema is designed by the database designers to help programmers whose software will interact with the database. The process of database creation is called data modeling.

A schema diagram can display only some aspects of a schema, like the names of record types, data types, and constraints. Other aspects can't be specified through the schema diagram; for example, such a diagram shows neither the data type of each data item nor the relationships among various files.
In the database, the actual data changes quite frequently; for example, the database changes whenever we add a new grade or a new student. The data at a particular moment of time is called the instance of the database.
Components of DBMS
There are many components in a DBMS, and each component has a significant task. A database environment is a collection of components that regulates the use of data, its management, and a group of data. These components consist of people, the techniques for handling the database, data, hardware, software, etc. The several DBMS components are described below.
1. Hardware
o Here, hardware means the physical part of the DBMS. It includes output devices like printers and monitors, and storage devices like hard disks.
o In a DBMS, hardware is the most visible part. The equipment used to capture the data and present the output to the user includes the printer, computer, scanner, etc.
o With the help of hardware, the DBMS can access and update the database.
o The server can store a large amount of data, which can be shared with the help of the user's own system.
o The database can run on any system, ranging from microcomputers to mainframe computers. This hardware also provides the interface between the real world and the database.
o When we run database software like MySQL, we type commands with the help of our keyboard, and the RAM, ROM, and processor are the parts of our computer system that do the work.
2. Software
o Software is the main component of the DBMS.
o Software is defined as the collection of programs used to instruct the computer about its work. It consists of the procedures, programs, and routines associated with the computer system's operation and performance. We can also say that computer software is a set of instructions used to direct the computer hardware in the operation of the computer.
o This includes software such as network software and operating system software. The database software is used to access the database, and the database application performs the tasks.
o This software has the ability to understand the database access language, convert it to real database commands, and then execute them on the database.
o This is the main component, as the whole database operation works through software or an application. Database software can be called the wrapper around the physical database; it provides an easy interface for the user to store, update, and delete data in the database.
o Some examples of DBMS software include MySQL, Oracle, SQL Server, dBase, FileMaker, Clipper, FoxPro, Microsoft Access, etc.
3. Data
o The term data means any collection of raw facts stored in the database. Here the data are any type of raw material from which meaningful information is generated.
o The database can store any form of data, such as structured data, non-structured data, and logical data. Structured data are highly specific and have a structured format; non-structured data are a collection of different types of data stored in their native formats.
o We also call the database the structure of the DBMS: with the help of the database we can create and construct the DBMS, and after its creation we can create, access, and update the database.
o The main reason behind devising the database is to create and manage the data within it.
o Data is the most important part of the DBMS. The database contains the actual data and the metadata; metadata means data about data.
o For example, when the user stores data in a database, some data such as the size of the data, the name of the data, and some data related to the user are stored within the database. These data are called metadata.
4. Procedures
o A procedure is a type of general instruction or guideline for the use of a DBMS. These instructions include how to set up the database, how to install it, how to log in and log out, how to manage it, how to take a backup, and how to generate reports from the database.
o In a DBMS, with the help of procedures we can validate data, control access, and reduce the traffic between the server and the clients. The DBMS can offer better performance for extensive or complex business logic when the user follows all the procedures correctly.
o The main purpose of procedures is to guide the user during the management and operation of the database.
o A database procedure is similar to a database function. The major difference is that a database function acts like a SQL statement, whereas a database procedure is invoked using the CALL statement of the DBMS.
o Database procedures can be created in two ways in enterprise architecture: as an individual object (the default object), or as operations in a container.

CREATE [OR REPLACE] PROCEDURE procedure_name (<Argument> {IN, OUT, IN OUT} <Datatype>, ...)
IS
    Declaration section <variable, constant>;
BEGIN
    Execution section
EXCEPTION
    Exception section
END;
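As a hedged, concrete instance of this template (assuming an Oracle-style PL/SQL dialect; the employees table and its columns are hypothetical):

CREATE OR REPLACE PROCEDURE raise_salary (p_emp_id IN NUMBER, p_amount IN NUMBER)
IS
    v_name employees.emp_name%TYPE;   -- declaration section
BEGIN
    -- execution section: verify the employee exists, then update
    SELECT emp_name INTO v_name FROM employees WHERE emp_id = p_emp_id;
    UPDATE employees SET salary = salary + p_amount WHERE emp_id = p_emp_id;
EXCEPTION
    -- exception section: SELECT INTO raises NO_DATA_FOUND if no row matches
    WHEN NO_DATA_FOUND THEN
        RAISE_APPLICATION_ERROR(-20001, 'No such employee');
END;

The procedure would then be invoked with the CALL statement mentioned above, e.g. CALL raise_salary(1001, 500).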
5. Database Access Language
o Database Access Language is a simple language that allows users to write
commands to perform the desired operations on the data that is stored in
the database.
o Database Access Language is a language used to write commands to
access, upsert, and delete data stored in a database. o Users can write
commands or query the database using Database Access Language before
submitting them to the database for execution.
o By utilizing the language, users can create new databases and tables, insert data, and delete data.
o Examples of database languages are SQL (Structured Query Language), MS Access, Oracle, etc. A database language is comprised of two languages.
1. Data Definition Language (DDL): It is used to construct a database. DDL implements the database schema at the physical, logical, and external levels.
The following commands serve as the base for all DDL commands:
o ALTER <object>
o COMMENT
o CREATE <object>
o DESCRIBE <object>
o DROP <object>
o SHOW <object>
o USE <object>
2. Data Manipulation Language (DML): It is used to access a database. The DML provides the statements to retrieve, modify, insert, and delete the data in the database.
The following commands serve as the base for all DML commands:
o INSERT
o UPDATE
o DELETE
o LOCK
o CALL
o EXPLAIN
o PLAN
6. People
o The people who control and manage the databases and perform different types of operations on the database in the DBMS.
o The people include the database administrator, the software developer, and the end user.
o Database administrator: the database administrator is the one who manages the complete database management system. The DBA takes care of the security of the DBMS, its availability, managing license keys, managing user accounts and access, etc.
o Software developer: this user group is involved in developing and designing the parts of the DBMS. They can handle massive quantities of data, modify and edit databases, design and develop new databases, and troubleshoot database issues.
o End user: these days, all modern web or mobile applications store user data. How do you think they do it? Applications are programmed in such a way that they collect user data and store it in a DBMS running on their server. End users are the ones who store, retrieve, update, and delete data.
o The users of the database can be classified into different groups:
i. Naive Users
ii. Online Users
iii. Sophisticated Users
iv. Specialized Users
v. Application Users
vi. DBA - Database Administrator

Three Schema Architecture of a Database System
o The three schema architecture is also called the ANSI/SPARC (American National Standards Institute, Standards Planning And Requirements Committee) architecture or three-level architecture.
o This framework is used to describe the structure of a specific database system.
o The three schema architecture is also used to separate the user applications from the physical database.
o The three schema architecture contains three levels: it breaks the database down into three different categories.

In the three-schema architecture diagram:

o It shows the DBMS architecture.
o Mapping is used to transform a request and its response between the various levels of the architecture.
o Mapping is not good for small DBMSs because it takes more time.
o In external/conceptual mapping, it is necessary to transform a request from the external level to the conceptual schema.
o In conceptual/internal mapping, the DBMS transforms a request from the conceptual level to the internal level.
Objectives of the Three Schema Architecture
The main objective of the three-level architecture is to enable multiple users to access the same data with a personalized view while storing the underlying data only once. Thus it separates the user's view from the physical structure of the database. This separation is desirable for the following reasons:
o Different users need different views of the same data.
o The way a particular user needs to see the data may change over time.
o The users of the database should not have to worry about the physical implementation and internal workings of the database, such as data compression and encryption techniques, hashing, optimization of internal structures, etc.
o All users should be able to access the same data according to their requirements.
o The DBA should be able to change the conceptual structure of the database without affecting the users' views.
o The internal structure of the database should be unaffected by changes to the physical aspects of the storage.
1. Internal Level

o The internal level has an internal schema which describes the physical storage structure of the database.
o The internal schema is also known as the physical schema.
o It uses the physical data model. It is used to define how the data will be stored in a block.
o The physical level is used to describe complex low-level data structures in detail.

The internal level is generally concerned with the following activities:
o Storage space allocation. For example: B-trees, hashing, etc.
o Access paths. For example: specification of primary and secondary keys, indexes, pointers, and sequencing.
o Data compression and encryption techniques.
o Optimization of internal structures.
o Representation of stored fields.

2. Conceptual Level

o The conceptual schema describes the design of the database at the conceptual level. The conceptual level is also known as the logical level.
o The conceptual schema describes the structure of the whole database.
o The conceptual level describes what data are to be stored in the database and what relationships exist among those data.
o At the conceptual level, internal details such as the implementation of the data structures are hidden.
o Programmers and database administrators work at this level.

3. External Level

o At the external level, a database contains several schemas, sometimes called subschemas. A subschema is used to describe a different view of the database.
o An external schema is also known as a view schema. Each view schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group.
o The view schema describes the end-user interaction with the database systems.
Data Independence
o Data independence can be explained using the three-schema architecture.
o Data independence refers to the characteristic of being able to modify the schema at one level of the database system without altering the schema at the next higher level.
There are two types of data independence:
1. Logical Data Independence
o Logical data independence refers to the characteristic of being able to change the conceptual schema without having to change the external schemas.
o Logical data independence is used to separate the external level from the conceptual view.
o If we make any changes to the conceptual view of the data, the user's view of the data will not be affected.
o Logical data independence occurs at the user interface level.
2. Physical Data Independence
o Physical data independence can be defined as the capacity to change the internal schema without having to change the conceptual schema.
o If we make any changes to the storage of the database system server, the conceptual structure of the database will not be affected.
o Physical data independence is used to separate the conceptual level from the internal level.
o Physical data independence occurs at the logical interface level.

[Fig: Data Independence]

Data Dictionary
A data dictionary contains metadata, i.e., data about the database. The data dictionary is very important, as it contains information such as what is in the database, who is allowed to access it, where the database is physically stored, etc. The users of the database normally don't interact with the data dictionary; it is handled only by the database administrators.
The data dictionary in general contains information about the following −
• Names of all the database tables and their schemas.
• Details about all the tables in the database, such as their owners,
their security constraints, when they were created etc.
• Physical information about the tables such as where they are stored
and how.
• Table constraints such as primary key attributes, foreign key information
etc.
• Information about the database views that are visible.
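As a hedged illustration, many SQL systems expose the data dictionary through queryable views; assuming a system that supports the standard information_schema (e.g. MySQL or PostgreSQL), table metadata can be listed like this:

-- List every table the dictionary knows about, with its schema.
SELECT table_schema, table_name, table_type
FROM information_schema.tables
ORDER BY table_schema, table_name;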
This is a data dictionary describing a table that contains employee details.

Field Name      | Data Type | Field Size for display | Description                | Example
EmployeeNumber  | Integer   | 10                     | Unique ID of each employee | 1645000001
Name            | Text      | 20                     | Name of the employee       | David Heston
Date of Birth   | Date/Time | 10                     | DOB of Employee            | 08/03/1995
Phone Number    | Integer   | 10                     | Phone number of employee   | 6583648648

The different types of data dictionary are −


Active Data Dictionary
If the structure of the database or its specifications change at any point of time,
it should be reflected in the data dictionary. This is the responsibility of the
database management system in which the data dictionary resides.
So, the data dictionary is automatically updated by the database management
system when any changes are made in the database. This is known as an active
data dictionary as it is self updating.
Passive Data Dictionary
This is not as useful or easy to handle as an active data dictionary. A passive data dictionary is maintained separately from the database whose contents are stored in the dictionary. That means that if the database is modified, the dictionary is not automatically updated, as it is in the case of an active data dictionary. So the passive data dictionary has to be manually updated to match the database. This needs careful handling, or else the database and the data dictionary will fall out of sync.

Database administration
Database administration is the function of managing and maintaining database management system (DBMS) software. Mainstream DBMS software such as Oracle, IBM Db2 and Microsoft SQL Server needs ongoing management. As such, corporations that use DBMS software often hire specialized information technology personnel called database administrators or DBAs.

Responsibilities
• Installation, configuration and upgrading of database server software and related products.
• Evaluate Database features and Database related products.
• Establish and maintain sound backup and recovery policies and
procedures.
• Take care of the Database design and implementation.
• Implement and maintain database security (create and maintain users and
roles, assign privileges).
• Database tuning and performance monitoring.
• Application tuning and performance monitoring.
• Setup and maintain documentation and standards.
• Plan growth and changes (capacity planning).
• Work as part of a team and provide 24/7 support when required.
• Do general technical troubleshooting and give consultation.
• Database recovery

There are three types of DBAs:

1. Systems DBAs (also referred to as physical DBAs, operations DBAs or production support DBAs): focus on the physical aspects of database administration such as DBMS installation, configuration, patching, upgrades, backups, restores, refreshes, performance optimization, maintenance and disaster recovery.

2. Development DBAs: focus on the logical and development aspects of database administration such as data model design and maintenance, DDL (data definition language) generation, SQL writing and tuning, coding stored procedures, collaborating with developers to help choose the most appropriate DBMS feature/functionality and other preproduction activities.

3. Application DBAs: usually found in organizations that have purchased 3rd
party application software such as ERP (enterprise resource planning) and
CRM (customer relationship management) systems. Examples of such
application software includes Oracle Applications, Siebel and PeopleSoft
(both now part of Oracle Corp.) and SAP. Application DBAs straddle the
fence between the DBMS and the application software and are
responsible for ensuring that the application is fully optimized for the
database and vice versa. They usually manage all the application
components that interact with the database and carry out activities such
as application installation and patching, application upgrades, database
cloning, building and running data cleanup routines, data load process
management, etc.
Database Languages
o A DBMS has appropriate languages and interfaces to express database queries and updates.
o Database languages can be used to read, store and update the data in the database.
Types of Database Languages

1. Data Definition Language (DDL)
o DDL stands for Data Definition Language. It is used to define the database structure or pattern.
o It is used to create schemas, tables, indexes, constraints, etc. in the database.
o Using the DDL statements, you can create the skeleton of the database.
o Data definition language is used to store the information of metadata like the number of tables and schemas, their names, indexes, columns in each table, constraints, etc.
Here are some tasks that come under DDL:
o Create: It is used to create objects in the database.
o Alter: It is used to alter the structure of the database.
o Drop: It is used to delete objects from the database.
o Truncate: It is used to remove all records from a table.
o Rename: It is used to rename an object.
o Comment: It is used to comment on the data dictionary.
These commands are used to update the database schema; that's why they come under Data Definition Language.
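A hedged sketch of these DDL tasks in SQL (the student table is a hypothetical example; RENAME TABLE is MySQL syntax, other dialects use ALTER TABLE ... RENAME TO):

CREATE TABLE student (roll_no INT, name VARCHAR(50));  -- Create an object
ALTER TABLE student ADD course VARCHAR(30);            -- Alter its structure
TRUNCATE TABLE student;                                -- Remove all records
RENAME TABLE student TO learner;                       -- Rename the object
DROP TABLE learner;                                    -- Delete the object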
2. Data Manipulation Language (DML)
DML stands for Data Manipulation Language. It is used for accessing and manipulating data in a database. It handles user requests.

Here are some tasks that come under DML:

o Select: It is used to retrieve data from a database.
o Insert: It is used to insert data into a table.
o Update: It is used to update existing data within a table.
o Delete: It is used to delete records from a table.
o Merge: It performs an UPSERT operation, i.e., insert or update (sketched below).
o Call: It is used to call a structured query language or Java subprogram.
o Explain Plan: It describes the access path to the data.
o Lock Table: It controls concurrency.
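Select, insert, update, and delete were sketched earlier; as a hedged illustration of the less familiar Merge task (Oracle-style MERGE, with hypothetical student and new_students tables):

-- Upsert: update matching rows, insert the rest.
MERGE INTO student s
USING new_students n ON (s.roll_no = n.roll_no)
WHEN MATCHED THEN UPDATE SET s.name = n.name
WHEN NOT MATCHED THEN INSERT (roll_no, name) VALUES (n.roll_no, n.name);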
3. Data Control Language (DCL)
o DCL stands for Data Control Language. It is used to control access to the data stored in the database.
o DCL execution is transactional. It also has rollback parameters.
(But in the Oracle database, the execution of data control language does not have the feature of rolling back.)

Here are some tasks that come under DCL:


o Grant: It is used to give user access privileges to a database.
o Revoke: It is used to take back permissions from the user.
There are the following operations which have the authorization of Revoke:
CONNECT, INSERT, USAGE, EXECUTE, DELETE, UPDATE and SELECT.
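A hedged DCL sketch (the user name and table are hypothetical):

GRANT SELECT, UPDATE ON student TO clerk_user;   -- give access privileges
REVOKE UPDATE ON student FROM clerk_user;        -- take back one privilege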
4. Transaction Control Language (TCL)
TCL is used to make permanent or undo the changes made by DML statements. TCL statements can be grouped into a logical transaction.
Here are some tasks that come under TCL:
o Commit: It is used to save the transaction on the database.
o Rollback: It is used to restore the database to its state as of the last Commit.
o Savepoint: This command is used to save the transaction temporarily, so that users can roll back to the required point of the transaction.
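A hedged TCL sketch of a transaction against the hypothetical student table:

INSERT INTO student (roll_no, name) VALUES (2, 'Meera');
SAVEPOINT after_meera;                 -- temporary save point
INSERT INTO student (roll_no, name) VALUES (3, 'Arun');
ROLLBACK TO after_meera;               -- undo only the second insert
COMMIT;                                -- make the remaining work permanent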
Database System Architecture
Centralized Architecture of DBMS:

 Centralized architectures used mainframe computers to provide the main processing for all system functions, including user application programs and user interface programs, as well as all the DBMS functionality.

 Most users accessed such systems via computer terminals that did not have processing power and only provided display capabilities.

 As prices of hardware declined, most users replaced their terminals with PCs and workstations.

 The figure illustrates the physical components in a centralized architecture. Gradually, DBMS systems started to exploit the available processing power at the user side, which led to client/server DBMS architectures.
Client/Server Architectures

 The client/server architecture was developed to deal with computing environments in which a large number of PCs, workstations, file servers, printers, database servers, Web servers, e-mail servers, and other software and equipment are connected via a network.

 The idea is to define specialized servers with specific functionalities.

 For example, it is possible to connect a number of PCs or small workstations as clients to a file server that maintains the files of the client machines.

 Another machine can be designated as a printer server by being connected to various printers; all print requests by the clients are forwarded to this machine.

 Web servers or e-mail servers also fall into the specialized server category. The resources provided by specialized servers can be accessed by many client machines.

 The client machines provide the user with the appropriate interfaces to utilize these servers, as well as with local processing power to run local applications.

 This concept can be carried over to other software packages, with specialized programs, such as a CAD (computer-aided design) package, being stored on specific server machines and being made accessible to multiple clients.
Two-Tier Client Server Architecture:
The term "two-tier" refers to our architecture's two layers-the Client layer and
the Data layer. There are a number of client computers in the client layer that
can contact the database server. The API on the client computer will use
JDBC(Java Database Connectivity) or some other method to link the computer
to the database server. This is due to the possibility of various physical locations
for clients and database servers.

Three-Tier Client-Server Architecture:

The Business Logic Layer is an additional layer that serves as a link between the Client layer and the Data layer in this instance. The business logic layer is where the application programs are processed, unlike a two-tier architecture, where queries are performed in the database server. Here, the application programs are processed in the application server itself.
Distributed Database System:
A Distributed Database System is a kind of database that is present or divided in
more than one location, which means it is not limited to any single computer
system. It is divided over the network of various systems. The Distributed
Database System is physically present on the different systems in different
locations. This can be necessary when different users from all over the world
need to access a specific database. For a user, it should be handled in such a way
that it seems like a single database.

Parameters of Distributed Database Systems:

o Distribution:
It describes how data is physically distributed among the several sites.
o Autonomy:
It reveals the division of power inside the database system and the degree of autonomy enjoyed by each individual DBMS.
o Heterogeneity:
It speaks of the similarity or differences between the databases, system parts, and data models.
Common Architecture Models of Distributed Database Systems:

o Client-Server Architecture of DDBMS:
This is a two-level architecture where the main functionality is divided between clients and servers. There is various functionality provided by the server, like managing transactions, managing data, processing queries, and optimization.
o Peer-to-Peer Architecture of DDBMS:
In this architecture, each node or peer is considered a server as well as a client, and it performs its database services as both (server and client). The peers coordinate their efforts and share their resources with one another.
o Multi-DBMS Architecture of DDBMS:
This is an amalgam of two or more independent database systems that functions as a single integrated database system.
Types of Distributed Database Systems:

o Homogeneous Database System:


Each site stores the same database in a Homogenous Database. Since each site
has the same database stored, so all the data management schemes, operating
system, and data structures will be the same across all sites. They are, therefore,
simple to handle.
o Heterogeneous Database System:
In this type of Database System, different sites are used to store the data and
relational tables, which makes it difficult for database administrators to do the
transactions and run the queries into the database. Additionally, one site might
not even be aware of the existence of the other sites. Different operating systems
and database applications may be used by various computers. Since each system
has its own database model to store the data, therefore it is required there
should be translation schemes to establish the connections between different
sites to transfer the data.

Distributed Data Storage:


There are two methods by which we can store the data on different sites:
o Replication:

This method involves redundantly storing the full relation at two or more locations. Since a complete database can be accessed from each site, it becomes a redundant database. Systems preserve copies of the data as a result of replication.
This has advantages because it makes more data accessible at many locations; additionally, query requests can be handled in parallel.
However, there are some drawbacks as well. Data must be updated frequently: any change performed at one site must be recorded at every site where that relation is stored, in order to avoid inconsistent results. This is a lot of overhead. Additionally, concurrency control becomes far more complicated, since concurrent access must now be monitored across several sites.

o Fragmentation:
According to this method, the relations are divided (i.e., broken up into smaller pieces), and each fragment is stored at the various locations where it is needed. To ensure there is no data loss, the fragments must be created in a way that allows the reconstruction of the original relation.
Since fragmentation doesn't produce duplicate data, consistency is not a concern.
Ways of fragmentation:
o Horizontal Fragmentation:
In horizontal fragmentation, the relational table or schema is broken down into groups of one or more rows, and each group of rows forms one fragment of the schema. It is also called splitting by rows.
o Vertical Fragmentation:
In this fragmentation, a relational table or schema is divided into two or more smaller schemas. A common candidate key must be present in each fragment in order to guarantee a lossless join. This is also called splitting by columns. A sketch of both follows.
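A hedged SQL sketch of both kinds of fragmentation (the customer table, its columns, and the region values are hypothetical):

-- Horizontal fragmentation: split by rows; each site stores one selection.
CREATE TABLE customer_north AS SELECT * FROM customer WHERE region = 'NORTH';
CREATE TABLE customer_south AS SELECT * FROM customer WHERE region = 'SOUTH';

-- Vertical fragmentation: split by columns; each fragment keeps the key
-- (cust_id) so the original relation can be rebuilt with a lossless join.
CREATE TABLE customer_contact AS SELECT cust_id, phone, email FROM customer;
CREATE TABLE customer_billing AS SELECT cust_id, address, credit_limit FROM customer;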
Note: Most of the time, a hybrid approach of replication and fragmentation is
used.
Applications of Distributed Database Systems:
o Multimedia apps use it.
o Manufacturing control systems also make use of it.
o Corporate management uses it for information systems.
o It is used in hotel chains, military command systems, etc.

Database Applications

1. Personal Databases:
Definition
A personal database system is a local database system for a single user to store and manage data and information on their own personal system. A number of applications are used on the local computer to design and manage a personal database system.
Functions of a personal database
Supports one application
A personal database management system requires only one application to store and manage data on a personal computer.
Has a few tables
A personal database management system is based on a small database consisting of a few tables on a local or personal computer. It is easy to handle and manage, and there is no need to install other devices to access and control the data and information.
Involves one computer
In this database management system, only one computer is involved in storing and managing the database.
Simple design
The design of a database management system is very important for storing and controlling data. A personal database management system has a simple design for storing data and information.
Advantages of a personal database system
Fast processing
Because it is based on a local computer, the data can be processed faster and handled more reliably.
Higher security
Data stored on a personal computer does not need any special security arrangements for the authorization of data.
Disadvantages of a personal database system
Smaller amount of data
Only a small amount of data and information is stored in a personal database management system, and there is no connectivity with other computers to get more data.
No connectivity to external databases
A personal database management system has only a personal database. There is no connectivity with other computer systems or database systems to access data and information.

2. Workgroup Database:
The Progress Workgroup RDBMS offers many of the same powerful capabilities as the Enterprise RDBMS.
It is optimized for workgroups of 2 to 50 concurrent users and provides a cost-effective, department-level solution that includes high performance, multi-user support, and cross-platform interoperability at an excellent value.
It meets the needs of workgroup applications by running on a wide variety of hardware and operating system platforms.
Because the flexible database architecture provides excellent throughput on all platforms, a database deployed on one machine can serve applications on other systems and network configurations.

3. Enterprise Database:

The OpenEdge Enterprise RDBMS is designed for large user environments and the transaction processing throughput of today's most demanding on-line transaction processing (OLTP) applications.
Grounded in a flexible, multithreaded, multiserver architecture, the Enterprise RDBMS is a powerful, open and large-scale enterprise database that can run across multiple hardware platforms and networks.
The Enterprise RDBMS includes all of the functionality needed to meet the most demanding OLTP requirements.
These capabilities include row-level locking, roll-back and roll-forward recovery, point-in-time recovery, distributed database management with two-phase commit, integral support for fail-over cluster availability, a complete suite of on-line utilities, and support for the OpenEdge ABL as well as industry-standard SQL.
The unique combination of power, flexibility and ease of operation makes the Enterprise RDBMS an ideal engine for a wide range of commercial and data processing applications.
Sophisticated self-tuning capabilities make the Enterprise RDBMS easier to install, tune and manage than other products. With low administration costs, low initial cost of licenses, minimum upgrade fees and limited software implementation costs, the Enterprise RDBMS provides a significant cost-of-ownership advantage over competing databases.
Features Unique to the Enterprise RDBMS
The ability to maximize performance and scalability is found in the following
Enterprise database capabilities (more detailed information can be found in
the Progress Systems Administration Guide chapter on “Managing Progress
Performance”). The following summary shows the primary features that are
unique to the Enterprise RDBMS.
Large File Support
Operating systems now have the ability to support data files larger than 2
gigabytes (as an OS configuration option). The OpenEdge Enterprise database
allows you to enable (up to terabyte-size) large files for the database, which
simplifies management of your operation since there are fewer files to manage.
The use of large files also permits increased maximum capacity for the
database.
Tunable Spin Locks
-SPIN sets the number of times a process retries to acquire a latch before
pausing. It uses the spin lock algorithm, which is very efficient when you
have multiple processors.
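To illustrate the general idea behind spin locks (a sketch of the concept only, not OpenEdge's internal implementation), the Python snippet below retries a non-blocking lock a fixed number of times before napping; the SPIN_RETRIES constant and acquire_with_spin helper are hypothetical stand-ins for the role the -SPIN parameter plays.

    import threading
    import time

    latch = threading.Lock()
    SPIN_RETRIES = 1000  # hypothetical analogue of the -SPIN setting

    def acquire_with_spin(lock, retries, nap=0.001):
        """Try the lock `retries` times before napping and retrying.

        Spinning avoids a context switch when the latch is usually
        released quickly, which pays off with multiple processors.
        """
        while True:
            for _ in range(retries):
                if lock.acquire(blocking=False):  # non-blocking attempt
                    return
            time.sleep(nap)  # back off briefly, then spin again

    acquire_with_spin(latch, SPIN_RETRIES)
    try:
        pass  # critical section: touch the shared structure here
    finally:
        latch.release()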
Concept of Data Warehouse
A data warehouse is a type of data management system that is designed to
enable and support business intelligence (BI) activities, especially analytics.
Data warehouses are solely intended to perform queries and analysis and often
contain large amounts of historical data. The data within a data warehouse is
usually derived from a wide range of sources such as application log files and
transaction applications.
A data warehouse centralizes and consolidates large amounts of data from
multiple sources. Its analytical capabilities allow organizations to derive
valuable business insights from their data to improve decision-making. Over
time, it builds a historical record that can be invaluable to data scientists and
business analysts. Because of these capabilities, a data warehouse can be
considered an organization’s “single source of truth.”
A typical data warehouse often includes the following elements:
• A relational database to store and manage data
• An extraction, loading, and transformation (ELT) solution for preparing
the data for analysis (a minimal sketch follows this list)
• Statistical analysis, reporting, and data mining capabilities
• Client analysis tools for visualizing and presenting data to business users
• Other, more sophisticated analytical applications that generate actionable
information by applying data science and artificial intelligence (AI)
algorithms, or graph and spatial features that enable more kinds of
analysis of data at scale
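As a rough illustration of the ELT element above, here is a minimal Python sketch using the standard sqlite3 module as a stand-in for the warehouse; the table and column names are illustrative assumptions, not part of any particular product.

    import sqlite3

    # Illustrative source rows, standing in for extracted application data
    source_rows = [
        ("2024-01-05", "widget", 3, 9.99),
        ("2024-01-05", "gadget", 1, 24.50),
        ("2024-01-06", "widget", 2, 9.99),
    ]

    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
    conn.execute("CREATE TABLE raw_sales (day TEXT, product TEXT, qty INT, price REAL)")

    # Extract + Load: raw data lands in the warehouse first...
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?)", source_rows)

    # ...Transform: summary data is derived inside the warehouse (the "T" in ELT)
    conn.execute("""
        CREATE TABLE daily_revenue AS
        SELECT day, SUM(qty * price) AS revenue
        FROM raw_sales GROUP BY day
    """)

    for row in conn.execute("SELECT * FROM daily_revenue ORDER BY day"):
        print(row)  # ('2024-01-05', 54.47) then ('2024-01-06', 19.98)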
Benefits of a Data Warehouse
Data warehouses offer the overarching and unique benefit of allowing
organizations to analyze large amounts of varied data and extract significant
value from it, as well as to keep a historical record.
Four unique characteristics allow data warehouses to deliver this overarching
benefit. Data warehouses are
• Subject-oriented. They can analyze data about a particular subject or
functional area (such as sales).
• Integrated. Data warehouses create consistency among different data
types from disparate sources.
• Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t
change.
• Time-variant. Data warehouse analysis looks at change over time.
A well-designed data warehouse will perform queries very quickly, deliver high
data throughput, and provide enough flexibility for end users to “slice and dice”
or reduce the volume of data for closer examination to meet a variety of
demands—whether at a high level or at a very fine, detailed level. The data
warehouse serves as the functional foundation for middleware BI
environments that provide end users with reports, dashboards, and other
interfaces.
Data Warehouse Architecture
The architecture of a data warehouse is determined by the organization’s specific
needs. Common architectures include
• Simple. All data warehouses share a basic design in which metadata,
summary data, and raw data are stored within the central repository of
the warehouse. The repository is fed by data sources on one end and
accessed by end users for analysis, reporting, and mining on the other
end.
• Simple with a staging area. Operational data must be cleaned and
processed before being put in the warehouse. Although this can be done
programmatically, many data warehouses add a staging area for data
before it enters the warehouse, to simplify data preparation.
• Hub and spoke. Adding data marts between the central repository and
end users allows an organization to customize its data warehouse to
serve various lines of business. When the data is ready for use, it is
moved to the appropriate data mart.
• Sandboxes. Sandboxes are private, secure, safe areas that allow
companies to quickly and informally explore new datasets or ways of
analyzing data without having to conform to or comply with the formal
rules and protocol of the data warehouse.
Concept of Data Mining
Data mining is the process of searching and analyzing a large batch of raw data
in order to identify patterns and extract useful information.
Companies use data mining software to learn more about their customers. It
can help them to develop more effective marketing strategies, increase sales,
and decrease costs. Data mining relies on effective data collection,
warehousing, and computer processing.
KEY TAKEAWAYS
• Data mining is the process of analyzing a large batch of information to
discern trends and patterns.
• Data mining can be used by corporations for everything from learning
about what customers are interested in or want to buy to fraud detection
and spam filtering.
• Data mining programs break down patterns and connections in data based
on what information users request or provide.
• Social media companies use data mining techniques to commodify their
users in order to generate profit.
• This use of data mining has come under criticism lately as users are often
unaware of the data mining happening with their personal information,
especially when it is used to influence preferences.
How Data Mining Works
Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It is used in credit risk
management, fraud detection, and spam filtering. It also is a market research
tool that helps reveal the sentiment or opinions of a given group of people. The
data mining process breaks down into four steps:
• Data is collected and loaded into data warehouses on-site or on a cloud
service.
• Business analysts, management teams, and information technology
professionals access the data and determine how they want to organize it.
• Custom application software sorts and organizes the data.
• The end user presents the data in an easy-to-share format, such as a graph
or table.
Data Mining Techniques
Data mining uses algorithms and various other techniques to convert large
collections of data into useful output. The most popular types of data mining
techniques include:
1. Clustering
Clustering is a technique used to represent data visually — such as in graphs
that show buying trends or sales demographics for a particular product.
Clustering refers to the process of grouping a series of different data points
based on their characteristics. By doing so, data miners can seamlessly divide
the data into subsets, allowing for more informed decisions in terms of broad
demographics (such as consumers or users) and their respective behaviors.
Examples of Clustering in Business
Clustering helps businesses manage their data more effectively. For example,
retailers can use clustering models to determine which customers buy
particular products, on which days, and with what frequency. This can help
retailers target products and services to customers in a specific demographic or
region.
Clustering can help grocery stores group products by a variety of characteristics
(brand, size, cost, flavor, etc.) and better understand their sales tendencies. It
can also help car insurance companies that want to identify a set of customers
who typically have high annual claims in order to price policies more effectively.
In addition, banks and financial institutions might use clustering to better
understand how customers use in-person versus virtual services to better plan
branch hours and staffing.
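To make the technique concrete, here is a minimal sketch using scikit-learn's KMeans on hypothetical customer records; the two features and the choice of three clusters are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customers: [average basket size ($), visits per month]
    customers = np.array([
        [12.0, 2], [15.5, 3], [14.0, 2],   # occasional, mid-size baskets
        [60.0, 1], [75.0, 1],              # rare but large shoppers
        [8.0, 12], [6.5, 15], [9.0, 10],   # frequent, small baskets
    ])

    # Group the data points into three clusters by feature similarity
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

    print(km.labels_)           # cluster assignment for each customer
    print(km.cluster_centers_)  # the "typical" customer in each segment

Each label identifies the subset a customer falls into, which is the grouping that the decisions described above rely on.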
2. Association
Association rules are used to find correlations, or associations, between points
in a data set.
Data miners use association to discover unique or interesting relationships
between variables in databases. Association is often employed to help
companies shape marketing research and strategy.
Examples of Association in Business
The analysis of shopping behavior is an example of association — that is,
retailers notice in data studies that parents shopping for childcare supplies are
more likely to purchase specialty food or beverage items for themselves during
the same trip. These purchases can be analyzed through statistical association.
Association analysis carries many other uses in business. For retailers, it’s
particularly helpful in making purchasing suggestions. For example, if a
customer buys a smartphone, tablet, or video game device, association analysis
can recommend related items like cables, applicable software, and protective
cases.
Additionally, association is used by the government to employ census data and
plan for public services; it is also used by doctors to diagnose various illnesses
and conditions more effectively.
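A minimal sketch of the idea in plain Python, assuming made-up shopping baskets: it computes the support and confidence of one candidate rule, the two measures most association-rule methods are built on.

    # Hypothetical shopping baskets (one set of items per trip)
    baskets = [
        {"diapers", "baby food", "coffee"},
        {"diapers", "coffee"},
        {"bread", "milk"},
        {"diapers", "baby food", "chocolate"},
        {"coffee", "milk"},
    ]

    def support(itemset):
        """Fraction of baskets that contain every item in `itemset`."""
        hits = sum(1 for basket in baskets if itemset <= basket)
        return hits / len(baskets)

    # Candidate rule: {diapers} -> {coffee}
    antecedent, consequent = {"diapers"}, {"coffee"}
    both = support(antecedent | consequent)  # 2/5 = 0.40
    confidence = both / support(antecedent)  # 0.40 / 0.60 = 0.67
    print(f"support={both:.2f}, confidence={confidence:.2f}")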
3. Data Cleaning
Data cleaning is the process of preparing data to be mined.
Data cleaning involves organizing data, eliminating duplicate or corrupted data,
and filling in any null values. When this process is complete, the most useful
information can be harvested for analysis.
Examples of Data Cleaning in Business
According to Experian, 95 percent of businesses say they have been impacted
by poor data quality. Working with incorrect data wastes time and resources,
increases analysis costs (because models need to be repeated), and often leads
to faulty analytics.
Ultimately, no matter how great their models or algorithms are, businesses
suffer when their data is incorrect, incomplete, or corrupted.
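A small pandas sketch of those cleaning steps on hypothetical customer records; the column names and imputation choices are illustrative assumptions.

    import numpy as np
    import pandas as pd

    # Hypothetical raw records with a duplicate row and missing values
    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "region":      ["east", "west", "west", None, "east"],
        "spend":       [120.0, 80.0, 80.0, np.nan, 45.0],
    })

    clean = (
        raw.drop_duplicates()  # eliminate the duplicated record
           .assign(
               region=lambda d: d["region"].fillna("unknown"),        # fill null labels
               spend=lambda d: d["spend"].fillna(d["spend"].mean()),  # impute nulls
           )
    )
    print(clean)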
4. Data Visualization
Data visualization is the translation of data into graphic form to illustrate its
meaning to business stakeholders.
Data can be presented in visual ways through charts, graphs, maps, diagrams,
and more. This is a primary way in which data scientists display their findings.
Examples of Data Visualization in Business
Representing data visually is an important skill because it makes data readily
understandable to executives, clients, and customers. According to Markets
and Markets, the market size for global data visualization tools is expected to
nearly double (to $10.2 billion) by 2026.
Companies can make faster, more informed decisions when presented with
data that is easy to understand and interpret. Today, this is typically
accomplished through effective, visually accessible mediums such as graphs, 3D
models, and even augmented reality. As a result, it’s a good idea for aspiring
data professionals to consider learning such skills through a data science and
visualization bootcamp.
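As a minimal illustration, the matplotlib sketch below turns hypothetical monthly sales figures into a bar chart that could be shared with stakeholders.

    import matplotlib.pyplot as plt

    # Hypothetical monthly sales surfaced by a mining run
    months = ["Jan", "Feb", "Mar", "Apr"]
    units = [230, 310, 280, 390]

    fig, ax = plt.subplots()
    ax.bar(months, units)              # one bar per month
    ax.set_xlabel("Month")
    ax.set_ylabel("Units sold")
    ax.set_title("Monthly sales (illustrative data)")
    fig.savefig("monthly_sales.png")   # export the chart for the audience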
5. Classification
Classification is a fundamental technique in data mining and can be applied to
nearly every industry. It is a process in which data points from large data sets
are assigned to categories based on how they’re being used.
In data mining, classification is considered to be a form of clustering — that is,
it is useful for extracting comparable points of data for comparative analysis.
Classification is also used to designate broad groups within a demographic,
target audience, or user base through which businesses can gain stronger
insights.
Examples of Classification in Business
Financial institutions classify consumers based on many variables to market
new loans or project credit card risks. Meanwhile, weather apps classify data to
project snowfall totals and other similar figures. Grocery stores also use
classification to group products by the consumers who buy them, helping
forecast buying patterns.
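A brief sketch of classification with scikit-learn's decision tree on hypothetical loan applicants; the features, labels, and tree depth are assumptions for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical applicants: [annual income (k$), number of existing debts]
    X = [[30, 4], [85, 1], [45, 3], [120, 0], [25, 5], [95, 2]]
    y = ["high risk", "low risk", "high risk", "low risk", "high risk", "low risk"]

    # Learn rules that assign each applicant to a risk category
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    print(clf.predict([[50, 1], [20, 6]]))  # categorize two new applicants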
6. Machine Learning
Machine learning is the process by which computers use algorithms to learn on
their own. An increasingly relevant part of modern technology, machine
learning makes computers “smarter” by teaching them how to perform tasks
based on the data they have gathered.
In data mining, machine learning’s applications are vast. Machine learning and
data mining fall under the umbrella of data science but aren’t interchangeable
terms. For instance, computers perform data mining as part of their machine
learning functions.
Examples of Machine Learning in Business
With machine learning, companies can use computers to quickly identify all
sorts of data patterns (in sales, product usage, buying habits, etc.) and develop
business plans using those insights. This is a growing need in many industries.
According to a MicroStrategy survey, 18 percent of analytics professionals said
machine learning and AI will have the most significant impact on their
strategies over the next five years. Learning more advanced topics like machine
learning is thus becoming imperative for data scientists.
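A minimal sketch of the learn-from-data loop using scikit-learn, with made-up usage records; holding out a test set is what lets the model's learning be judged on examples it never saw.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical usage records: [sessions per week, avg session minutes]
    X = [[1, 5], [2, 8], [9, 40], [8, 35], [1, 3], [10, 50], [7, 30], [2, 6]]
    y = [0, 0, 1, 1, 0, 1, 1, 0]  # 1 = likely to renew, 0 = likely to churn

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    model = LogisticRegression().fit(X_train, y_train)     # the "learning" step
    print(accuracy_score(y_test, model.predict(X_test)))   # generalization check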
7. Neural Networks
Computers process large amounts of data much faster than human brains but
don’t yet have the capacity to apply common sense and imagination in working
with the data. Neural networks are one way to help computers reason more
like humans.
Artificial neural networks attempt to digitally mimic the way the human brain
operates. Neural networks combine many computer processors (similar to the
way the brain uses neurons) to process data, make decisions, and learn as a
human would — or at least as closely as possible.
Examples of Neural Networks in Business
Neural networks have a wide range of applications. They can help businesses
predict consumer buying patterns and focus marketing campaigns on specific
demographics. They can also help retailers make accurate sales forecasts and
understand how to use dynamic pricing. Furthermore, they help to improve
diagnostic and treatment methods in healthcare, improving care and
performance.
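As a toy illustration, the NumPy sketch below trains a single artificial neuron by gradient descent to compute the logical AND of two inputs; real neural networks stack many such units, but the learning loop is the same in spirit.

    import numpy as np

    # Inputs and the target output (logical AND)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=2), 0.0   # weights and bias, like synapse strengths

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(5000):                # training loop
        out = sigmoid(X @ w + b)         # forward pass: the neuron's guess
        grad = out - y                   # error signal to learn from
        w -= 0.5 * X.T @ grad / len(X)   # nudge weights toward lower error
        b -= 0.5 * grad.mean()

    print(np.round(sigmoid(X @ w + b)))  # -> [0. 0. 0. 1.]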
8. Outlier Detection
Outlier detection is a key component of maintaining safe databases. Companies
use it to test for fraudulent transactions, such as abnormal credit card usage
that might suggest theft.
While other data mining methods seek to identify patterns and trends, outlier
detection looks for the unique: the data point or points that differ from the rest
or diverge from the overall sample. Outlier detection finds errors, such as data
that was input incorrectly or extracted from the wrong sample. Natural data
deviations can be instructive as well.
Examples of Outlier Detection in Business
Almost every business can benefit from understanding anomalies in their
production or distribution lines and how to fix them. Retailers can use outlier
detection to learn why their stores witness an odd increase in purchases, such
as snow shovels being bought in the summer, and how to respond to such
findings.
Generally, outlier detection is employed to enhance logistics, instill a culture of
preemptive damage control, and create a smoother environment for
customers, users, and other key groups.
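A minimal sketch of one common approach, z-score outlier detection, applied to hypothetical transaction amounts; the 2-standard-deviation threshold is an illustrative choice, not a universal rule.

    import numpy as np

    # Hypothetical daily transaction amounts for one credit card
    amounts = np.array([42.0, 38.5, 55.0, 47.2, 39.9, 51.3, 980.0, 44.6])

    z = (amounts - amounts.mean()) / amounts.std()  # standard scores

    # Flag points more than 2 standard deviations from the mean
    print(amounts[np.abs(z) > 2])  # -> [980.]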
9. Prediction
Predictive modeling seeks to turn data into a projection of future action or
behavior. These models examine data sets to find patterns and trends, then
calculate the probabilities of a future outcome.
Predictive modeling is among the most common uses of data mining and works
best with large data sets that represent a broad sample size.
Examples of Prediction in Business
Predictive modeling is a business imperative that impacts nearly every corner
of the public and private sectors. According to MicroStrategy, 52 percent of
global businesses consider advanced and predictive modeling their top priority
in analytics.
Predictive models can be built to determine sales projections and predict
consumer buying habits. They help manufacturers forecast distribution needs
and determine maintenance schedules. Government agencies use census data
to map population trends and project spending needs while baseball teams use
predictive models to determine contracts and build rosters.
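A minimal sketch of predictive modeling under a deliberately simple assumption: a straight-line trend fitted to hypothetical quarterly sales, then extrapolated one quarter ahead.

    import numpy as np

    # Hypothetical sales (units) for the last eight quarters
    quarters = np.arange(8)
    sales = np.array([100, 112, 119, 133, 141, 155, 160, 174])

    # Fit a linear trend: sales ~ slope * quarter + intercept
    slope, intercept = np.polyfit(quarters, sales, 1)

    forecast = slope * 8 + intercept  # project the next quarter
    print(f"projected sales for quarter 8: {forecast:.0f}")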
The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along
the data mining process. Without this structure, an analyst may encounter an
issue in the middle of their analysis that could have easily been prevented had
they prepared for it earlier. The data mining process is usually broken into the
following steps.
Step 1: Understand the Business
Before any data is touched, extracted, cleaned, or analyzed, it is important to
understand the underlying entity and the project at hand. What are the goals
the company is trying to achieve by mining data? What is their current business
situation? What are the findings of a SWOT analysis? Before looking at any
data, the mining process starts by understanding what will define success at
the end of the process.
Step 2: Understand the Data
Once the business problem has been clearly defined, it's time to start thinking
about data. This includes what sources are available, how they will be secured
and stored, how the information will be gathered, and what the final outcome
or analysis may look like. This step also includes determining the limits of the
data, storage, security, and collection and assesses how these constraints will
affect the data mining process.
Step 3: Prepare the Data
Data is gathered, uploaded, extracted, or calculated. It is then cleaned,
standardized, scrubbed for outliers, assessed for mistakes, and checked for
reasonableness. During this stage of data mining, the data may also be checked
for size as an oversized collection of information may unnecessarily slow
computations and analysis.
Step 4: Build the Model
With our clean data set in hand, it's time to crunch the numbers. Data scientists
use the types of data mining above to search for relationships, trends,
associations, or sequential patterns. The data may also be fed into predictive
models to assess how previous bits of information may translate into future
outcomes.
Step 5: Evaluate the Results
The data-centered aspect of data mining concludes by assessing the findings of
the data model or models. The outcomes from the analysis may be aggregated,
interpreted, and presented to decision-makers who have largely been excluded
from the data mining process to this point. In this step, organizations can
choose to make decisions based on the findings.
Step 6: Implement Change and Monitor
The data mining process concludes with management taking steps in response
to the findings of the analysis. The company may decide the information was
not strong enough or the findings were not relevant, or the company may
strategically pivot based on findings. In either case, management reviews the
ultimate impacts of the business and recreates future data mining loops by
identifying new business problems or opportunities.
Different data mining process models have different steps, though the
general process is usually quite similar. For example, the Knowledge
Discovery in Databases (KDD) model has nine steps, the CRISP-DM model has
six steps, and the SEMMA process model has five steps.
Applications of Data Mining
In today's age of information, almost any department, industry, sector, or
company can make use of data mining.
Sales
Data mining encourages smarter, more efficient use of capital to drive revenue
growth. Consider the point-of-sale register at your favorite local coffee shop.
For every sale, that coffeehouse collects the time a purchase was made and
what products were sold. Using this information, the shop can strategically
craft its product line.
Marketing
Once the coffeehouse above knows its ideal line-up, it's time to implement the
changes. However, to make its marketing efforts more effective, the store can
use data mining to understand where its clients see ads, what demographics to
target, where to place digital ads, and what marketing strategies most resonate
with customers. This includes aligning marketing campaigns, promotional
offers, cross-sell offers, and programs to the findings of data mining.
Manufacturing
For companies that produce their own goods, data mining plays an integral part
in analyzing how much each raw material costs, what materials are being used
most efficiently, how time is spent along the manufacturing process, and what
bottlenecks negatively impact the process. Data mining helps ensure the flow
of goods is uninterrupted.
Fraud Detection
The heart of data mining is finding patterns, trends, and correlations that link
data points together. Therefore, a company can use data mining to identify
outliers or correlations that should not exist. For example, a company may
analyze its cash flow and find a recurring transaction to an unknown account.
If this is unexpected, the company may wish to investigate whether funds are
being mismanaged.
Human Resources
Human resources departments often have a wide range of data available for
processing including data on retention, promotions, salary ranges, company
benefits, use of those benefits, and employee satisfaction surveys. Data mining
can correlate this data to get a better understanding of why employees leave
and what entices new hires.
Customer Service
Customer satisfaction may be caused (or destroyed) for a variety of reasons.
Imagine a company that ships goods. A customer may be dissatisfied with
shipping times, shipping quality, or communications. The same customer may
be frustrated with long telephone wait times or slow e-mail responses. Data
mining gathers operational information about customer interactions and
summarizes the findings to pinpoint weak points and highlight what the
company is doing right.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized
into tables, records, and columns, from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and
share information, which facilitates data searchability, reporting, and
organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources
within the organization to provide meaningful business insights. The huge
amount of data comes from multiple places such as Marketing and Finance.
The extracted data is utilized for analytical purposes and helps in decision-
making for a business organization. The data warehouse is designed for the
analysis of data rather than transaction processing.
Data Repositories:
A data repository generally refers to a destination for data storage. However,
many IT professionals use the term more specifically to refer to a particular
kind of setup within an IT structure, for example, a group of databases where
an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database
model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close
the gap between the Relational database and the object-oriented model
practices frequently utilized in many programming languages, for example, C++,
Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS)
that can undo a database transaction if it is not performed appropriately.
Although this was once a unique capability, today most relational database
systems support transactional operations.
Advantages of Data Mining
o The data mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.
Disadvantages of Data Mining
o There is a probability that organizations may sell useful customer data to other organizations for money. It has been reported, for example, that American Express sold customers' credit card purchase data to other organizations.
o Much data mining analytics software is difficult to operate and requires advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their design, so selecting the right data mining tool is a very challenging task.
o Data mining techniques are not precise, so they may lead to severe consequences in certain conditions.
Concept of Big Data
Data that is very large in size is called Big Data. Normally we work with data
of megabyte size (Word documents, Excel files) or at most gigabyte size
(movies, code), but data on the order of petabytes, i.e., 10^15 bytes, is called
Big Data. It is often stated that almost 90% of today's data has been
generated in the past three years.
Sources of Big Data
These data come from many sources:
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Big Data Characteristics
Big Data involves amounts of data that cannot be processed by traditional
data storage or processing units. It is used by many multinational companies
to process data and run the business of many organizations. The data flow
can exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself is related to enormous size. Big Data refers to the
vast volumes of data generated daily from many sources, such as business
processes, machines, social media platforms, networks, human interactions,
and many more.
Facebook, for example, generates approximately a billion messages a day,
records roughly 4.5 billion clicks of the "Like" button, and receives more than
350 million new posts each day. Big data technologies can handle such large
amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and it is
collected from different sources. In the past, data was collected only from
databases and spreadsheets, but these days data arrives in many forms:
PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:
a. Structured data: Structured data follows a fixed schema with all the
required columns. It is in tabular form and is stored in a relational database
management system.
b. Semi-structured data: In semi-structured data, the schema is not fully
defined, e.g., JSON, XML, CSV, TSV, and email. (OLTP (Online Transaction
Processing) systems, by contrast, are built to work with structured data
stored in relations, i.e., tables.) A sketch of turning semi-structured records
into a table follows this list.
c. Unstructured data: Unstructured files, such as log files, audio files, and
image files, fall into this category. Some organizations have a great deal of
data available, but they do not know how to derive value from it because the
data is raw.
d. Quasi-structured data: Textual data with inconsistent formats that can be
structured with time, effort, and the right tools.
Example: Web server logs, i.e., a log file created and maintained by a server,
containing a list of its activities.
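As a small sketch of the structured/semi-structured distinction, the Python snippet below flattens semi-structured JSON records (whose fields vary) into a structured table, where the imposed schema marks missing fields as NaN; the records themselves are made up.

    import json
    import pandas as pd

    # Semi-structured input: fields are not fixed in advance
    raw = '''[
      {"user": "ana",  "action": "login"},
      {"user": "ben",  "action": "purchase", "amount": 25.0},
      {"user": "cara", "action": "login",    "device": "mobile"}
    ]'''

    records = json.loads(raw)

    # Flattening into a table imposes a schema; absent fields become NaN
    print(pd.DataFrame(records))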
Veracity
Veracity refers to how reliable the data is. There are many ways to filter or
translate data, and veracity concerns the ability to handle and manage data
efficiently, which matters for business development. Consider, for example,
the mixed quality of Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. What matters is not merely
the data we process or store, but the valuable and reliable data that we
store, process, and analyze.
Velocity
Velocity plays an important role compared to the other V's. Velocity refers to
the speed at which data is created in real time. It encompasses the speed of
incoming data sets, the rate of change, and bursts of activity. A primary
aspect of Big Data is to provide in-demand data rapidly.
Big data velocity deals with the speed at which data flows in from sources
such as application logs, business processes, networks, social media sites,
sensors, and mobile devices.