IM III Unit Notes
DBMS – types and evolution, RDBMS, OODBMS, ORDBMS, Data warehousing, Data Mart,
Data mining
Database - defined
A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database management
system (DBMS).
What is Database?
A database is a systematic collection of data. Databases support electronic storage and
manipulation of data, which makes data management easy.
Together, the data and the DBMS, along with the applications that are associated with them, are
referred to as a database system, often shortened to just database.
Let us discuss a database example: Facebook. It needs to store, manipulate, and present data
related to members, their friends, member activities, messages, advertisements, and a lot more.
We can provide a countless number of examples for the usage of databases.
Types of Databases
Here are some popular types of databases.
1. Distributed databases:
A distributed database combines contributions from a common database with information captured
by local computers. In this type of database system, the data is not kept in one place but is
distributed across various sites of an organization.
2. Relational databases:
This type of database defines database relationships in the form of tables. It is also called
a Relational DBMS (RDBMS), which is the most popular DBMS type in the market. Examples of
RDBMS systems include MySQL, Oracle, and Microsoft SQL Server.
3. Object-oriented databases:
This type of database supports the storage of all data types. The data is stored in the
form of objects. The objects held in the database have attributes and methods that define
what to do with the data. PostgreSQL is an example of an object-relational DBMS.
4. Centralized database:
The data is stored at a centralized location, and users from different backgrounds can access
it. This type of database stores application procedures that help users access the data even
from a remote location.
5. Data warehouses:
A data warehouse facilitates a single version of truth for a company for decision making and
forecasting. A data warehouse is an information system that contains historical and cumulative
data from single or multiple sources. The data warehouse concept simplifies the reporting and
analysis process of the organization.
6. NoSQL databases:
A NoSQL database is used for large sets of distributed data. There are a few big data performance
problems that are not handled effectively by relational databases, and NoSQL databases address
them. This type of database is very efficient in analyzing large-size unstructured data.
7. Graph databases:
A graph-oriented database uses graph theory to store, map, and query relationships. These kinds
of databases are mostly used for analyzing interconnections. For example, an
organization can use a graph database to mine data about customers from social media.
8. OLTP databases:
OLTP is another database type, which is able to perform fast query processing and maintain data
integrity in multi-access environments.
9. Personal database:
A personal database stores data on personal computers; such databases are smaller and easily
manageable. The data is mostly used by the same department of the company and is accessed by
a small group of people.
10. Hierarchical:
This type of DBMS employs the “parent-child” relationship for storing data. Its structure is like a
tree, with nodes representing records and branches representing fields. The Windows registry used
in Windows XP is an example of a hierarchical database.
11. Network DBMS:
This type of DBMS supports many-to-many relations. It usually results in complex database
structures. RDM Server is an example of a database management system that implements the
network model.
Some of the latest databases include
12. Open-source databases:
This kind of database stores information related to operations. It is mainly used in the fields of
marketing, employee relations, and customer service.
13. Cloud databases:
A cloud database is a database that is optimized or built for a virtualized environment.
A cloud database has many advantages, such as the ability to pay for storage capacity
and bandwidth as needed. It also offers scalability on demand, along with high availability.
14. Self-driving databases:
The newest and most groundbreaking type of database, self-driving databases (also known as
autonomous databases) are cloud-based and use machine learning to automate database
tuning, security, backups, updates, and other routine management tasks traditionally
performed by database administrators.
15. Multimodal database:
The multimodal database is a type of data processing platform that supports multiple data models
that define how the certain knowledge and information in a database should be organized and
arranged.
16. Document/JSON database:
In a document-oriented database, the data is kept in document collections, usually in
XML, JSON, or BSON format. One record can store as much data as you want, in any data type
(or types) you prefer.
Database Components
DBMS
DBMS stands for Database Management System. We can break it down like this: DBMS =
Database + Management System.
A database management system stores data in such a way that it becomes easier to
retrieve, manipulate, and produce information. A DBMS is a collection of inter-related data
and a set of programs to store and access that data in an easy and effective manner.
What is database software?
Database software is used to create, edit, and maintain database files and records, enabling easier
file and record creation, data entry, data editing, updating, and reporting. The software also
handles data storage, backup and reporting, multi-access control, and security. Database software
is sometimes also referred to as a “database management system” (DBMS).
Database software makes data management simpler by enabling users to store data in a
structured form and then access it. It typically has a graphical interface to help create and
manage the data and, in some cases, users can construct their own databases by using database
software.
Need of DBMS
Database systems are basically developed for large amounts of data. When dealing with
huge amounts of data, there are two things that require optimization:
(i) Storage of data and (ii) retrieval of data.
Storage: According to the principles of database systems, the data is stored in such a way
that it occupies a lot less space, as the redundant data (duplicate data) is removed
before storage. Let’s take a simple example to understand this:
In a banking system, suppose a customer has two accounts, one a saving account
and the other a salary account. Let’s say the bank stores saving account data at one place
(these places are called tables; we will learn about them later) and salary account data at
another place. If the customer information, such as customer name and address, is
stored at both places, then this is just a waste of storage (redundancy/duplication of
data). To organize the data in a better way, the information should be stored at one place
and both accounts should be linked to that information somehow. This is what we
achieve in a DBMS.
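The banking example can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration: the table names (customer, account), the column names, and the sample customer are assumptions made for this sketch, not part of the notes.

```python
import sqlite3

# Hypothetical schema: one customer row, two account rows that point to it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT NOT NULL
    );
    CREATE TABLE account (
        account_id   INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        account_type TEXT NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Asha', '12 Park Road');
    INSERT INTO account VALUES (101, 1, 'saving');
    INSERT INTO account VALUES (102, 1, 'salary');
""")

# The name and address are stored once and reached through a join.
rows = conn.execute("""
    SELECT a.account_type, c.name, c.address
    FROM account AS a
    JOIN customer AS c ON c.customer_id = a.customer_id
    ORDER BY a.account_id
""").fetchall()
print(rows)
```

Both accounts show the same name and address, yet the customer table holds that information only once.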
Fast Retrieval of data: Along with storing the data in an optimized and systematic
manner, it is also important that we retrieve the data quickly when needed. Database
systems ensure that the data is retrieved as quickly as possible.
Purpose of Database Systems
The main purpose of database systems is to manage data. Consider a university that
keeps the data of students, teachers, courses, books etc. To manage this data we need to
store it somewhere where we can add new data, delete unused data, update
outdated data, and retrieve data. To perform these operations we need a database
management system that allows us to store the data in such a way that all these
operations can be performed on the data efficiently.
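The four operations named above (add, update, delete, retrieve) can be tried out with Python's built-in sqlite3 module. The student table, its columns, and the sample rows below are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, course TEXT)"
)

# Add new data.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ravi", "Physics"), (2, "Meena", "History"), (3, "John", "Maths")],
)
# Update outdated data.
conn.execute("UPDATE student SET course = ? WHERE student_id = ?", ("Chemistry", 1))
# Delete unused data.
conn.execute("DELETE FROM student WHERE student_id = ?", (3,))
# Retrieve data.
students = conn.execute(
    "SELECT student_id, name, course FROM student ORDER BY student_id"
).fetchall()
print(students)
```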
Applications where we use Database Management Systems are:
DBMS Architecture
The architecture of a DBMS depends on the computer system on which it runs. For
example, in a client-server DBMS architecture, the database system at the server machine
can serve several requests made by client machines. We will understand this communication
with the help of diagrams.
Types of DBMS Architecture
There are three types of DBMS architecture:
1. Single tier architecture
2. Two tier architecture
3. Three tier architecture
1. Single tier architecture
In this type of architecture, the database is readily available on the client machine, so any
request made by the client doesn’t require a network connection to perform the action on the
database.
For example, let’s say you want to fetch the records of an employee from the database and
the database is available on your computer system. The request to fetch employee
details will be made by your computer, and the records will be fetched from the database
by your computer as well. This type of system is generally referred to as a local database
system.
2. Two tier architecture
In two-tier architecture, the database system is present at the server machine and the
DBMS application is present at the client machine. These two machines are connected
with each other through a reliable network as shown in the above diagram.
Whenever the client machine makes a request to access the database present at the server using a
query language like SQL, the server performs the request on the database and returns the
result to the client. Application connection interfaces such as JDBC and ODBC are
used for the interaction between server and client.
3. Three tier architecture
In three-tier architecture, another layer is present between the client machine and the server
machine. In this architecture, the client application doesn’t communicate directly with the
database system present at the server machine. Instead, the client application
communicates with a server application, and the server application internally communicates
with the database system present at the server.
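The three tiers can be imitated inside a single Python program to show who talks to whom. This is a simplified sketch: the function names (client_application, server_application), the request format, and the employee table are assumptions made for illustration, and a real system would place each tier on a separate machine or process.

```python
import sqlite3

# Database tier: an in-memory SQLite database stands in for the database server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO employee VALUES (1, 'Priya')")


def server_application(request):
    """Middle tier: validates the request, then talks to the database."""
    emp_id = request.get("emp_id")
    if not isinstance(emp_id, int):
        return {"status": "error", "message": "emp_id must be an integer"}
    row = db.execute(
        "SELECT name FROM employee WHERE emp_id = ?", (emp_id,)
    ).fetchone()
    if row is None:
        return {"status": "not_found"}
    return {"status": "ok", "name": row[0]}


def client_application(emp_id):
    """Client tier: only knows about the server application, never the database."""
    return server_application({"emp_id": emp_id})


print(client_application(1))
print(client_application(99))
```

The client code never sees a SQL statement or a database connection, which is the point of the middle layer.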
The design of a database at the physical level is called the physical schema. How the data is
stored in blocks of storage is described at this level.
The design of a database at the logical level is called the logical schema. Programmers and
database administrators work at this level. Here, data is described as certain types of
data records stored in data structures; however, internal details such as the
implementation of the data structures are hidden at this level (they are available at the
physical level).
The design of a database at the view level is called the view schema. This generally describes
end user interaction with database systems.
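The difference between the logical level and the view level can be seen with a table and a view in SQLite. The employee table, the employee_directory view, and the sample row are hypothetical; the physical level (how SQLite lays out pages on disk) is handled by the database engine and does not appear in the code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Logical level: the record type and its fields.
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT,
        salary REAL
    );
    -- View level: what one group of end users is allowed to see.
    CREATE VIEW employee_directory AS
        SELECT emp_id, name FROM employee;
    INSERT INTO employee VALUES (1, 'Priya', 52000.0);
""")

cursor = conn.execute("SELECT * FROM employee_directory")
view_columns = [col[0] for col in cursor.description]
print(view_columns, cursor.fetchall())
```

Users of the view see the employee id and name but not the salary column.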
DBMS Instance
Definition of instance: The data stored in a database at a particular moment of time is
called an instance of the database. The database schema defines the variable declarations in
tables that belong to a particular database; the value of these variables at a moment of time is
called the instance of that database.
For example, let’s say we have a single table student in the database. Today the table has
100 records, so today the instance of the database has 100 records. Let’s say we are going
to add another 100 records to this table by tomorrow, so the instance of the database
tomorrow will have 200 records in the table. In short, the data stored in the database at a
particular moment is called the instance, and it changes over time as we add or delete data
from the database.
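The 100-record and 200-record example can be reproduced as a small sketch with sqlite3. The schema stays the same throughout; only the instance changes. The column names and generated student names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is declared once and does not change below.
conn.execute("CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT)")


def record_count():
    """Size of the current instance of the student table."""
    return conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]


conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(i, f"student{i}") for i in range(1, 101)],
)
today = record_count()

conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(i, f"student{i}") for i in range(101, 201)],
)
tomorrow = record_count()

print(today, tomorrow)
```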
DBMS languages
Database languages are used to read, update and store data in a database. There are
several such languages that can be used for this purpose; one of them is SQL (Structured
Query Language).
111 Ashish 23
123 Saurav 22
169 Lester 24
234 Lou 26
Table: Course
123 Steve 29
367 Chaitanya 27
234 Ajeet 28
Course Table:
• The hierarchical model was developed by IBM and North American Rockwell and is known as
the Information Management System (IMS).
• It represents the data in a hierarchical tree structure.
• This model is the first DBMS model.
• In this model, the data is organized hierarchically.
• It uses pointers to navigate between the stored data.
2. Relational Model
• The Network Database Model is similar to the Hierarchical Model; the only difference is that
it allows a record to have more than one parent.
• In this model, there is no need for the parent-to-child association of the hierarchical model.
• It replaces the hierarchical tree with a graph.
• It represents the data as record types and one-to-many relationships.
• This model is easy to design and understand.
In this diagram,
5. Object Model
• Object model stores the data in the form of objects, classes and inheritance.
• This model handles more complex applications, such as Geographic Information System
(GIS), scientific experiments, engineering design and manufacturing.
• It is used in File Management System.
• It represents real world objects, attributes and behaviors.
• It provides a clear modular structure.
• It is easy to maintain and modify the existing code.
RDBMS Concepts
RDBMS stands for relational database management system. A relational model can be
represented as a table of rows and columns. A relational database has following major
components:
1. Table
2. Record or Tuple
3. Field or Column name or Attribute
4. Domain
5. Instance
6. Schema
7. Keys
1. Table
A table is a collection of data represented in rows and columns. Each table has a name in
database. For example, the following table “STUDENT” stores the information of
students in database.
Table: STUDENT
RDBMS vs OODBMS
1. Definition
   RDBMS: RDBMS stands for Relational Database Management System.
   OODBMS: OODBMS stands for Object Oriented Database Management System.
3. Data Complexity
   RDBMS: RDBMS handles simple data.
   OODBMS: OODBMS handles large and complex data.
4. Term
   RDBMS: An entity refers to a collection of similar items having the same definition.
   OODBMS: A class refers to a group of objects having common relationships, behaviors and
   properties.
5. Data Handling
   RDBMS: RDBMS handles only data.
   OODBMS: OODBMS handles both data and functions operating on that data.
7. Key
   RDBMS: A primary key identifies an object in a table uniquely.
   OODBMS: An object ID (OID) represents an object uniquely in a group of objects.
Data Warehousing
A data warehouse is constructed by integrating data from multiple heterogeneous sources. It
supports analytical reporting, structured and/or ad hoc queries and decision making.
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
What is OLAP?
Online Analytical Processing (OLAP) is a category of software tools that provide analysis of
data for business decisions. OLAP systems allow users to analyze database information from
multiple database systems at one time.
The primary objective is data analysis and not data processing.
Example of OLAP
Any data warehouse system is an OLAP system. Uses of OLAP are as follows:
• A company might compare its mobile phone sales in September with sales in October,
then compare those results with another location, which may be stored in a separate
database.
• Amazon analyzes purchases by its customers to come up with a personalized homepage
with products that are likely to interest each customer.
What is OLTP?
Online transaction processing, known as OLTP for short, supports transaction-oriented
applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an
organization.
The primary objective is data processing and not data analysis.
Example of OLTP system
An example of an OLTP system is an ATM center. Assume that a couple has a joint account with a
bank. One day both reach different ATM centers at precisely the same time and
want to withdraw the total amount present in their bank account.
However, the person who completes the authentication process first will be able to get the
money. In this case, the OLTP system makes sure that the withdrawn amount will never be more
than the amount present in the bank. The key point to note here is that OLTP systems are
optimized for transactional superiority instead of data analysis.
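The ATM scenario can be approximated with a transaction in SQLite. This is a minimal sketch, not how a bank's OLTP system is built: the account table, the starting balance of 500, and the withdraw function are assumptions, and the two withdrawals run one after the other rather than truly at the same time. The idea it shows is that the balance check and the update happen inside one transaction, so the second request cannot take money that is no longer there.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE account ("
    "account_id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 500)")


def withdraw(account_id, amount):
    """Withdraw inside one transaction; return True if it succeeded."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # The update only matches the row if the balance covers the amount.
        cur = conn.execute(
            "UPDATE account SET balance = balance - ? "
            "WHERE account_id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        if cur.rowcount == 1:
            conn.execute("COMMIT")
            return True
        conn.execute("ROLLBACK")
        return False
    except Exception:
        conn.execute("ROLLBACK")
        raise


first = withdraw(1, 500)   # first person to be authenticated
second = withdraw(1, 500)  # second person, a moment later
balance = conn.execute(
    "SELECT balance FROM account WHERE account_id = 1"
).fetchone()[0]
print(first, second, balance)
```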
KEY DIFFERENCE between OLTP and OLAP:
• Online Analytical Processing (OLAP) is a category of software tools that analyze data
stored in a database, whereas Online Transaction Processing (OLTP) supports
transaction-oriented applications in a 3-tier architecture.
• OLAP creates a single platform for all types of business analysis needs, which include
planning, budgeting, forecasting, and analysis, while OLTP is useful for administering the
day-to-day transactions of an organization.
• OLAP is characterized by a large volume of data, while OLTP is characterized by large
numbers of short online transactions.
• In OLAP, a data warehouse is created uniquely so that it can integrate different data
sources for building a consolidated database, whereas OLTP uses a traditional DBMS.
OLTP vs OLAP
Process
   OLTP: It is an online transactional system. It manages database modification.
   OLAP: OLAP is an online analysis and data retrieving process.
Method
   OLTP: OLTP uses a traditional DBMS.
   OLAP: OLAP uses the data warehouse.
Table
   OLTP: Tables in an OLTP database are normalized.
   OLAP: Tables in an OLAP database are not normalized.
Source
   OLTP: OLTP and its transactions are the sources of data.
   OLAP: Different OLTP databases become the source of data for OLAP.
Data Integrity
   OLTP: An OLTP database must maintain data integrity constraints.
   OLAP: An OLAP database does not get frequently modified. Hence, data integrity is not an issue.
Response time
   OLTP: Its response time is in milliseconds.
   OLAP: Response time is in seconds to minutes.
Data quality
   OLTP: The data in the OLTP database is always detailed and organized.
   OLAP: The data in the OLAP process might not be organized.
Usefulness
   OLTP: It helps to control and run fundamental business tasks.
   OLAP: It helps with planning, problem-solving, and decision support.
Back-up
   OLTP: Complete backup of the data combined with incremental backups.
   OLAP: OLAP only needs a backup from time to time. Backup is not as important as for OLTP.
User type
   OLTP: It is used by data-critical users like clerks, DBAs and database professionals.
   OLAP: Used by data knowledge users like workers, managers, and CEOs.
Number of users
   OLTP: This kind of database allows thousands of users.
   OLAP: This kind of database allows only hundreds of users.
Productivity
   OLTP: It helps to increase users’ self-service and productivity.
   OLAP: It helps to increase the productivity of business analysts.
Data Mart
Data marts contain a subset of organization-wide data that is valuable to specific groups of
people in an organization. In other words, a data mart contains only the data that is specific to
a particular group. For example, the marketing data mart may contain only data related to items,
customers, and sales. Data marts are confined to subjects.
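The idea of a data mart as a subject-specific subset of a warehouse can be sketched with two tables in SQLite. The table names (warehouse_facts, marketing_mart), the columns, and the sample rows are hypothetical; a real data mart would be loaded by a scheduled process rather than a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Organization-wide data held in the warehouse.
    CREATE TABLE warehouse_facts (department TEXT, item TEXT, amount REAL);
    INSERT INTO warehouse_facts VALUES
        ('marketing', 'campaign A', 1200.0),
        ('finance',   'audit',       800.0),
        ('marketing', 'campaign B',  300.0);

    -- The marketing data mart keeps only the rows its users need.
    CREATE TABLE marketing_mart AS
        SELECT item, amount
        FROM warehouse_facts
        WHERE department = 'marketing';
""")

mart_rows = conn.execute(
    "SELECT item, amount FROM marketing_mart ORDER BY item"
).fetchall()
print(mart_rows)
```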
• Data marts contain a subset of organization-wide data. This Data is valuable to a specific
group of people in an organization.
• It is a cost-effective alternative to a data warehouse, which can be costly to build.
• A data mart allows faster access to data.
• A data mart is easy to use as it is specifically designed for the needs of its users. Thus a
data mart can accelerate business processes.
• Data marts need less implementation time compared to data warehouse systems. It is
faster to implement a data mart as you only need to concentrate on a subset of the data.
• It contains historical data, which enables the analyst to determine data trends.
Disadvantages of a Data Mart
• Many times, enterprises create too many disparate and unrelated data marts without
much benefit. They can become a big hurdle to maintain.
• A data mart cannot provide company-wide data analysis as its data set is limited.
Data Warehouse vs Data Mart
Definition
   Data Warehouse: A Data Warehouse is a large repository of data collected from different
   organizations or departments within a corporation.
   Data Mart: A data mart is only a subtype of a Data Warehouse. It is designed to meet the
   needs of a certain user group.
Usage
   Data Warehouse: It helps to take strategic decisions.
   Data Mart: It helps to take tactical decisions for the business.
Objective
   Data Warehouse: The main objective of a Data Warehouse is to provide an integrated
   environment and a coherent picture of the business at a point in time.
   Data Mart: A data mart is mostly used in a business division at the department level.
Designing
   Data Warehouse: The designing process of a Data Warehouse is quite difficult. It may or may
   not use a dimensional model. However, it can feed dimensional models.
   Data Mart: The designing process of a Data Mart is easy. It is built around a dimensional
   model using a star schema.
Data Handling
   Data Warehouse: Data warehousing covers a large area of the corporation, which is why it
   takes a long time to process.
   Data Mart: Data marts are easy to use, design and implement as they handle only small
   amounts of data.
Focus
   Data Warehouse: Data warehousing is broadly focused on all the departments. It is possible
   that it can even represent the entire company.
   Data Mart: A Data Mart is subject-oriented, and it is used at a department level.
Data type
   Data Warehouse: The data stored inside the Data Warehouse is always detailed when compared
   with a data mart.
   Data Mart: Data Marts are built for particular user groups. Therefore, the data is short and
   limited.
Subject-area
   Data Warehouse: The main objective of a Data Warehouse is to provide an integrated
   environment and a coherent picture of the business at a point in time.
   Data Mart: Mostly holds only one subject area, for example, sales figures.
Data storing
   Data Warehouse: Designed to store enterprise-wide decision data, not just marketing data.
   Data Mart: Dimensional modeling and star schema design are employed for optimizing the
   performance of the access layer.
Data type
   Data Warehouse: Time variance and non-volatile design are strictly enforced.
   Data Mart: Mostly includes consolidation data structures to meet the subject area's query
   and reporting needs.
Data value
   Data Warehouse: Read-only from the end-users' standpoint.
   Data Mart: Transaction data, regardless of grain, fed directly from the Data Warehouse.
Scope
   Data Warehouse: Data warehousing is more helpful as it can bring information from any
   department.
   Data Mart: A data mart contains the data of a specific department of a company. There may be
   separate data marts for sales, finance, marketing, etc. It has limited usage.
Source
   Data Warehouse: In a Data Warehouse, data comes from many sources.
   Data Mart: In a Data Mart, data comes from very few sources.
Size
   Data Warehouse: The size of the Data Warehouse may range from 100 GB to 1 TB+.
   Data Mart: The size of a Data Mart is less than 100 GB.
Implementation time
   Data Warehouse: The implementation process of a Data Warehouse can extend from months to
   years.
   Data Mart: The implementation process of a Data Mart is restricted to a few months.
Data Mining
Data Mining is defined as the procedure of extracting information from huge sets of data. In
other words, we can say that data mining is mining knowledge from data.
Business understanding:
• First, you need to understand business and client objectives. You need to define what
your client wants (which many times even they do not know themselves)
• Take stock of the current data mining scenario. Factor in resources, assumption,
constraints, and other significant factors into your assessment.
• Using business objectives and current scenario, define your data mining goals.
• A good data mining plan is very detailed and should be developed to accomplish both
business and data mining goals.
Data understanding:
In this phase, a sanity check is performed on the data to check whether it is appropriate for
the data mining goals.
• First, data is collected from multiple data sources available in the organization.
• These data sources may include multiple databases, flat files, or data cubes. Issues such as
object matching and schema integration can arise during the data integration process. It is
a quite complex and tricky process, as data from various sources is unlikely to match easily.
For example, table A contains an entity named cust_no whereas another table B contains an
entity named cust-id.
• Therefore, it is quite difficult to ensure whether both of these given objects refer to the
same value or not. Here, metadata should be used to reduce errors in the data integration
process.
• The next step is to search for properties of the acquired data. A good way to explore the
data is to answer the data mining questions (decided in the business phase) using query,
reporting, and visualization tools.
• Based on the results of the queries, the data quality should be ascertained. Missing data, if
any, should be acquired.
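The cust_no and cust-id mismatch mentioned above can be handled with a small piece of metadata that maps each source's column name to one shared name. The following is only a sketch: the COLUMN_METADATA dictionary, the shared name customer_id, the to_common_schema function, and the sample records are all assumptions made for illustration.

```python
# Hypothetical metadata: maps each source's column name to one shared name.
COLUMN_METADATA = {
    "table_a": {"cust_no": "customer_id"},
    "table_b": {"cust-id": "customer_id"},
}


def to_common_schema(source, record):
    """Rename the columns of one record using the metadata for its source."""
    mapping = COLUMN_METADATA[source]
    return {mapping.get(key, key): value for key, value in record.items()}


row_a = to_common_schema("table_a", {"cust_no": 42, "city": "Pune"})
row_b = to_common_schema("table_b", {"cust-id": 42, "balance": 100})

# After renaming, the two sources can be matched on the same field.
print(row_a["customer_id"] == row_b["customer_id"])
```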
Data preparation:
Evaluation:
In this phase, patterns identified are evaluated against the business objectives.
• Results generated by the data mining model should be evaluated against the business
objectives.
• Gaining business understanding is an iterative process. In fact, while gaining this
understanding, new business requirements may be raised because of data mining.
• A go or no-go decision is taken on moving the model into the deployment phase.
Deployment:
In the deployment phase, you ship your data mining discoveries to everyday business operations.
• The knowledge or information discovered during data mining process should be made
easy to understand for non-technical stakeholders.
• A detailed deployment plan, for shipping, maintenance, and monitoring of data mining
discoveries is created.
• A final project report is created with lessons learned and key experiences during the
project. This helps to improve the organization's business policy.
Data Mining Techniques
1. Classification:
This analysis is used to retrieve important and relevant information about data, and metadata.
This data mining method helps to classify data in different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are like each other. This
process helps to understand the differences and similarities between the data.
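Clustering can be illustrated with a very small k-means routine on one-dimensional data. k-means is one common clustering method, chosen here as an example; the notes do not name a specific algorithm. The function name kmeans_1d, the sample ages, and the starting centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assign each value to its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


ages = [21, 22, 23, 58, 60, 62]
centers, clusters = kmeans_1d(ages, centers=[20, 30])
print(centers, clusters)
```

The values that are close to each other end up in the same group: one group of younger customers and one of older customers.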
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to identify the likelihood of a specific variable, given the presence
of other variables.
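A simple case of regression is fitting a straight line through pairs of values with the least-squares method. The sketch below assumes hypothetical advertising spend and sales figures; the function name fit_line is also an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


ad_spend = [1, 2, 3, 4]  # hypothetical values
sales = [3, 5, 7, 9]     # hypothetical values
slope, intercept = fit_line(ad_spend, sales)

# Use the fitted relationship to estimate sales for a new spend value.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```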
4. Association Rules:
This data mining technique helps to find the association between two or more Items. It discovers
a hidden pattern in the data set.
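Association rules are usually judged by two measures, support and confidence. The sketch below computes both for a rule "bread implies butter" on a hypothetical set of shopping baskets; the item names and the transactions are assumptions.

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in half of the baskets, and two out of three baskets with bread also contain butter.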
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which
do not match an expected pattern or expected behavior. This technique can be used in a variety
of domains, such as intrusion detection, fraud detection, or fault detection. Outlier detection
is also called Outlier Analysis or Outlier mining.
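One simple way to flag outliers is the z-score: how many standard deviations a value lies from the mean. This is only one of several possible approaches, and the threshold of 2.0, the function name find_outliers, and the sample card payments below are assumptions for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score is larger than the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card payments; one is far larger than the rest.
amounts = [50, 52, 49, 51, 48, 50, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```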
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques like trends, sequential
patterns, clustering, classification, etc. It analyzes past events or instances in the right
sequence for predicting a future event.
Example 2:
A bank wants to search for new ways to increase revenues from its credit card operations. It
wants to check whether usage would double if fees were halved.
The bank has multiple years of records on average credit card balances, payment amounts, credit
limit usage, and other key parameters. It creates a model to check the impact of the proposed
new business policy. The data results show that cutting fees in half for a targeted customer
base could increase revenues by $10 million.
Communications: Data mining techniques are used in the communication sector to predict customer
behavior in order to offer highly targeted and relevant campaigns.
Insurance: Data mining helps insurance companies to price their products profitably and promote
new offers to their new or existing customers.
Education: Data mining helps educators to access student data, predict achievement levels and
find students or groups of students who need extra attention, for example, students who are
weak in maths.
Manufacturing: With the help of data mining, manufacturers can predict wear and tear of
production assets. They can anticipate maintenance, which helps them to minimize downtime.
Banking: Data mining helps the finance sector to get a view of market risks and manage
regulatory compliance. It helps banks to identify probable defaulters to decide whether to issue
credit cards, loans, etc.
Retail: Data mining techniques help retail malls and grocery stores identify and arrange the
most sellable items in the most attentive positions. It helps store owners to come up with
offers which encourage customers to increase their spending.
Service Providers: Service providers like mobile phone and utility industries use data mining to
predict the reasons when a customer leaves their company. They analyze billing details, customer
service interactions, and complaints made to the company to assign each customer a probability
score and offer incentives.
E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their
websites. One of the most famous names is Amazon, which uses data mining techniques to get more
customers into its eCommerce store.
Super Markets: Data mining allows supermarkets to develop rules to predict if their shoppers
are likely to be expecting. By evaluating their buying patterns, they can find women customers
who are most likely pregnant. They can start targeting products like baby powder, baby shop,
diapers and so on.
Crime Investigation: Data mining helps crime investigation agencies to deploy the police
workforce (where is a crime most likely to happen and when?), decide who to search at a border
crossing, etc.
Bioinformatics: Data mining helps to mine biological data from massive datasets gathered in
biology and medicine.