Unit 7
Advanced Topics
Database Performance Tuning:
• Database performance tuning refers to the group of activities DBAs perform to ensure databases operate smoothly and efficiently. It re-optimizes a database system from top to bottom, from software to hardware, to improve overall performance.
• Tuning involves accelerating query response, improving indexing, deploying clusters, and reconfiguring the operating system so that each layer best supports system function and the end-user experience (a brief indexing example follows). MySQL and Oracle are prominent examples of database management systems (DBMS) on which DBAs generally perform database tuning.
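As a minimal sketch of one tuning step, the SQLite snippet below (table and column names are invented for illustration) shows how adding an index changes the query plan from a full table scan to an index search:

```python
import sqlite3

# In-memory database; a hypothetical "orders" table stands in for real data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(10_000)])

# Before indexing: the planner must scan the whole table.
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())

# Add an index on the filtered column, a typical first tuning step.
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After indexing: the plan switches to an index search instead of a full scan.
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())
```

Running it prints a SCAN plan before the index and a SEARCH ... USING INDEX plan after, which is the kind of before/after evidence a DBA looks for when tuning.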
Database Security
• Database security refers to the array of controls, tools, and procedures designed to ensure and safeguard confidentiality, integrity, and availability. This unit concentrates on confidentiality, because it is the component most at risk in data security breaches.
Database security must cover and safeguard the following aspects:
• The database containing the data.
• The database management system (DBMS).
• Any applications associated with the database (a small example follows this list).
• The physical or virtual database servers and the hardware that runs them.
• The computing or network infrastructure used to connect to the database.
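As one concrete application-level control for the third item above, here is a minimal sketch (using Python's built-in sqlite3 module and an invented users table) of parameterized queries, which protect confidentiality against SQL injection:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
con.execute("INSERT INTO users (name, salary) VALUES ('alice', 50000)")

user_input = "alice' OR '1'='1"  # a classic injection attempt

# Unsafe: string formatting lets the input rewrite the query and leak every row.
# rows = con.execute(f"SELECT * FROM users WHERE name = '{user_input}'").fetchall()

# Safe: the ? placeholder treats the input strictly as data, never as SQL.
rows = con.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)  # [] -- the injection string matches no real user
```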
Concept of Parallel and Distributed Databases
1. Parallel Database:
• A parallel DBMS is a DBMS that runs across multiple processors and is designed to execute operations in parallel whenever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine.
Features:
• Multiple CPUs work in parallel.
• It improves performance.
• It divides large tasks into smaller subtasks (see the sketch after this list).
• It completes work very quickly.
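A minimal sketch of this divide-and-conquer idea, using Python's multiprocessing module to aggregate partitions of an invented table in parallel (the data and the four-way partitioning are made up for the example):

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker aggregates its own fragment of the 'table'."""
    return sum(row["total"] for row in partition)

if __name__ == "__main__":
    # A toy table of 1,000 rows, split into 4 partitions (one per CPU).
    table = [{"id": i, "total": i * 2.0} for i in range(1000)]
    partitions = [table[i::4] for i in range(4)]

    with Pool(processes=4) as pool:
        # The large aggregation task is divided into smaller parallel subtasks...
        partials = pool.map(partial_sum, partitions)

    # ...and the partial results are combined, as a parallel DBMS would do.
    print(sum(partials))
```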
2. Distributed Database:
• A distributed database is defined as a logically related collection of shared data that is physically distributed over a computer network across different sites. A distributed DBMS is the software that manages the distributed database and makes the distributed data available to users.
Features:
• It is a collection of logically related shared data.
• The data is split into various fragments, as sketched below.
• Fragments may be replicated.
• The sites are linked by a communication network.
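A toy sketch of fragmentation and replication (the site names, the hashing rule, and the replication scheme are all invented for illustration):

```python
# Horizontal fragmentation: rows are assigned to sites by hashing a key,
# and each fragment is replicated to one backup site.
SITES = ["site_a", "site_b", "site_c"]

def fragment_for(key):
    return hash(key) % len(SITES)

storage = {site: [] for site in SITES}
rows = [{"id": i, "city": c} for i, c in enumerate(["Pune", "Delhi", "Goa", "Agra"])]

for row in rows:
    primary = fragment_for(row["id"])
    replica = (primary + 1) % len(SITES)        # simple replication scheme
    storage[SITES[primary]].append(row)          # primary fragment
    storage[SITES[replica]].append(dict(row))    # replicated copy

for site, fragment in storage.items():
    print(site, fragment)
```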
Concept of Data Warehousing and Data Mining
• Data Warehousing:
• It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed, rather than used for transaction processing. A data warehouse is designed to support the management decision-making process by providing a platform for data cleaning, data integration, and data consolidation. A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data. It consolidates data from many sources while ensuring data quality, consistency, and accuracy, and it improves system performance by separating analytics processing from transactional databases. Data flows into a data warehouse from the various operational databases. A data warehouse works by organizing data into a schema that describes the layout and type of data; query tools then analyze the data tables using that schema.
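As one common way of "organizing data into a schema", the hedged sketch below builds a tiny, hypothetical star schema (a fact table referencing dimension tables) in SQLite and runs an analytic query over it; all table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Dimension tables describe the "who/what/when".
con.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER)")

# The fact table holds the measures and points at the dimensions.
con.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL)""")

con.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "pen"), (2, "book")])
con.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2023), (2, 2024)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 10.0), (2, 1, 25.0), (2, 2, 40.0)])

# An analytic query: total sales per product per year.
for row in con.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.name, d.year"""):
    print(row)
```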
Advantages of Data Warehousing:
• The data warehouse's job is to make any form of corporate data easier to understand; the majority of the user's work consists of feeding in the raw data.
• The capacity to update continuously and frequently is the key benefit of this technology. As a result, data warehouses are ideal for organizations and entrepreneurs who want to stay current with their target audience and customers.
• It makes data more accessible to businesses and organizations.
• A data warehouse holds a large volume of historical data that users can use to evaluate different periods and trends in order to make predictions about the future.
Disadvantages of Data Warehousing:
• There is a great risk of accumulating irrelevant and useless data; data loss and erasure are other potential issues.
• Data is gathered from various sources into a data warehouse, so it must be cleansed and transformed before loading, which can be a difficult task.
Data Mining
• It is the process of finding patterns and correlations within large data
sets to identify relationships between data. Data mining tools allow a
business organization to predict customer behavior. Data mining tools
are used to build risk models and detect fraud. Data mining is used in
market analysis and management, fraud detection, corporate
analysis, and risk management.
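As a tiny, self-contained illustration of finding patterns and correlations, the sketch below counts how often pairs of items co-occur in invented shopping baskets, the core idea behind market-basket analysis:

```python
from itertools import combinations
from collections import Counter

# Invented transaction data: each basket is one customer's purchase.
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]

# Count co-occurring pairs across all baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen in at least 2 baskets suggest a purchasing pattern.
for pair, count in pair_counts.most_common():
    if count >= 2:
        print(pair, count)
```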
Advantages of Data Mining:
• Data mining aids in a variety of data analysis and sorting procedures. One of the best implementations here is the identification and detection of undesired faults in a system, which permits risks to be eliminated sooner.
• In comparison to other statistical data applications, data mining methods are both cost-effective and efficient.
• Companies can take advantage of this analytical tool by providing appropriate and easily accessible knowledge-based data.
Functions of data mining:
• Forecasting: Using historical data to predict future values or trends.
• Risk and probability: Assessing the likelihood and potential impact of negative events or outcomes.
• Recommendation: Providing suggestions or personalized items based on user preferences or past behavior.
• Grouping: Grouping similar items or data points together based on their characteristics (see the sketch after this list).
• Finding sequences: Analyzing the order of events or actions to identify patterns and predict future events.
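A hand-rolled sketch of the grouping function: a minimal k-means-style loop that clusters invented one-dimensional customer ages into two groups (real data mining tools use far more robust implementations):

```python
# 1-D customer ages (invented) are grouped into 2 clusters.
ages = [18, 21, 22, 45, 48, 50, 19, 47]
centers = [ages[0], ages[3]]  # naive initial centers

for _ in range(10):  # a few refinement iterations are enough here
    clusters = {0: [], 1: []}
    for age in ages:
        # Assign each point to its nearest center.
        nearest = min((0, 1), key=lambda c: abs(age - centers[c]))
        clusters[nearest].append(age)
    # Move each center to the mean of its assigned points.
    centers = [sum(c) / len(c) for c in clusters.values()]

print(clusters)  # e.g. young customers vs. middle-aged customers
print(centers)
```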
Disadvantages of Data Mining:
• Data mining isn't always 100 percent accurate, and if done incorrectly, it can lead to data breaches.
• Organizations must devote a significant amount of resources to training and implementation. Furthermore, data mining tools are built on different algorithms, so different tools work in different ways.
Big Data
• Big Data is a term used to denote a collection of data sets so large and complex that it is very difficult to process using legacy data processing applications.
• In other words, legacy or traditional systems cannot process such a large amount of data in one go. But how do you classify data as problematic and hard to process?
Types of Big Data
Big Data is essentially classified into three types:
• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
• Structured data is highly organized and is thus the easiest to work with. Its dimensions are defined by set parameters, and every piece of information is grouped into rows and columns, like a spreadsheet. Structured data covers quantitative values such as age, contact details, addresses, billing, expenses, and debit or credit card numbers.
Semi-structured Data
• Semi-structured data falls somewhere between structured and unstructured data. It is essentially unstructured data that has metadata attached to it. The metadata can be inherent, such as a location, time, email address, or device ID stamp, or it can be a semantic tag attached to the data later.
• Consider the example of an email. The time an email was sent, the
email addresses of the sender and the recipient, the IP address of the
device that the email was sent from, and other relevant information
are linked to the content of the email. While the actual content itself
is not structured, these components enable the data to be grouped in
a structured manner.
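To make the email example concrete, the sketch below (the addresses, IP, and content are made up) shows how the structured metadata wraps the unstructured body, for example as JSON:

```python
import json

# Hypothetical email: structured metadata around an unstructured body.
email = {
    "sent_at": "2024-05-01T09:30:00Z",   # structured: timestamp
    "from": "sender@example.com",        # structured: sender address
    "to": "recipient@example.com",       # structured: recipient address
    "source_ip": "192.0.2.10",           # structured: sending device's IP
    "body": "Hi, attaching the quarterly notes we discussed...",  # unstructured
}

# The metadata fields let emails be grouped, filtered, and queried
# even though the body text itself has no fixed structure.
print(json.dumps(email, indent=2))
```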
Unstructured Data
• Not all data is structured and well-sorted with instructions on how to use it. All such unorganized data is known as unstructured data.
• Much of the data generated by computers, such as free text, images, audio, and logs, is unstructured. The time and effort required to make unstructured data readable can be cumbersome, and datasets need to be interpretable to yield real value; but the process of making that happen can be all the more rewarding.
Characteristics of Big Data
1. Volume: This refers to tremendously large data. The volume of data is rising exponentially: in 2016 the data created was only 8 ZB, and it was expected to rise to about 40 ZB by 2020, which is extremely large.
2. Variety: One reason for this rapid growth of data volume is that data comes from different sources in various formats. We have already discussed how data is categorized into different types.
3. Velocity: The speed of data accumulation also plays a role in determining whether the data is big data or normal data.
4. Value: How is useful meaning extracted from the data? Here our fourth V comes in; it deals with the mechanism for bringing out the correct meaning of data. First the data is mined, i.e., raw data is turned into useful data; then analysis is done on the data that has been cleaned or retrieved from the raw data.
5. Veracity: Data can arrive incomplete or be lost in transit, forcing the process to start again from mining the raw data into valuable data, and there will also be uncertainties and inconsistencies in the data. These are addressed by veracity, which means the trustworthiness and quality of the data.
Application areas of Big Data:
• Education
• Social Media
• Banking
• Government
• E-commerce
NoSQL databases
• NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data. Unlike traditional relational databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
• The term NoSQL originally referred to “non-SQL” or “non-relational”
databases, but the term has since evolved to mean “not only SQL,” as
NoSQL databases have expanded to include a wide range of different
database architectures and data models.
NoSQL databases are generally classified into four
main categories:
• Document databases: These databases store data as semi-structured documents, such as JSON (JavaScript Object Notation) or XML, and can be queried using document-oriented query languages.
• Key-value stores: These databases store data as key-value pairs, and
are optimized for simple and fast read/write operations.
• Column-family stores: These databases store data as column families,
which are sets of columns that are treated as a single entity. They are
optimized for fast and efficient querying of large amounts of data.
• Graph databases: These databases store data as nodes and edges,
and are designed to handle complex relationships between data.
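To make the four categories concrete, here is a hedged sketch (the record and field names are invented) of how the same user record might look in each data model, using plain Python structures:

```python
# One record, four NoSQL data models (all names invented for illustration).

# 1. Document model: a self-contained, semi-structured document.
document = {"_id": "u1", "name": "Asha", "orders": [{"item": "pen", "qty": 2}]}

# 2. Key-value model: an opaque value looked up by key.
key_value = {"user:u1": '{"name": "Asha"}'}

# 3. Column-family model: rows keyed by id, columns grouped into families.
column_family = {"u1": {"profile": {"name": "Asha"}, "stats": {"orders": 1}}}

# 4. Graph model: nodes plus edges that carry the relationships.
nodes = {"u1": {"label": "User", "name": "Asha"},
         "p1": {"label": "Product", "name": "pen"}}
edges = [("u1", "BOUGHT", "p1")]

for model in (document, key_value, column_family, (nodes, edges)):
    print(model)
```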
Key Features of NoSQL :
• Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing
data structures without the need for migrations or schema alterations (see the sketch after this list).
• Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a
database cluster, making them well-suited for handling large amounts of data and high levels of
traffic.
• Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model,
where data is stored in a semi-structured format such as JSON or BSON.
• Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data
is stored as a collection of key-value pairs.
• Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model,
where data is organized into columns instead of rows.
• Distributed and high availability: NoSQL databases are often designed to be highly available and
to automatically handle node failures and data replication across multiple nodes in a database
cluster.
• Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic
manner, with support for multiple data types and changing data structures.
• Performance: NoSQL databases are optimized for high performance and can handle a high
volume of reads and writes, making them suitable for big data and real-time applications.
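As a hedged sketch of the dynamic-schema feature, the snippet below uses PyMongo (assuming a MongoDB server is running on localhost; the database and collection names are invented) to insert two documents with different shapes into the same collection:

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance; db/collection names are invented.
client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

# No schema migration needed: documents in one collection can differ in shape.
users.insert_one({"name": "Asha", "email": "asha@example.com"})
users.insert_one({"name": "Ravi", "phone": "+91-0000000000", "tags": ["vip"]})

# Queries simply match whatever fields a document happens to have.
for doc in users.find({"name": "Ravi"}):
    print(doc)
```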
Advantages of NoSQL:
• High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and its placement on multiple machines in such a way that the order of the data is preserved (see the sketch after this list). Vertical scaling means adding more resources to an existing machine, whereas horizontal scaling means adding more machines to handle the data; vertical scaling is not that easy to implement, but horizontal scaling is. Examples of horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can handle a huge amount of data because of this scalability: as the data grows, NoSQL scales itself to handle that data efficiently.
• Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which means that they can accommodate dynamic changes to the data model. This makes NoSQL databases a good fit for applications that need to handle changing data requirements.
• High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data is restored to its previous consistent state from a replica.
• Scalability: NoSQL databases are highly scalable, which means that they can handle large amounts of data and traffic with ease. This makes them a good fit for applications that need to handle large amounts of data or traffic.
• Performance: NoSQL databases are designed to handle large amounts of data and traffic, which means that they can offer improved performance compared to traditional relational databases.
• Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational databases, as they are typically less complex and do not require expensive hardware or software.
• Agility: Ideal for agile development.
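The sharding idea from the first advantage can be shown in a few lines: a hedged sketch of range-based partitioning (the key ranges and shard names are invented), which keeps the order of the data preserved across shards:

```python
# Range-based sharding sketch: keys are split into ordered ranges so that
# the overall order of the data is preserved across machines (shards).
SHARD_RANGES = [(0, 1000, "shard_0"), (1000, 2000, "shard_1"), (2000, 3000, "shard_2")]

def shard_for(key):
    for low, high, shard in SHARD_RANGES:
        if low <= key < high:
            return shard
    raise ValueError(f"key {key} outside all shard ranges")

shards = {"shard_0": [], "shard_1": [], "shard_2": []}
for user_id in [5, 999, 1500, 2100, 42]:
    shards[shard_for(user_id)].append(user_id)

# Reading shards in order yields the keys grouped into sorted ranges.
for name in ("shard_0", "shard_1", "shard_2"):
    print(name, sorted(shards[name]))
```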
Disadvantages of NoSQL:
• Lack of standardization: There are many different types of NoSQL databases, each with its own unique strengths and weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
• Lack of ACID compliance: Most NoSQL databases are not fully ACID-compliant, which means that they do not guarantee the consistency, integrity, and durability of data. This can be a drawback for applications that require strong data-consistency guarantees.
• Narrow focus: NoSQL databases have a very narrow focus, as they are mainly designed for storage and provide relatively little other functionality. Relational databases are a better choice for transaction management than NoSQL.
• Open-source: NoSQL databases are open-source, and there is no reliable standard for NoSQL yet; in other words, two database systems are likely to be unequal.
• Lack of support for complex queries : NoSQL databases are not designed to
handle complex queries, which means that they are not a good fit for applications
that require complex data analysis or reporting.
• Lack of maturity : NoSQL databases are relatively new and lack the
maturity of traditional relational databases. This can make them less
reliable and less secure than traditional databases.
• Management challenge : The purpose of big data tools is to make the
management of a large amount of data as simple as possible. But it is not
so easy. Data management in NoSQL is much more complex than in a
relational database. NoSQL, in particular, has a reputation for being
challenging to install and even more hectic to manage on a daily basis.
• GUI is not available: GUI-mode tools for accessing the database are not readily available in the market.
• Backup: Backup is a great weak point for some NoSQL databases like MongoDB, which has no built-in approach for backing up data in a consistent manner.
• Large document size: Some database systems, like MongoDB and CouchDB, store data in JSON format. This means documents are quite large, which at big-data scale costs network bandwidth and speed, and having descriptive key names actually hurts, since they increase the document size.
When should NoSQL be used:
• When a huge amount of data needs to be stored and retrieved.
• When the relationships between the data you store are not that important.
• When the data changes over time and is not structured.
• When support for constraints and joins is not required at the database level.
• When the data is growing continuously and you need to scale the database regularly to handle it.