
Course: MSc DS
Advanced Database Management Systems
Module: 5
Learning Objectives:

1. Grasp SQL and NoSQL integration concepts.
2. Master performance tuning for databases like MongoDB, HBase, and Neo4j.
3. Develop effective ETL process skills.
4. Recognize and counter NoSQL vulnerabilities.
5. Understand GDPR and related database compliances.
6. Stay updated on new database technologies and their roles in AI and Big Data.

Structure:

5.1 Integrated Use of Different Database Systems
5.2 Database Optimization
5.3 Security and Privacy in Database Systems
5.4 Emerging Trends in Database Systems
5.5 Summary
5.6 Keywords
5.7 Self-Assessment Questions
5.8 Case Study
5.9 References

5.1 Integrated Use of Different Database Systems

Polyglot Persistence is the concept of using multiple data storage technologies to handle different data storage needs within a given application or system. Instead of adopting a one-size-fits-all approach, systems can utilise the best storage mechanism for their diverse needs, resulting in optimised performance, scalability, and flexibility.

What is Polyglot Persistence?

● Definition: Polyglot Persistence is the practice of using different database systems within a single application based on the specific needs and characteristics of the data.

● Origin: The term "polyglot" is derived from linguistics, where it refers to someone who speaks multiple languages. Similarly, in database management, it signifies the use of multiple storage mechanisms.


Benefits and Challenges of Polyglot Persistence

● Benefits:
  o Optimised Performance: Selecting the most appropriate database system can lead to faster query times and more efficient data handling.
  o Flexibility: Different types of data (e.g., hierarchical, relational, or document-based) can be stored in the most suitable format.
  o Scalability: Systems can scale more effectively by deploying specific databases best suited for specific tasks.

● Challenges:
  o Increased Complexity: Managing multiple database systems can introduce architectural complexities.
  o Operational Overhead: Each database might require its own set of operational procedures, backup strategies, and monitoring tools.
  o Integration Challenges: Ensuring data consistency and integration across different systems can be demanding.

SQL vs. NoSQL: Bridging the Gap

● Differences and Similarities:
  o SQL (Structured Query Language) databases:
    ▪ Typically relational.
    ▪ Utilise fixed schemas.
    ▪ Focus on ACID (Atomicity, Consistency, Isolation, Durability) properties.
  o NoSQL databases:
    ▪ Include document, key-value, columnar, and graph databases.
    ▪ Often schema-less or use flexible schemas.
    ▪ May prioritise BASE (Basically Available, Soft State, Eventually Consistent) properties over ACID.

● Cases for Integrating SQL and NoSQL (see the sketch after this list):
  o Hybrid Systems: Applications may require transactional data to be stored in SQL and user profiles in NoSQL.
  o Performance Optimization: Certain read-heavy operations might be offloaded to NoSQL systems, while transactional operations remain in SQL databases.
  o Data Variety: Integrating relational and non-relational data can provide richer analytics and insights.
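A minimal sketch of such a hybrid setup, using Python's built-in sqlite3 module for the transactional side and pymongo for the document side. It assumes a MongoDB server running locally; the database, collection, and field names are illustrative assumptions, not part of the course material.

```python
# Hybrid persistence sketch: orders in SQL, user profiles in NoSQL.
# Assumes a local MongoDB instance; all names are illustrative only.
import sqlite3
from pymongo import MongoClient

# Relational side: transactional order data with ACID guarantees.
sql = sqlite3.connect("shop.db")
sql.execute("CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, user_id TEXT, amount REAL)")
with sql:  # wraps the insert in a transaction
    sql.execute("INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                ("u42", 199.99))

# Document side: flexible, schema-less user profile data.
mongo = MongoClient("mongodb://localhost:27017/")
profiles = mongo["shop"]["profiles"]
profiles.update_one(
    {"_id": "u42"},
    {"$set": {"name": "Asha", "interests": ["handloom", "pottery"]}},
    upsert=True,
)
```

The design choice here mirrors the bullet above: the billing-critical write goes through a transactional SQL store, while the frequently changing, loosely structured profile lives in a document store.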

Data Integration Fundamentals

Importance of Data Integration:

● Ensures consistency and accuracy of data across multiple systems.
● Enables comprehensive data analytics and business intelligence.
● Facilitates the sharing of data across various business units.

Key Principles and Approaches:

● Data Consolidation: Combining data from various sources into a single repository (a small sketch follows this list).
● Data Propagation: Synchronising data across systems, either in real time or in batches.
● Data Virtualization: Providing a unified interface to view data across various sources without consolidating it physically.
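As a small illustration of data consolidation, the hedged sketch below unions two source extracts into one repository using pandas; the sources, column names, and sample values are assumptions made for illustration.

```python
# Data consolidation sketch: combine two source extracts into one table.
# Column names and sample data are illustrative assumptions.
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2], "email": ["a@x.in", "b@x.in"]})
billing = pd.DataFrame({"customer_id": [2, 3], "email": ["b@x.in", "c@x.in"]})

# Union the sources and de-duplicate on the shared key.
consolidated = (
    pd.concat([crm, billing], ignore_index=True)
      .drop_duplicates(subset="customer_id", keep="last")
)
print(consolidated)
```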

ETL Processes in Modern Databases

ETL Process Stages: Extract, Transform, Load:

● Extract: Data is retrieved from various source systems.
● Transform: Data is cleansed, enriched, and transformed into a desired format.
● Load: Transformed data is loaded into the target database or data warehouse.

Tools and Best Practices for ETL:

● Tools: Popular ETL tools include Apache NiFi, Talend, Informatica PowerCenter, and Microsoft SQL Server Integration Services (SSIS).
● Best Practices (a worked sketch follows this list):
  ▪ Ensure data quality checks during transformation.
  ▪ Implement error-handling mechanisms.
  ▪ Schedule ETL jobs during off-peak hours to minimise system disruptions.
  ▪ Keep source data unchanged to facilitate troubleshooting and data recovery.
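A minimal ETL sketch using only Python's standard library, following the practices above: the source file is never modified, malformed records are handled, and a basic quality check runs during transformation. The file name, fields, and target schema are illustrative assumptions.

```python
# Minimal ETL sketch: CSV source -> cleaned rows -> SQLite target.
# File name, fields, and the quality rule are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV source (source file stays unchanged)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse and reshape, with a simple data quality check."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # error handling: skip malformed records
        clean.append((row["order_id"].strip(), amount))
    return clean

def load(rows, db="warehouse.db"):
    """Load: write transformed rows into the target database."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL)")
    with con:  # single transaction for the whole batch
        con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.close()

load(transform(extract("orders.csv")))
```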

5.2 Database Optimization

Database performance tuning refers to the process of improving the speed and efficiency of a database system. This often entails analysing, troubleshooting, and adjusting resources and configurations to improve query response times and overall system performance.

Need for Performance Optimization

● Rapid Data Growth: With the increasing volume of data, there is a pressing need to ensure databases perform efficiently.
● User Expectations: Users anticipate quick response times. Delays can lead to a poor user experience or even lost business opportunities.
● Resource Management: Efficient databases consume fewer resources, which can result in cost savings and better resource utilisation.

Evaluating Database Performance

Evaluating database performance involves monitoring various metrics such as query execution time, resource utilisation, and throughput. Tools like Oracle's AWR or SQL Server Profiler can aid in this.

● Baseline Measurement: Before making any adjustments, establish a baseline to understand the current performance.
● Identifying Bottlenecks: Pinpoint the areas where the system is most constrained to prioritise optimization efforts.
● Continuous Monitoring: After optimization, continuous monitoring is essential to ensure sustained performance.

Optimization Techniques Specific to Databases

● Query Optimization: Modify queries to improve efficiency. This can involve rewriting SQL statements, using joins effectively, or optimising subqueries (see the sketch after this list).
● Hardware Optimization: Adjusting hardware configurations, such as adding RAM or increasing storage bandwidth.
● Database Design: Effective normalisation, appropriate use of indexes, and table partitioning.
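As a hedged illustration of query optimization, the sketch below uses SQLite's EXPLAIN QUERY PLAN to show how adding an index turns a full table scan into an index search; the table, data, and index name are illustrative assumptions.

```python
# Query optimisation sketch: compare a query plan before and after indexing.
# Uses SQLite's EXPLAIN QUERY PLAN; table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT, amount REAL)")
con.executemany("INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                [(f"u{i % 100}", i * 1.0) for i in range(10_000)])

query = "SELECT SUM(amount) FROM orders WHERE user_id = ?"

# Without an index the plan reports a full table scan.
print(con.execute("EXPLAIN QUERY PLAN " + query, ("u7",)).fetchall())

# With an index the planner can seek directly to the matching rows.
con.execute("CREATE INDEX idx_orders_user ON orders(user_id)")
print(con.execute("EXPLAIN QUERY PLAN " + query, ("u7",)).fetchall())
```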

MongoDB: Indexing and Aggregation Techniques

● Indexing: MongoDB uses B-tree structures. Proper indexing can significantly improve query performance.
● Aggregation: MongoDB offers an aggregation framework that processes data records and returns computed results. A short sketch of both techniques follows.
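The hedged sketch below shows both techniques with pymongo, assuming a locally running MongoDB; the database, collection, and field names are illustrative assumptions.

```python
# MongoDB indexing and aggregation sketch (assumes a local MongoDB server;
# database, collection, and field names are illustrative assumptions).
from pymongo import ASCENDING, MongoClient

orders = MongoClient("mongodb://localhost:27017/")["shop"]["orders"]

# A compound B-tree index to support queries filtering on user and date.
orders.create_index([("user_id", ASCENDING), ("created_at", ASCENDING)])

# Aggregation pipeline: total spend per paid user, highest first.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$user_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["total"])
```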

HBase: Tuning Read and Write Operations

● Bloom Filters: Enhance read performance by avoiding unnecessary disk lookups.
● WAL (Write Ahead Logging): Optimise WAL settings to balance durability and write performance. A hedged sketch of both settings follows.
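A hedged sketch using the happybase client. It assumes an HBase Thrift server on localhost; the table, column family, and values are illustrative assumptions, and the option names follow happybase's documented interface.

```python
# HBase tuning sketch via the happybase client (assumes a Thrift server on
# localhost; table and column names are illustrative assumptions).
import happybase

conn = happybase.Connection("localhost")

# Row-level Bloom filter on the column family, so reads can skip
# store files that cannot contain the requested row.
conn.create_table("events", {"cf": dict(bloom_filter_type="ROW")})

table = conn.table("events")

# Durable write: goes through the write-ahead log (the default).
table.put(b"row1", {b"cf:type": b"click"})

# Faster but less durable write: skips the WAL, trading safety for speed.
table.put(b"row2", {b"cf:type": b"view"}, wal=False)
```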

Cassandra: Query Optimization and Data Modeling

● Data Modeling: Use denormalization, appropriate primary and secondary indexes, and well-designed partition keys.
● Tuning Read/Write Paths: Adjusting settings like compaction, caching, and consistency levels to fit the workload. A hedged sketch follows.
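A hedged sketch using the DataStax Python driver, assuming a local Cassandra node; the keyspace, table, and consistency choice are illustrative assumptions.

```python
# Cassandra data-modelling and consistency sketch using the DataStax driver
# (assumes a local Cassandra node; keyspace and table names are illustrative).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS shop "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)

# Partition key user_id keeps one user's orders together on one node;
# the clustering column order_date supports efficient range reads.
session.execute(
    "CREATE TABLE IF NOT EXISTS shop.orders_by_user ("
    "  user_id text, order_date timestamp, amount decimal,"
    "  PRIMARY KEY (user_id, order_date))"
)

# Per-query consistency tuning: QUORUM trades some latency for stronger reads.
stmt = SimpleStatement(
    "SELECT * FROM shop.orders_by_user WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(stmt, ("u42",))
```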

Neo4j: Efficient Graph Traversal and Indexing

● Graph Traversal: Optimise traversals by considering the pattern and depth of traversal.
● Indexing: Neo4j uses both native and schema indexes to enhance search performance on nodes and relationships. A hedged sketch follows.
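A hedged sketch using the official neo4j Python driver, assuming a local Neo4j instance and Neo4j 4.x index syntax; the credentials, labels, and property names are illustrative assumptions.

```python
# Neo4j indexing and traversal sketch using the official Python driver
# (assumes a local Neo4j instance; credentials and labels are illustrative).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Schema index so the MATCH below starts from an index seek, not a scan.
    session.run("CREATE INDEX person_name IF NOT EXISTS "
                "FOR (p:Person) ON (p.name)")

    # Bounded traversal: friends-of-friends up to depth 2 keeps the
    # search space small compared with an unbounded pattern.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(friend) "
        "RETURN DISTINCT friend.name AS name",
        name="Asha",
    )
    print([record["name"] for record in result])

driver.close()
```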

Database Scalability Strategies

Scalability ensures that as the database grows, it continues to meet performance expectations.

● Vertical Scaling: Involves adding resources, like RAM or CPU, to a single server.
● Horizontal Scaling: Distributes the database across multiple servers.

Introduction to Sharding

Sharding is a method of distributing data across multiple servers, where each server holds a portion of the data. It enables horizontal scaling and enhances performance.

● Range-based Sharding: Dividing data based on a range of values.
● Hash-based Sharding: Using a hash function to determine the shard for a data item (see the sketch after this list).
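A minimal sketch of hash-based sharding in plain Python; the shard count and keys are illustrative assumptions.

```python
# Hash-based sharding sketch: a stable hash maps each key to one of N shards.
# Shard names and keys are illustrative assumptions.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    """Pick a shard deterministically from the key's hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for user in ["u42", "u43", "u44"]:
    print(user, "->", shard_for(user))
```

Note that with this simple modulo scheme, changing the shard count remaps most keys; production systems often use consistent hashing to limit that movement.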

Principles of Replication

Replication involves creating copies of a database on different servers, enhancing data availability and redundancy.

● Master-Slave Replication: The master server is responsible for writes, and changes are propagated to slave servers, which handle reads.
● Multi-Master Replication: Multiple servers can accept write operations, synchronising data between them.

Sharding vs. Replication: When to Use Which

● Sharding: Preferred when there is a need for horizontal scaling to handle large volumes of data and distribute the load. Especially beneficial for write-heavy workloads.
● Replication: Best suited for improving data availability and read-heavy workloads. It provides redundancy and can be combined with sharding for robust scalability.

5.3 Security and Privacy in Database Systems

Securing NoSQL Databases

NoSQL databases have gained popularity due to their flexibility, scalability, and performance benefits. Unlike traditional relational databases, NoSQL databases do not use a fixed schema, which allows for agile development and the ability to store a variety of data structures. However, this flexibility can lead to unique security challenges.

● Authentication and Authorization: Ensure robust user authentication, preferably supporting multi-factor authentication. Set fine-grained access controls, so users only have access to the data and functions they need (a hedged connection sketch follows this list).
● Data at Rest Encryption: Encrypt the stored data to protect it from theft or unauthorised access. Advanced encryption standards, such as AES, should be used.
● Data in Transit Encryption: Use secure transmission methods, like SSL/TLS, to protect data when it is being transferred between the server and client or between database nodes.
● Regular Backups: Maintain routine backup procedures and test restoration processes to ensure data can be recovered in the event of data loss or corruption.
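A hedged sketch of an authenticated, TLS-encrypted MongoDB connection using pymongo; the host, credentials, and certificate path are illustrative assumptions.

```python
# Secured MongoDB connection sketch: authenticated user plus TLS in transit.
# Host, credentials, and CA path are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db.internal.example:27017/",
    username="app_user",          # authenticate as a least-privilege user
    password="s3cret",            # in practice, read this from a secret store
    authSource="admin",
    tls=True,                     # encrypt data in transit
    tlsCAFile="/etc/ssl/certs/internal-ca.pem",
)
print(client.admin.command("ping"))
```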

Common Vulnerabilities in NoSQL

NoSQL databases have their own set of vulnerabilities, distinct from traditional SQL databases.

● Injection Attacks: Unlike SQL databases, NoSQL databases are not vulnerable to SQL injection. However, they are prone to other types of injection attacks, especially if user inputs are not correctly sanitised (a sketch follows this list).
● Insecure APIs: Some NoSQL databases expose insecure APIs which can be exploited if not properly secured or patched.
● Exposure of Sensitive Data: Misconfigurations can lead to unnecessary data exposure, making sensitive information easily accessible.
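A hedged sketch of a classic operator-injection pattern against MongoDB, and one way to guard against it; the collection and field names are illustrative assumptions.

```python
# NoSQL operator-injection sketch against MongoDB: an attacker who controls
# raw JSON can smuggle query operators. Names are illustrative assumptions.
from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017/")["shop"]["users"]

# Attacker sends {"$ne": null} instead of a password string.
malicious_password = {"$ne": None}

# VULNERABLE: passing untrusted input straight into the query lets the
# $ne operator match any password, bypassing the check entirely.
users.find_one({"user": "admin", "password": malicious_password})

# SAFER: reject non-scalar input before it reaches the query.
def login(user, password):
    if not isinstance(password, str):
        raise ValueError("password must be a string")
    return users.find_one({"user": user, "password": password})
```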

Best Practices for NoSQL Database Security

● Limit Exposure: Ensure that the database is not directly exposed to the internet. Using firewalls, VPNs, and VPCs can help reduce the risk.
● Regular Patching: Always keep your NoSQL database up to date with the latest patches and security updates.
● Monitoring and Alerts: Set up monitoring tools to detect unusual activities and send alerts in real time.

Regulatory Landscape in Data Management

With the rise in cyber threats, regulatory bodies around the world have introduced regulations to protect user data.

Introduction to GDPR

The General Data Protection Regulation (GDPR) is a European Union (EU) regulation that came into force in 2018 and affects any organisation that deals with the personal data of EU citizens.

● Rights of Individuals: GDPR emphasises the rights of individuals, including the right to access, correct, and delete personal data.
● Data Breach Notification: Organisations are required to notify the supervisory authority within 72 hours of discovering a data breach, and affected individuals without undue delay when the breach poses a high risk to them.

Implications of GDPR on Database Management

For database managers:

● Data Minimization: Only the essential data should be stored. Data that is not necessary for the intended purpose should be eliminated.
● Privacy by Design: Systems should be designed from the outset to protect user data. This means considering privacy at every stage of database design and implementation.
● Regular Audits: Regular audits and reviews should be conducted to ensure compliance.

Other Global Data Protection Regulations

Apart from GDPR, there are other data protection regulations worldwide:

● CCPA (California Consumer Privacy Act): Focuses on the rights of California residents regarding their personal data.
● PDPA (Personal Data Protection Act): Regulations in countries like Singapore that offer protections similar to GDPR.

Ensuring Database Compliance

● Data Masking: For testing environments, mask data to ensure sensitive information is not exposed (a masking sketch follows this list).
● Access Controls: Ensure that only authorised individuals have access to the data.
● Data Retention Policies: Clearly define and implement policies regarding how long data is retained and when it is deleted.
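A minimal data-masking sketch in plain Python, pseudonymising identifiers and redacting names before data reaches a test environment; the field names and salt are illustrative assumptions.

```python
# Data-masking sketch for test environments: deterministic pseudonyms for
# identifiers, redaction for free-text PII. Field names are illustrative.
import hashlib

def pseudonymise(value: str, salt: str = "test-env-salt") -> str:
    """Replace an identifier with a stable, irreversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_record(record: dict) -> dict:
    masked = dict(record)
    masked["email"] = pseudonymise(record["email"]) + "@example.invalid"
    masked["name"] = "REDACTED"
    return masked

print(mask_record({"name": "Asha Rao", "email": "asha@x.in", "order": 17}))
```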

Auditing and Monitoring Database Access

● Logging: Keep logs of all database access, modifications, and deletions (a minimal sketch follows this list).
● Regular Reviews: Periodically review the logs to detect any unauthorised or suspicious activity.
● Alerts: Set up automated systems to alert administrators of suspicious activities.
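A minimal audit-logging sketch in plain Python, recording who performed which operation and when; the wrapped operation and log format are illustrative assumptions.

```python
# Minimal audit-logging sketch: record who touched what, and when.
# The wrapped operation and log format are illustrative assumptions.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
audit = logging.getLogger("db.audit")

def audited(action):
    """Decorator that writes an audit-trail entry around a DB operation."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user, *args, **kwargs):
            start = time.time()
            result = fn(user, *args, **kwargs)
            audit.info("user=%s action=%s args=%s took=%.3fs",
                       user, action, args, time.time() - start)
            return result
        return inner
    return wrap

@audited("delete_record")
def delete_record(user, record_id):
    ...  # the real database call would go here

delete_record("admin", 42)
```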

Compliance Tools and Strategies

● Automated Compliance Tools: Use tools that automatically assess and ensure database compliance against various regulations.
● Training: Ensure that all database administrators and related personnel are trained on the latest regulations and best practices.
● Third-Party Audits: Regularly hire external experts to audit and ensure the system's compliance.

5.4 Emerging Trends in Database Systems

With the rapid growth of data and the increased need for efficient storage and processing, there have been numerous advances in database technologies. These advances aim to address the challenges posed by modern applications, such as real-time processing, IoT, AI, and big data analytics.

● Multi-Model Databases: Unlike traditional databases that support a single data model, multi-model databases are designed to support multiple data models, including document, key-value, graph, and columnar. This enables more flexibility in data storage and query optimization.
● Database as a Service (DBaaS): With the increasing adoption of cloud computing, many organisations prefer database services hosted in the cloud, reducing the overhead of database management, setup, and maintenance.

Innovations in Storage and Retrieval

Storage and retrieval mechanisms have evolved to accommodate the varying needs of modern applications and immense data loads.

● In-memory Databases: These databases store data in RAM rather than on disk drives, allowing for faster data retrieval and transaction processing.
● Columnar Storage: Instead of storing data row by row, data is stored column by column, enabling faster query performance and better compression (see the sketch after this list).
● Automated Data Tiering: Databases automatically move data between high-speed and low-speed storage media based on the frequency of access.
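A small plain-Python sketch contrasting the two layouts; the data is illustrative. Scanning one column touches far less data in the columnar layout, and runs of repeated values compress well.

```python
# Columnar-layout sketch: the same records stored row-wise and column-wise.
# Sample data is an illustrative assumption.
rows = [
    {"city": "Pune", "sales": 10},
    {"city": "Pune", "sales": 12},
    {"city": "Agra", "sales": 7},
]

# Row store: every whole row is read even when only 'sales' is needed.
total_row_store = sum(row["sales"] for row in rows)

# Column store: each column is a contiguous array, so an aggregate
# over one column touches only that column's data.
columns = {"city": ["Pune", "Pune", "Agra"], "sales": [10, 12, 7]}
total_column_store = sum(columns["sales"])

# Repeated values in a column also invite run-length encoding,
# e.g. city -> [("Pune", 2), ("Agra", 1)].
```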

New Database Architectures and Models

● Distributed Databases: Designed to run on multiple machines, these databases ensure data availability and fault tolerance.
● NoSQL Databases: Emphasising scalability and flexibility, NoSQL databases support schema-less data structures.
● Graph Databases: These databases are specifically designed for data sets that are best represented as graphs, facilitating queries that map relationships.

Future of Databases in the AI Era

AI is increasingly influencing the way databases are designed, managed, and operated.

● Self-optimising Databases: Using AI algorithms, databases can now tune themselves for optimal performance.
● Natural Language Processing (NLP): Databases can understand and process user queries given in natural language, enhancing user interaction.

Role of Databases in Machine Learning and AI

● Data Lakes: They store vast amounts of raw data in its native format, aiding ML models by providing abundant training data.
● Feature Stores: Centralised repositories for ML features, aiding in faster model training and deployment.

Optimising Databases for AI Workloads

● GPU-accelerated Databases: With AI workloads being computation-intensive, databases leveraging GPUs offer faster data processing.
● Data Lineage and Provenance: Tracking data's origin and transformations, which is crucial for AI model transparency and reproducibility.

Big Data: Challenges and Opportunities

● Volume, Velocity, and Variety: Handling the 3Vs of big data poses challenges in terms of storage, processing speed, and data heterogeneity.
● Real-time Analytics: The need for instantaneous insights requires databases to support real-time processing.

Evolving Nature of Big Data

● Edge Computing: Processing data closer to its source, often in IoT devices, to reduce latency.
● Data Governance and Privacy: With increased scrutiny on data handling, there's a strong emphasis on ensuring data privacy and compliance.

Database Solutions for Handling Big Data

● Massively Parallel Processing (MPP) Databases: These systems divide and conquer big data tasks by distributing them across a cluster of nodes.
● Data Warehousing Solutions: Platforms like Snowflake and Redshift are optimised for complex queries on large datasets.

5.5 Summary

❖ Polyglot Persistence: a strategy that leverages multiple database technologies, ensuring that the data storage method aligns best with individual data needs, emphasising the blend of SQL and NoSQL databases.

❖ Data Integration and ETL: essential processes to combine data from different sources into a single, unified view. ETL (Extract, Transform, Load) represents a standard method to fetch, refine, and store this data.

❖ Database Optimization: techniques and strategies to improve database performance and efficiency, focusing on tuning various databases like MongoDB, HBase, Cassandra, and Neo4j, and understanding scalability solutions such as sharding and replication.

❖ NoSQL Security: with the rise of NoSQL databases, unique vulnerabilities emerge. This topic underscores the importance of safeguarding these databases from potential threats and security breaches.

❖ Compliance: a deep dive into data protection regulations, with an emphasis on the General Data Protection Regulation (GDPR). It highlights the necessity for databases to conform to global and regional data protection mandates.

❖ Emerging Trends: exploring the latest innovations in the database domain, understanding how Big Data and Artificial Intelligence are reshaping the future of database management systems.

5.6 Keywords

● Polyglot Persistence: This term refers to the strategy of using multiple data storage technologies to handle different data storage needs within a single application. Instead of using a single database solution for all data-related requirements, organisations utilise the best-suited database for each specific task. For example, a relational database might be used for transactional data while a NoSQL database like MongoDB could be used for logging or caching.

● ETL Processes: ETL stands for Extract, Transform, and Load. It's a process that involves extracting data from source systems, transforming it into a format suitable for analysis, and then loading it into a target data repository, typically a data warehouse. This process is crucial for integrating data from various sources, ensuring data quality, and preparing data for analytics.

● Database Sharding: Sharding is a method used to distribute data across multiple servers or databases. Each individual database in such a configuration is referred to as a "shard". Sharding is used to improve performance and ensure that systems remain scalable. Each shard operates independently, so operations on one shard don't affect operations on another.

● GDPR: The General Data Protection Regulation (GDPR) is a regulation introduced by the European Union to protect the personal data and privacy of its citizens. It has profound implications for companies, as they need to ensure that they handle, process, and store personal data in compliance with the regulation or face significant fines.

● Neo4j: Neo4j is a popular graph database management system. Unlike relational databases, which store data in tables, graph databases like Neo4j store data in nodes and relationships, making them particularly well-suited for handling interconnected data, like social networks or recommendation systems.

● Big Data: Big Data refers to extremely large datasets that can be analysed to reveal patterns, trends, and associations, especially in relation to human behaviour and interactions. The sheer volume, velocity, and variety of big data present challenges in capturing, storing, analysing, and managing it. Advanced database systems, including NoSQL databases, distributed systems, and cloud platforms, have emerged to address these challenges.

5.7 Self-Assessment Questions

1. How do SQL and NoSQL databases differ, and in which scenarios might an integrated approach be beneficial?
2. What are the primary stages of the ETL process, and why are they crucial for data integration in modern databases?
3. Which performance tuning techniques are best suited for optimising read operations in HBase?
4. What are the main security vulnerabilities commonly associated with NoSQL databases, and how can they be mitigated?
5. How do recent advances in database technologies cater to the needs of Big Data and AI applications?

5.8 Case Study

Title: Streamlining E-Commerce with Advanced Database Management in India

Introduction:

India, with its rapid technological adoption, has witnessed a surge in e-commerce platforms. Amid this surge was 'ShopBharat', an emerging e-commerce platform aiming to highlight indigenous products. With an inventory spanning millions of items, ShopBharat faced challenges in database management, which affected the platform's speed and user experience.

Background:

In the initial days, ShopBharat used a single relational database management system (RDBMS) to manage its burgeoning inventory, user details, and transaction data. However, as the user base grew and the product listings expanded, the platform began to experience sluggish response times and occasional downtimes.

To tackle this, the data science team at ShopBharat adopted a polyglot persistence approach. They introduced a NoSQL database for handling the catalogue and user activity data, while the transactional data remained on the RDBMS. This meant that while the structured order and billing data was managed with SQL, the high-velocity and diverse product data was managed using NoSQL.

The results were dramatic. Not only did the site's performance improve, but the new system also allowed for easier scalability. ShopBharat could now add new products and features without worrying about significant system overhauls. Moreover, the use of NoSQL allowed them to introduce features like real-time product recommendations and faster search functionalities.

Additionally, to ensure data security, especially with growing concerns about data breaches globally, the team implemented advanced encryption methods and ensured GDPR-like compliance, even though India's data protection framework was still evolving.

This move not only salvaged ShopBharat's reputation but also positioned it as a leader in technological innovation among e-commerce platforms in India. Today, many Indian startups look to ShopBharat's database management strategy as a gold standard for handling vast amounts of data efficiently.

Questions:

1. How did ShopBharat address its challenges in database management as the platform grew?
2. What benefits did the NoSQL database bring to ShopBharat's user experience and functionality?
3. How did ShopBharat ensure data security and regulatory compliance in its advanced database management system?

5.9 References

1. "Designing Data-Intensive Applications" by Martin Kleppmann
2. "Database System Concepts" by Abraham Silberschatz, Henry F. Korth, and S. Sudarshan
3. "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence" by Pramod J. Sadalage and Martin Fowler
4. "MongoDB: The Definitive Guide" by Kristina Chodorow
5. "Neo4j in Action" by Jonas Partner, Aleksa Vukotic, and Nicki Watt
