
Course: MSc DS
Advanced Database Management Systems
Module: 1

Preface

In the age of information, the magnitude, speed, and diversity of

data generated have shifted the paradigms of traditional database

management. With the inexorable march of technology and our

growing reliance on digital platforms, the need to understand,

manage, and leverage this vast expanse of data has never been

more pressing. This book, tailored for the Master of Science in

Data Science course, seeks to serve as a comprehensive guide to

the universe of advanced database management systems.

Beginning with the foundational concepts of Big Data, we delve

into the world of NoSQL databases, elucidating the significant

departure they mark from their SQL counterparts. Through the

modules, we explore various NoSQL systems including MongoDB,

HBase, Cassandra, and Neo4j, each elucidated with their unique

characteristics, data models, and use cases.

But beyond just introducing these systems, this book aims to

foster a deep understanding, facilitating the practical application

of these databases in real-world scenarios. Our focus extends to


the challenges faced in today's digital landscape, from database

optimization to ensuring robust security measures.

As you embark on this journey, our hope is to not only equip you

with the technical knowledge but also to kindle a sense of

curiosity and appreciation for the vast possibilities that these

advanced database systems present in the realm of data science.

Learning Objectives:

1. Define and differentiate the five V's of Big Data.

2. Trace Big Data's evolutionary timeline and key milestones.

3. Understand Big Data's impact across diverse sectors.

4. Familiarise yourself with core Big Data technologies and infrastructure.

5. Grasp the importance of Big Data security and privacy.

6. Identify challenges and anticipate emerging Big Data trends.

Structure:

1.1 Understanding the Basics of Big Data


1.2 Evolution and Impact of Big Data

1.3 Big Data Technology and Infrastructure

1.4 Challenges and Future Prospects of Big Data

1.5 Summary

1.6 Keywords

1.7 Self-Assessment Questions

1.8 Case Study

1.9 References

1.1 Understanding the Basics of Big Data

Big Data, as the term suggests, involves managing and analysing

vast volumes of data. In Advanced Database Management

Systems, comprehending the scope and significance of Big Data is

fundamental. It's not just about the quantity of the data, but the

multifaceted nature of the data as well. By understanding its

different dimensions, one can harness its true potential and value.

Scope and Significance

In the modern digital age, data has become an invaluable asset for
industries, researchers, and governments. Its application extends

from personalised marketing to predicting pandemics, from

enhancing user experiences to streamlining supply chains. At the

heart of these applications is the management and interpretation

of vast amounts of data, which is where Big Data comes into play.

● Rapid Expansion: The sheer volume of data being generated

today is unprecedented. Devices, social media platforms,

sensors, and more, are producing zettabytes of data.

● Enhanced Decision-Making: Leveraging Big Data allows

organisations to make more informed decisions by analysing

patterns, trends, and relationships.

● Innovative Applications: Big Data has paved the way for

innovative applications in sectors like healthcare, finance,

and transportation, to name just a few.

The Multifaceted V's of Big Data

To effectively grasp the intricacies of Big Data, one must

familiarise oneself with the various 'V's associated with it:

● Volume: Grasping the Scale of Data – This pertains to the sheer size of the data generated; think petabytes, exabytes, or even zettabytes. Traditional database systems often fall short when managing such large quantities of data, so specialised Big Data tools and techniques are crucial.

● Velocity: The Speed of Data Generation and Processing – This reflects the pace at which new data is generated and processed. For example, social media posts can be generated at millions per minute, and sensors in a manufacturing plant may transmit data in real time. Swift processing of this data can offer real-time insights and responses.

● Variety: Handling Different Data Types – Unlike traditional databases, which mainly dealt with structured data, Big Data encompasses structured, semi-structured, and unstructured data. This can range from neatly formatted tables to erratic tweets, from images to log files.

● Veracity: Ensuring Data Accuracy and Trustworthiness – Given the diverse sources of Big Data, ensuring its accuracy and reliability becomes paramount. Veracity deals with the uncertainty of data, which could stem from inconsistency, incompleteness, or ambiguity.

● Value: Extracting Insights from Raw Data – The ultimate aim of Big Data is not just storage but extracting actionable insights. This dimension focuses on turning raw data into meaningful information through analytics, data mining, and other such methods.

1.2 Evolution and Impact of Big Data

1. A Brief History of Big Data: From Archives to Real-time

Analytics

● Early Data Archives: Data storage can be traced back to

ancient civilizations that used clay tablets and parchments.

In modern history, we had the emergence of relational

databases in the 1970s that marked a significant step in

structured data storage.

● Late 20th Century: By the 1990s, the digital age was in full
swing. This led to an increase in the volume of data

generated, largely due to the birth of the internet and e-commerce.

● 2000s Onwards: As the internet expanded, so did the

amount of user-generated content. The focus began to shift

from mere storage to processing and analysis. Concepts like

MapReduce and frameworks like Hadoop emerged to

process large datasets.

● Current Trends: Today, real-time analytics play a crucial role.

Tools such as Apache Kafka and Apache Spark allow

businesses to analyse data as it's generated, making

instantaneous decisions.

2. The Pervasive Role of Big Data Across Industries

Big Data has permeated almost every industry, leading to

enhanced decision-making, better customer experiences, and

innovative products and solutions. Its role has been

transformative, bringing about efficiencies that were once

considered unimaginable.
3. Healthcare: Personalized Treatment and Predictive Analytics

Personalised Treatment:

● DNA sequencing and genomics have ushered in an era

where treatments can be tailor-made for individuals.

● Data analytics can identify which treatment may work best

for a patient based on their genetic makeup and medical

history.

Predictive Analytics:

● Hospitals and clinics use data analytics to predict outbreaks

and disease spread.

● Data from wearables and IoT devices can provide insights

into patient health in real-time.

4. Finance: Algorithmic Trading and Risk Management

Algorithmic Trading:

● Traders employ complex algorithms that use historical and

real-time data to make trading decisions within milliseconds.

● These algorithms can adjust themselves based on market

conditions.
Risk Management:

● Financial institutions analyse vast amounts of data to assess

the creditworthiness of individuals or companies.

● Predictive models can identify potential loan defaults or

financial market crashes.

5. Retail: Customer Insights and Inventory Management

Customer Insights:

● Retailers use data analytics to understand customer

behaviour, preferences, and buying patterns.

● Predictive analytics can forecast product demands or the

success of marketing campaigns.

Inventory Management:

● Big Data tools can optimise stock levels, ensuring that goods

are available without excess inventory.

● Analytics can predict which products will be in demand

based on seasonality, market trends, and other variables.

6. Transportation: Traffic Predictions and Route Optimization

Traffic Predictions: Smart cities use data from various sensors and user inputs to predict traffic patterns and congestion.

Route Optimization:

● Navigation systems analyse real-time traffic data to suggest

the fastest route.

● Logistics companies use Big Data to optimise delivery routes,

reducing costs and improving efficiency.

7. Entertainment: Content Recommendations and Audience

Analysis

Content Recommendations: Streaming platforms, like

Netflix, use algorithms to recommend shows and movies

based on user preferences, viewing history, and global

trends.

Audience Analysis:

● Production houses and broadcasters analyse audience

data to gauge the success of shows or movies.

● Data-driven insights can influence content creation and

marketing strategies.
1.3 Big Data Technology and Infrastructure

The vast growth in data volume, velocity, and variety has

necessitated the evolution of data technologies and

infrastructures that can handle and analyse massive datasets

effectively.

Building the Backbone: Storage Solutions for Massive Data Sets

In today's digital age, effective storage solutions are imperative.

With an exponential increase in data generation, it's crucial to

have a robust storage backbone.

● Traditional Storage Systems: These are your conventional

hard drives and server storage. While they work for smaller

datasets, they often prove inadequate for big data due to

speed and capacity limitations.

● Distributed Storage: Solutions like the Hadoop Distributed File System (HDFS) allow for the distribution of data across multiple machines, enhancing data reliability and access speed.

Traditional Databases vs. NoSQL: When to Use Which

Understanding the distinction between traditional and NoSQL

databases is pivotal.

Traditional Databases (RDBMS):

● Structured around relational models.

● Often use SQL for querying.

● Best for structured data with a consistent schema.

NoSQL Databases:

● Allow for more flexible data models.

● Types include document-based, columnar, graph, and

key-value stores.

● Suited for varied data structures and rapid scalability.
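
To make the contrast concrete, the short sketch below stores the same record first in a relational table and then as a document. It uses Python's built-in sqlite3 module for the relational side and the pymongo driver with a local MongoDB instance for the NoSQL side; these specific tools, connection details, and collection names are illustrative choices, not requirements of this module.

```python
# Minimal sketch contrasting a relational table with a document store.
import sqlite3
from pymongo import MongoClient

# Relational (RDBMS): a fixed schema is declared up front and queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Asha", "Pune"))
print(conn.execute("SELECT name FROM users WHERE city = ?", ("Pune",)).fetchall())

# NoSQL (document store): each document carries its own structure, so an
# optional field like 'interests' needs no schema migration.
client = MongoClient("mongodb://localhost:27017/")   # assumed local instance
users = client["shop"]["users"]
users.insert_one({"name": "Asha", "city": "Pune", "interests": ["books", "cycling"]})
print(users.find_one({"city": "Pune"}, {"_id": 0, "name": 1}))
```

Note how the document version can carry an extra field without any schema change, which is precisely the flexibility the NoSQL list above refers to.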

Hadoop Ecosystem: Foundations of Distributed Storage

Hadoop is a cornerstone in big data handling, with its ecosystem

offering a suite of tools for diverse tasks.

● HDFS: A distributed file system designed for high-throughput

access.
● YARN: A resource manager for Hadoop, ensuring optimal

resource usage.

● HBase: A distributed, scalable, big data store.

● Pig and Hive: Tools that allow for data transformation and

querying using a higher-level language than Java.
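
Picking up the HDFS point from the list above, the sketch below simply calls the standard hdfs dfs command line from Python to copy a local file into the cluster and list the target directory. The file name and HDFS path are placeholders, and a configured Hadoop client on the system path is assumed.

```python
# Minimal sketch: copy a local file into HDFS and list the target directory
# by invoking the standard `hdfs dfs` CLI. Paths are placeholders.
import subprocess

local_file = "events.log"        # hypothetical local file
hdfs_dir = "/data/raw/events"    # hypothetical HDFS directory

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)
listing = subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir],
                         capture_output=True, text=True, check=True)
print(listing.stdout)
```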

Cloud Solutions: Scalable and Flexible Data Warehousing

With the rise of cloud technologies, data warehousing has evolved.

● Benefits: On-demand scalability, pay-as-you-go pricing, and

global availability.

● Providers: AWS Redshift, Google BigQuery, and Azure

Synapse Analytics are leading solutions, offering robustness

and flexibility.

Processing Power: Modern Tools and Techniques

Modern big data analysis requires immense computational power.

● In-memory Computing: Tools like Apache Spark allow for faster data processing by storing data in RAM rather than on disk (a short PySpark sketch follows this list).

● Graph Processing: Solutions like Neo4j and Titan enable efficient processing of graph-based data.
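
The sketch referenced above is given here: a minimal PySpark job (assuming the pyspark package is installed) that loads a small DataFrame, caches it in memory, and computes an aggregate. In a real cluster the same code would run over data distributed across many machines; the plant names and readings are toy values.

```python
# Minimal PySpark sketch: in-memory aggregation over a small DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-aggregation").getOrCreate()

readings = spark.createDataFrame(
    [("plant_a", 21.5), ("plant_a", 22.1), ("plant_b", 19.8)],
    ["plant", "temperature"],
)

# cache() keeps the DataFrame in RAM so repeated queries avoid re-reading from disk.
readings.cache()
readings.groupBy("plant").avg("temperature").show()

spark.stop()
```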

MapReduce: Parallel Processing for Large Datasets

Originating at Google, MapReduce allows for the parallel

processing of vast datasets.

● Map Phase: Data is broken down into key-value pairs.

● Reduce Phase: The pairs are consolidated to produce a

dataset of reduced size.
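
The listing below imitates the two phases in plain Python with the classic word-count example. It is only an illustration of the programming model; an actual Hadoop job would run equivalent map and reduce functions in parallel across a cluster.

```python
# Word count in plain Python, mirroring the Map and Reduce phases above.
from collections import defaultdict

documents = ["big data needs big tools", "data drives decisions"]

# Map phase: emit (word, 1) key-value pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: consolidate each key's values into a single result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}
```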

Stream Processing: Analysing Data on the Fly

As data gets generated in real-time, immediate analysis becomes

paramount.

● Apache Kafka: A platform for building real-time data

pipelines.

● Apache Flink and Storm: For stream processing and real-time analytics.
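
As a small illustration of the Kafka point above, the sketch below publishes click events to a topic using the kafka-python package. The broker address, topic name, and event fields are assumptions made for the example; a Flink or Spark job would typically consume the topic downstream for real-time aggregation.

```python
# Minimal producer sketch (kafka-python assumed; broker at localhost:9092).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each click event is appended to the hypothetical "clickstream" topic as it happens.
producer.send("clickstream", {"user_id": 42, "page": "/checkout"})
producer.flush()
```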

Unearthing Insights: Advanced Analytics Technologies

Advanced analytics technologies delve deep into data, extracting

meaningful insights.

● Data Mining: Techniques to identify patterns in large datasets.

● Statistical Analysis: Approaches like regression analysis to

predict and model data behaviours.

Machine Learning in Big Data: Predictive Modelling at Scale

Leveraging big data with machine learning has transformative

potential.

● Deep Learning: Neural networks designed to process vast

amounts of data, useful in fields like image and speech

recognition.

● ML Libraries: Tools like TensorFlow and PyTorch offer

scalable machine learning solutions.
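
As a toy illustration of such libraries, the sketch below fits a one-layer PyTorch model on synthetic data (the torch package is assumed). At Big Data scale the same loop would normally run on GPUs over sharded or streamed datasets rather than a single in-memory batch.

```python
# Tiny PyTorch sketch: a linear regressor trained on a synthetic batch.
import torch
from torch import nn

x = torch.randn(256, 3)                       # 256 toy samples, 3 features
y = x @ torch.tensor([[2.0], [-1.0], [0.5]])  # synthetic target values

model = nn.Linear(3, 1)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()

print(f"final training loss: {loss.item():.4f}")
```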

Complex Event Processing: Real-time Analytics and Decision-making

This involves analysing multiple streams of information from

various sources to detect critical situations.

● CEP Engines: Tools like Esper and SAP Event Stream

Processor that help in detecting patterns in the real-time

flow of events.
Navigating the Murky Waters: Security and Privacy in the Big

Data Era

The surge in data volume brings about intensified security and

privacy challenges.

● Access Controls: Techniques to restrict unauthorised data

access.

● Intrusion Detection Systems (IDS): Monitors networks for

malicious activities or policy violations.

Data Encryption Techniques: Safeguarding Sensitive Information

Encryption plays a pivotal role in ensuring data security.

● Symmetric Encryption: The same key is used for encryption

and decryption.

● Asymmetric Encryption: Uses a pair of public and private

keys.
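
The symmetric case can be illustrated in a few lines with the Fernet recipe from Python's cryptography package (an illustrative library choice, not a prescription). The same key both encrypts and decrypts, so the key itself must be distributed and stored securely.

```python
# Symmetric encryption sketch using the cryptography package's Fernet recipe.
from cryptography.fernet import Fernet

key = Fernet.generate_key()       # shared secret
cipher = Fernet(key)

token = cipher.encrypt(b"patient_id=9411;diagnosis=redacted")
print(token)                      # ciphertext, safe to store or transmit
print(cipher.decrypt(token))      # original bytes, recovered with the same key
```

Asymmetric encryption avoids the shared-secret problem by letting anyone encrypt with the public key while only the private-key holder can decrypt, at the cost of slower operations.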

Privacy Concerns: Balancing Utility with Confidentiality

Harnessing big data's potential while ensuring user privacy is a

balancing act.
● Data Masking: Presenting data in a sanitised form, ensuring

data usability without revealing sensitive information.

● Differential Privacy: Adding noise to the output of queries to

maintain user privacy.
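
A minimal sketch of the differential-privacy idea follows, assuming NumPy: Laplace noise scaled to sensitivity/epsilon is added to the result of a count query before it is released. A count query has sensitivity 1, and a smaller epsilon means stronger privacy but a less accurate answer.

```python
# Differential-privacy sketch: add Laplace noise to a count query result.
import numpy as np

ages = np.array([34, 29, 41, 38, 52, 47, 30])   # toy dataset
true_count = int(np.sum(ages > 40))             # exact answer: 3

epsilon = 0.5                                   # privacy budget (assumed)
noisy_count = true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
print(f"released count: {noisy_count:.1f} (exact value kept private)")
```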

Regulatory Landscape: GDPR, CCPA, and Other Data Protection Laws

Adherence to data protection laws is non-negotiable.

● GDPR: A European regulation dictating data protection and

privacy.

● CCPA: California's data protection law, emphasising user

rights over personal data.

1.4 Challenges and Future Prospects of Big Data

Challenges in Big Data Management

Data Quality and Cleanliness: Ensuring Reliable Analysis

● Ensuring the reliability and accuracy of data is

paramount in Big Data management. Dirty or unclean

data can lead to misleading insights, affecting decision-making processes.

● Datasets might have inconsistencies, missing values, or

erroneous entries which require data preprocessing and

cleaning techniques. The magnitude of data in Big Data

scenarios makes manual cleaning implausible,

necessitating automated or semi-automated techniques.
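
A small pandas sketch of such automated cleaning is shown below; the column names, the plausibility threshold, and the median-imputation choice are all illustrative assumptions.

```python
# Automated cleaning sketch: drop duplicates, discard impossible readings,
# and fill missing values with the column median.
import pandas as pd

raw = pd.DataFrame({
    "sensor_id": [1, 1, 2, 3, 3],
    "temperature": [21.4, 21.4, None, 19.8, 540.0],   # None = missing, 540.0 = bogus
})

clean = (
    raw.drop_duplicates()
       .loc[lambda df: df["temperature"].isna() | (df["temperature"] < 100)]
       .copy()
)
clean["temperature"] = clean["temperature"].fillna(clean["temperature"].median())
print(clean)
```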

Scalability Issues: Expanding Infrastructure with Growing Data

● As the volume of data increases exponentially, the

infrastructure to store, process, and analyse it must scale

accordingly.

● Traditional RDBMS (Relational Database Management

Systems) often struggle with such scale, necessitating

distributed systems like Hadoop or NoSQL databases.

Talent Shortage: The Need for Skilled Big Data Professionals

● The rapid growth in the field of Big Data has outpaced

the availability of skilled professionals adept at handling its nuances.

● There's a rising demand for data engineers, data

scientists, and other related roles. Institutions and

businesses are striving to address this gap, but it remains

a significant challenge.

Horizon Scanning: Upcoming Trends and Innovations in Big Data

Quantum Computing: The Next Frontier for Data Processing?

● Quantum computing promises exponential speed-ups for

certain types of computations. Its application in Big Data

could revolutionise data processing, analysis, and storage.

● While it's still in nascent stages, the eventual maturation

of quantum computing could lead to breakthroughs in

data analytics, AI modelling, and more.

Edge Computing: Processing Data Closer to the Source

● Rather than sending all data to centralised data centres,

edge computing processes data closer to its source (like

IoT devices).
● This reduces the latency and bandwidth needs, offering

more real-time insights. Especially in industries like

healthcare or manufacturing, immediate data processing

can be crucial.

Ethical Considerations: AI and Big Data's Moral Implications

As we harness more data and build advanced AI

models, ethical concerns come to the forefront.

● Privacy Concerns: Gathering massive amounts of

data can infringe upon individual privacy.

Regulations like GDPR have been implemented to

address this, but the debate continues.

● Bias and Fairness: AI models trained on biased

data can perpetuate or exacerbate existing

societal biases. It's essential to ensure fairness in

model training and predictions.

● Transparency and Accountability: With complex

models and algorithms, ensuring transparency in

decision-making processes becomes challenging.


Organisations need to address this to maintain

public trust.

1.5 Summary

❖ Big Data refers to the massive volume of structured and

unstructured data that's too large to process using

traditional database systems.

❖ The V's of Big Data:

● Volume: Large amounts of data.

● Velocity: Speed at which new data is generated and

collected.

● Variety: Different types of data (e.g., text, images,

sound).

● Veracity: The accuracy and trustworthiness of data.

● Value: The importance of turning data into actionable

insights.

❖ Big Data has evolved from simple archives to sophisticated

real-time analytics, driven by advancements in technology and a surge in data generation.

❖ Big Data plays a crucial role across industries, driving

decisions in healthcare, finance, retail, and more by

providing deeper insights into patterns and trends.

❖ The technology behind Big Data includes advanced storage

solutions (like Hadoop and cloud platforms), processing

tools (like MapReduce and Stream Processing), and analytics

technologies for deriving insights.

❖ While Big Data offers immense possibilities, it comes with

challenges like data security, scalability, and quality concerns.

However, emerging trends like quantum computing and

edge processing promise to redefine its future landscape.

1.6 Keywords

● NoSQL: "NoSQL" stands for "not only SQL" and refers to

non-relational database systems designed to handle vast

volumes of structured and unstructured data. They are more

scalable and flexible than traditional relational databases,


making them well-suited for big data and real-time

applications. Examples include MongoDB, Cassandra, and

Couchbase.

● Hadoop Ecosystem: The Hadoop Ecosystem is a framework

and a collection of tools and technologies that use Hadoop

as a backbone to handle big data. Central to the ecosystem

is the Hadoop Distributed File System (HDFS), which stores

vast amounts of data across multiple machines. Tools like

MapReduce enable distributed data processing, while others

like Hive and Pig offer querying and analytics capabilities.

● MapReduce: MapReduce is a programming model and data

processing technique that allows for the parallel processing

of large datasets in a distributed environment, especially

within the Hadoop framework. It involves two primary tasks:

the "Map" task, where input data is divided into chunks and

processed independently, and the "Reduce" task, where the

processed data is aggregated into a smaller set of values.


● Stream Processing: Stream processing is a computational

method designed to analyse and act on real-time data

streams rather than waiting to collect large batches of data.

It's crucial for applications that require real-time analytics

and decision-making, such as financial trading platforms or

social media analytics.

● GDPR: The General Data Protection Regulation (GDPR) is a

regulation introduced by the European Union (EU) to protect

the personal data and privacy of its citizens. It provides

individuals with greater control over their personal data,

including the right to access, correct, or delete their

information. Organisations that handle the data of EU

citizens, irrespective of their location, must comply with

GDPR, which has significant implications for big data

processing and storage.

● Edge Computing: Edge computing refers to the practice of

processing data closer to the source where it is generated,


such as IoT devices or local data centres, rather than sending

it to centralised cloud-based systems. This approach reduces

latency, conserves bandwidth, and allows for faster data

analysis and decision-making, especially important in real-time applications.

1.7 Self-Assessment Questions

1. How would you differentiate between traditional databases

and NoSQL databases in the context of big data storage

solutions?

2. What are the five V's of Big Data, and why are they crucial

for understanding the complexity of modern data

management?

3. Which industries have been most transformed by the

application of Big Data analytics, and can you provide

specific examples of its impact?

4. What security and privacy considerations become

paramount when dealing with Big Data, especially in a global

context with various regulatory environments?


5. Which emerging trends, such as Quantum Computing or

Edge Computing, do you foresee having the most significant

impact on the future of Big Data management and analytics?

Why?

1.8 Case Study

Title: Implementation of Advanced Database Management

System at Tokyo Metro Corporation, Japan

Introduction:

Tokyo Metro Corporation, responsible for managing one of the

world's busiest metropolitan transit systems, sought to enhance

its real-time operational efficiency, safety protocols, and

customer experience. With daily ridership surpassing 9 million,

the existing database system was becoming overwhelmed,

causing lags and inconsistencies in data retrieval and processing.

Challenge:

The conventional relational database structure was struggling

with the massive velocity and volume of data. This led to delays in

real-time tracking, ticketing, and decision-making processes. It was also hindering the integration of diverse data sources,

including traffic patterns, weather data, customer feedback, and

safety metrics.

Solution:

The company collaborated with a leading technology firm to

design and implement a NoSQL database system tailored to its

unique needs. This system utilised technologies such as Apache

Hadoop and Apache Cassandra for distributed storage, coupled

with real-time analytics tools like Apache Kafka for streaming data.

1. Scalability and Performance: The NoSQL architecture

provided horizontal scalability, allowing the system to grow

and shrink with fluctuating data demands.

2. Integration of Variety: The new system could integrate

various data sources, providing a unified view of operations.

3. Real-Time Analytics: Utilising stream processing enabled

Tokyo Metro to analyse data on the fly, providing actionable

insights into various operational aspects, such as predicting

train delays or optimising scheduling.


Outcome:

The implementation led to a 30% increase in system efficiency, a

reduction in decision-making time, and a marked improvement in

customer satisfaction. By embracing modern database

technologies, Tokyo Metro managed to stay ahead of the growing

demands of urban transportation, offering a robust, resilient, and

responsive service.

Questions:

1. Evaluate the Challenges: What were the primary challenges

faced by Tokyo Metro Corporation with the existing

database system, and how did the newly implemented

NoSQL solution address these challenges?

2. Technological Considerations: Discuss the specific

technologies (such as Hadoop, Cassandra, and Kafka) used in

this case. How did they contribute to the efficiency and

scalability of the system?

3. Ethical and Security Implications: Considering the sensitive

nature of the data involved, what might be the ethical and security considerations that Tokyo Metro Corporation

should keep in mind while handling this large-scale data?

How can these be effectively addressed?

1.9 References

1. "Big Data: Principles and Best Practices of Scalable Real-time

Data Systems" by Nathan Marz and James Warren.

2. "Data Science for Business: What You Need to Know about

Data Mining and Data-Analytic Thinking" by Foster Provost

and Tom Fawcett.

3. "Hadoop: The Definitive Guide" by Tom White.

4. "Database System Concepts" by Abraham Silberschatz,

Henry F. Korth, and S. Sudarshan.

5. "NoSQL Distilled: A Brief Guide to the Emerging World of

Polyglot Persistence" by Pramod J. Sadalage and Martin

Fowler.
