BDT Unit 02 - Part1

The CET4001B Big Data Technologies course aims to provide students with an understanding of Big Data concepts, NoSQL databases, and application design for distributed systems. It covers various database types, including centralized, distributed, relational, and NoSQL databases, along with their advantages and limitations. The course also emphasizes the use of Big Data visualization tools and the application of these technologies in real-world scenarios.

Uploaded by

chaitealattenaanbread

CET4001B Big Data

Technologies
Department of Computer Engineering and
Technology

1/16/2024 Big Data Analytics Lab 1

CET4001B Big Data Technologies

Teaching Scheme Credits: 02 + 01

Theory: 3 Hrs / Week Practical: 2Hrs/Week

Course Objectives:
• Understand the various aspects of Big Data.
• Learn the concepts of NoSQL for Big Data.
• Design an application for distributed systems on Big Data.
• Explore the various Big Data visualization tools.

Course Outcomes:
• Apply the insights of Big Data in business applications.
• Illustrate the application of MongoDB in real-world applications.
• Build Hadoop-based distributed systems for real-world problems.
• Apply and utilize Big Data visualization tools for real-world applications.

Unit-II: NoSQL databases for Big Data

• Types of databases
• Structured versus unstructured data
• NoSQL movement and concept of NoSQL database
• Comparative study of SQL and NoSQL
• Types and examples of NoSQL databases: key-value store, document store, columnar databases, graph databases
• Characteristics of NoSQL
• NoSQL data modelling
• Advantages of NoSQL
• CAP theorem
• BASE properties
• Sharding: characteristics, advantages, types
• NoSQL using MongoDB: MongoDB shell, data types, CRUD operations, querying, aggregation framework operators, indexing

Types of Databases

Centralized Database

• A centralized database stores all data in a single, central database system.
• It lets users access the stored data from different locations through several applications; each application includes an authentication process so that users access the data securely.
• Example of a centralized database: a central library that holds a central database of each library in a college/university.

Advantages of Centralized Database

• It has decreased the risk of data management, i.e., manipulation of data will not affect the core data.
• It is less costly because fewer vendors are required to handle the data sets.
• It provides better data quality, which enables organizations to establish data standards.
• Data consistency is maintained as it manages data in a central repository.

Limitations of Centralized Database

• The size of the centralized database is large, which increases the response time for fetching the data.
• It is not easy to update such an extensive database system.
• If any server failure occurs, the entire data will be lost, which could be a huge loss.

Distributed Database System
• Data is stored at a number of sites; each site logically consists of a single processor.
• Processors at different sites are interconnected by a computer network (no multiprocessors; those are parallel database systems).
• A distributed database is a database, not a collection of files: the data is logically related, as exhibited in the users' access patterns (relational data model).
• A D-DBMS is a full-fledged DBMS, not a remote file system and not a TP system.


Distributed Database

• Unlike a centralized database system, data is distributed among different database systems of an organization.
• These database systems are connected via communication links, which help end-users access the data easily.
• Examples: Apache Cassandra, HBase, Ignite, etc.

Distributed Database Types

Homogeneous DDB: all sites
• execute on the same operating system
• use the same application process
• carry the same hardware devices

Heterogeneous DDB: sites
• execute on different operating systems
• run under different application procedures
• carry different hardware devices

Advantages

• Increased reliability and availability
  • Reliability is defined as the probability that a system is running at a certain time, whereas availability is defined as the probability that the system is continuously available during a time interval.
• Easier expansion
  • In a distributed environment, expanding the system by adding more data, increasing database sizes, or adding more processors is much easier.
• Improved performance
  • Inter-query parallelism is achieved by executing multiple queries at different sites; intra-query parallelism is achieved by breaking up a query into a number of sub-queries that execute in parallel. Both lead to improved performance.


Advantages of Distributed
Database

• Modular development is possible, i.e., the system can be expanded by including new computers and connecting them to the distributed system.
• One server failure will not affect the entire data set.
• A user doesn't know where the data is located physically; the data is presented to the user as if it were located locally.
• Data can be joined and updated from different tables which are located on different machines.
• It is secure.

Disadvantages
• A distributed database is quite complex.
• It is more expensive because it is complex and hence difficult to maintain.
• Because the system is distributed, the database as a whole and each and every node must be secured.
• Data integrity and control of data redundancy are harder to maintain.


Limitations of Distributed Databases

• Network traffic is increased in a distributed database.
• Different data formats are used in different systems.
• While recovering a failed system, the DBMS has to make sure that the recovered system is consistent with other systems.
• Managing distributed deadlock is a difficult task.

Relational Database

• Based on the relational data model, which stores data in the form of rows (tuples) and columns (attributes) that together form a table (relation).
• Uses SQL for storing, manipulating, as well as maintaining the data.
• E. F. Codd proposed the relational data model in 1970.
• Each table carries a key that makes each row of data unique from the others.
• Examples: MySQL, Microsoft SQL Server, Oracle, etc.

Relational Database Representation

Cloud Database

• Data is stored in a virtual environment and executes over the cloud computing platform.
• Provides users with various cloud computing services (SaaS, PaaS, IaaS, etc.) for accessing the database.
• SaaS: Software as a Service; PaaS: Platform as a Service; IaaS: Infrastructure as a Service.

Cloud Database

• There are numerous cloud platforms; commonly cited options include:
  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Kamatera
  • phoenixNAP
  • ScienceSoft
  • Google Cloud SQL, etc.

Object-oriented Database

• Uses the object-based data model approach for storing data in the database system.
• Data is represented and stored as objects, similar to the objects used in object-oriented programming languages.
• Examples: GemStone (Smalltalk), Gbase (LISP).

Object-oriented Database Representation

Hierarchical Database

• Stores data in the form of parent-children relationship nodes.
• Organizes data in a tree-like structure.
• Mandates that each child record has only one parent, whereas each parent record can have one or more child records.
• In order to retrieve data from a hierarchical database, the whole tree needs to be traversed starting from the root node.
• Examples: IBM Information Management System (IMS), RDM Mobile.

Network Database

• Typically follows the network data model.
• The representation of data is in the form of nodes connected via links between them.
• Unlike the hierarchical database, it allows each record to have multiple children and parent nodes to form a generalized graph structure.
• Some well-known database systems that use the network model include:
  • Raima Database Manager
  • Integrated DMS (IDMS)
  • TurboIMAGE

NoSQL Database

Not Only SQL:
• A type of database that is used for storing a wide range of data sets.
• Non-relational.
• Presents a wide variety of database technologies in response to the demands.
• Schema-less.
• Relaxes one or more of the ACID properties (explained along with the CAP theorem).

Motivation for NoSQL
Database

Limitations of relational databases

Not suitable for distributed applications because:
• Joins are expensive
• Hard to scale horizontally
• Can't handle unstructured data
• Expensive: product cost, hardware maintenance
• High availability not possible
• No partition tolerance
• Speed is less

Structured vs Unstructured

• Structured systems are those where the activity of processing and output is predetermined and highly organized.
  • ATM transactions, airline reservations, manufacturing inventory control systems, and point of sale systems are all forms of structured systems.
• Unstructured systems are those that have little or no predetermined form or structure.
  • Unstructured systems include email, reports, contracts, and other communications.
  • The rules of unstructured systems are fewer and less complex.

Analyzing unstructured data requires processing huge amounts of data, a task that SQL databases are not well suited to and were never designed for.

NoSQL databases have evolved to handle this huge volume of data properly.

Data Structures

• Data comes in multiple forms, including structured and unstructured.
• 80–90% of future data growth is coming from non-structured data types.
• Structured data: data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
• Semi-structured data: textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
• Unstructured data: data that has no inherent structure, which may include text documents, PDFs, images, and video.

Types of data in Big Data Scenario
• Structured data:
  ■ Is highly organized and formatted so that it is easily searchable in relational databases.
  ■ Common relational database applications with structured data: airline reservation systems, inventory control, sales transactions, and ATM activity.

• Unstructured data:
  ■ Has no pre-defined format or organization, making it much more difficult to collect, process, and analyze.
  ■ May have internal structure, but is not structured via pre-defined data models or schema.
  ■ May be textual or non-textual, and human- or machine-generated.
  ■ May also be stored within a non-relational database like NoSQL.

Types of data in Big Data
Scenario

• Semi-structured data:
  ■ Maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies.
  ■ Both documents and databases can be semi-structured.
  ■ This type of data represents only about 5-10% of the structured/semi-structured/unstructured data pie.
  ■ Typical use is in OO models.
  ■ Examples: CSV, XML, JSON.

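To make the idea of "internal tags" concrete, here is a minimal sketch in Python using only the standard `json` module. The two records and the helper `field_names` are invented for illustration; the point is only that each record names its own fields, so records in the same file need not share a schema.

```python
import json

# Two records with different fields: the keys act as the "internal tags"
# that identify each data element, but no fixed schema is enforced.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["111", "222"], "address": {"city": "Pune"}}
]
"""

records = json.loads(raw)

def field_names(record):
    """Return the sorted list of top-level tags present in one record."""
    return sorted(record.keys())

for rec in records:
    print(rec["name"], field_names(rec))
```

The nested `address` object in the second record is an example of the grouping and hierarchy that the tags make possible.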
Difference between Structured and Unstructured Data

Motivation for NoSQL Database
• Relational databases are not suitable for distributed computing.

Motivation for NoSQL Database: Performance of RDBMS for various applications

Addressing system growth

• Every database has to be scaled to address the huge amount of data being generated each day.
• This makes data available at all times for users.
• When the memory of the database is drained, or when it cannot handle multiple requests, it is not scalable.
• Scalability: the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
• Two approaches: vertical scaling and horizontal scaling.

Scaling databases

• Elasticity: the degree to which a system can adapt to workload changes by provisioning and de-provisioning resources in an on-demand manner, such that at each point in time the available resources match the current demand as closely as possible.

• Types of scaling: vertical scaling and horizontal scaling.

Vertical Scaling

Vertical scaling: adopted when the database can no longer handle the amount of data.
• Example: suppose you have a database server with 10 GB of memory and it is exhausted. To handle more data, you buy a more expensive server with 2 TB of memory.
• It involves adding more power, such as CPU and disk capacity, to a single server.
• It is applicable to applications involving a limited range of users and minimal querying.
• Application: relational databases mostly use vertical scaling.
Vertical Scaling

Advantages:
• Simple, since everything exists in a single server. No need to manage multiple instances.
• Performance gain, because you have faster RAM and memory power on each update.
• Same code: you need not change your implementation or your code at all.

Limitations:
• Difficult to perform multiple queries simultaneously.
• Chances of downtime are high when the server exceeds maximum load.
• Expensive. Hardware resources are costly, after all.

Horizontal Scaling

Horizontal scaling: scaling out by adding more machines.
• It divides the data set and distributes the data over multiple servers, or shards.
• Each shard is an independent database.
• Instead of buying a single 2 TB server, we buy two hundred 10 GB servers.
• If your application can tolerate redundancy and involves few joins, then you can use horizontal scaling.
• Applications: NoSQL databases mostly use horizontal scaling. It is less suitable for an RDBMS, which relies on strict consistency and atomicity rules.

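The idea of dividing a data set over independent shards can be sketched in a few lines of Python. This is an illustrative toy, not how any particular database does it: the shard count, the use of an MD5 hash, and the names `shard_for`, `put` and `get` are all assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Each shard is modelled as an independent dictionary.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    put(user_id, {"id": user_id})
```

A lookup by key touches exactly one shard. A join across keys may have to contact several shards, which is why horizontal scaling suits workloads with few joins.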
Horizontal Scaling

Advantages:
● Cheap compared to vertical scaling.
● Lesser load, better performance.
● Chances of downtime are less.
● Resilience and fault tolerance.
● Suitable for distributed databases.

Limitations:
● Making joins is difficult, due to cross-server communication.
● Only eventual consistency is possible.
● It may not be best suited for bank transactions.

Horizontal and Vertical scaling
What is NoSQL?

• NoSQL is a non-relational database management system.
• It is designed for distributed data stores that need to store data at very large scale.
  • e.g. Google or Facebook, which collect terabits of data every day for their users.
• These data stores may not require fixed-table schemas, usually avoid join operations, and typically scale horizontally.
• "Non-relational" may be a more accurate term than "NoSQL", as some NoSQL DBs do support a subset of SQL.
Characteristics of NoSQL Databases

• NoSQL avoids:
  • Overhead of ACID transactions
  • Complexity of SQL queries
  • Burden of schema design
  • DBA presence
• NoSQL provides:
  • Easy and frequent changes to the DB
  • Fast development
  • Handling of large data volumes (Big Data applications)
  • Schema-less data

When and when not to use NoSQL

NoSQL: Applications and Popularity

Schema-less Data Model

• No fixed schema to consider
• No implicit datatypes
• Most considerations are handled at the application layer, including transactions
• All aggregate data is gathered in documents

CAP – NoSQL Data models

Three basic requirements exist in a special relation when designing applications for a distributed architecture:

• Consistency: the data in the database remains consistent after the execution of an operation.
• Availability: the system is always on (service guarantees availability), with no downtime.
• Partition Tolerance: the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.

Generally, it is not possible to fulfil all three requirements in a distributed system. The CAP theorem states that a distributed system can satisfy at most two of the three requirements.

Distributed systems must be partition tolerant (P), so current NoSQL databases follow different combinations of C and A from the CAP theorem.

CAP Theorem

• Acronym for Consistency, Availability and Partition Tolerance
  • Consistency: once data is written, all future read requests will contain that data.
  • Availability: the database is always available and responsive.
  • Partition Tolerance: if part of the database is unavailable, other parts are not affected.

CAP Theorem
CA: Consistent and Available
• Example: a standalone MySQL server/node which has no replication. It provides consistency and availability until it goes down.
• Applications: bank account balances and text messages, which require higher consistency. RDBMS are CA systems.

AP: Available and Partition Tolerant
• Example: a distributed NoSQL database where replication to nodes happens asynchronously. The system will always respond, but not all the nodes will have the latest version of the data when queried.
• Applications: e-commerce sites, which focus on high availability in case of partitions in a distributed environment by trading off consistency.

CP: Consistent and Partition Tolerant
• Similar to CA systems, but the difference is its applicability to a distributed environment.
• Example: in MongoDB the data on the primary node is replicated to secondary nodes. If the primary node fails, the system switches to a secondary node. During this switch, data is not made available to the user.

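The difference between the CP and AP choices can be illustrated with a small Python simulation. This is a deliberately simplified model with two in-memory replicas; the class `Replica` and the functions `write`, `read_cp` and `read_ap` are hypothetical names for this sketch and do not correspond to any database API.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.reachable = True  # False simulates a network partition

def write(replicas, value):
    """Apply a write to every replica that is currently reachable."""
    for r in replicas:
        if r.reachable:
            r.value = value

def read_cp(replica, replicas):
    """CP style: refuse to answer unless all replicas can be contacted."""
    if not all(r.reachable for r in replicas):
        raise TimeoutError("partition: cannot guarantee up-to-date data")
    return replica.value

def read_ap(replica):
    """AP style: always answer from the local copy, even if it is stale."""
    return replica.value

a, b = Replica(), Replica()
write([a, b], "v1")
b.reachable = False      # a partition cuts replica b off
write([a, b], "v2")      # only replica a receives the update
```

After the partition, `read_ap(b)` still answers but returns the stale `"v1"`, while `read_cp(a, [a, b])` raises an error instead of risking an out-of-date answer.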
CAP Theorem

CAP – NoSQL Data models

• CA: single-site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
• CP: some data may not be accessible, but the rest is still consistent/accurate.
• AP: the system is still available under partitioning, but some of the data returned may be inaccurate.

CAP Theorem

• No distributed system is safe from network failures, thus network partitioning generally has to be tolerated.
• In the presence of a partition, one is then left with two options: consistency or availability.
• When choosing consistency over availability: the system will return an error or a time out if particular information cannot be guaranteed to be up to date due to network partitioning.
• When choosing availability over consistency: the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.

CAP Theorem

In the absence of network failure, that is, when the distributed system is running normally, both availability and consistency can be satisfied.

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

A network partition refers to network decomposition into relatively independent subnets for their separate optimization, as well as a network split due to the failure of network devices. In both cases the partition-tolerant behavior of subnets is expected.

Partition tolerance, or robustness, means that a given system continues to operate even with data loss or node failure.

A single node failure should not cause system failure.

CAP Theorem

• Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability.
• Systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.

Limitations of CAP theorem

▪ When companies such as Google and Amazon were designing large-scale databases, 24/7 availability was key.
  ▪ A few minutes of downtime means lost revenue.
▪ When horizontally scaling databases to 1000s of machines, the likelihood of a node or a network failure increases tremendously.
▪ Therefore, in order to have strong guarantees on availability and partition tolerance, they had to sacrifice "strict" consistency (implied by the CAP theorem).
▪ The CAP theorem proves that it is impossible to guarantee strict consistency and availability while being able to tolerate network partitions.
▪ This resulted in databases with relaxed ACID guarantees.

Trading-Off Consistency

▪ Maintaining consistency should balance the strictness of consistency against availability/scalability.
  ▪ Good-enough consistency depends on your application.

Loose Consistency | Strict Consistency
Easier to implement, and is efficient | Generally hard to implement, and is inefficient

The BASE Properties

▪ To overcome the limitations of the CAP theorem, the BASE properties were introduced and implemented by such databases, as follows:
  ▪ Basically Available: the system guarantees availability
  ▪ Soft-State: the state of the system may change over time
  ▪ Eventual Consistency: the system will eventually become consistent

The BASE Properties

▪ Horizontally scalable databases use BASE properties.

Basically Available: the system guarantees availability
▪ BASE databases spread data across many storage systems with a high degree of replication, which guarantees availability.
▪ In the unlikely event that a failure disrupts access to a segment of data, this does not necessarily result in a complete database outage.

Soft-State: the state of the system may change over time
▪ Values stored in the system may change because of the eventual consistency model.

Eventual Consistency: the system will eventually become consistent
▪ As data is added to the system, the system's state is gradually replicated across all nodes. During the short period of time before all updated blocks are replicated, the state of the file system isn't consistent.

By sacrificing permanent consistency in favor of eventual consistency, developers enable horizontal scalability.
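Eventual consistency can be made concrete with a short Python sketch. It assumes a single primary that acknowledges writes immediately and a queue of updates that are applied to the secondaries later; the names `Node`, `pending`, `write` and `replicate_one` are invented for this illustration and do not model any specific product.

```python
from collections import deque

class Node:
    def __init__(self):
        self.data = {}

primary = Node()
secondaries = [Node(), Node()]
pending = deque()  # updates not yet applied to the secondaries

def write(key, value):
    """Acknowledge the write after updating the primary only."""
    primary.data[key] = value
    pending.append((key, value))

def replicate_one():
    """Propagate the oldest pending update to every secondary."""
    if pending:
        key, value = pending.popleft()
        for node in secondaries:
            node.data[key] = value

write("x", 1)
stale = secondaries[0].data.get("x")   # read before replication has run
replicate_one()
fresh = secondaries[0].data.get("x")   # read after the replicas converge
```

Between `write` and `replicate_one` the secondaries are in the "soft state" described above: a read from them can miss the latest value, but once the queue drains all nodes hold the same data.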
BASE Transactions

Acronym contrived to be the opposite of ACID:
• Basically Available
• Soft state
• Eventually Consistent

Characteristics:
• Weak consistency (stale data OK)
• Availability first
• Best effort
• Simpler and faster

ACID - BASE

ACID | BASE
Atomicity | Basically Available (CP)
Consistency | Soft-state
Isolation | Eventually consistent (AP)
Durability |

Data Models and Types of NoSQL DBs

NoSQL Data Model

Types of NoSQL databases
• Distributed Key-Value Systems: look up a single value for a key
  • Amazon's Dynamo
• Document-based Systems: access data by key or by search of "document" data
  • CouchDB
  • MongoDB
• Column-based Systems
  • Google's BigTable
  • Facebook's Cassandra
• Graph-based Systems: use a graph structure
  • Google's Pregel
  • Neo4j

NoSQL Database Types

Key-Value Pair (KVP) Stores
• Access data (values) by strings called keys.
• Data has no required format; it may have any format.
• Extremely simple interface
  • Data model: (key, value) pairs
  • Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
• Implementation: efficiency, scalability, fault-tolerance
  • Records distributed to nodes based on key
  • Replication
  • Single-record transactions, "eventual consistency"
• Example systems
  • Amazon Dynamo
“Value” is stored as a “blob”
- Without caring or knowing what is inside

- Application is responsible for understanding the data

In simple terms, a NoSQL Key-Value store is a single table with

two columns: one being the (Primary) Key, and the other being
the Value.

Each record may have a different

NoSQL Database Types: Key-value stores

• Simplest type of database storage.
• Stores a single item as a key (or attribute name) together with its value.
• Like a hash.
• Does not require a specific data format; the value can be in any format.
• Data model: (key, value) pairs.
• Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)
• Examples: Amazon DynamoDB, Redis, Riak.

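The four basic operations can be sketched as a tiny in-memory class in Python. This is an illustration of the interface only, not a real store: the class name `KeyValueStore` is hypothetical, and `update` is assumed here to take the new value as a second argument.

```python
class KeyValueStore:
    """In-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def fetch(self, key):
        return self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]

store = KeyValueStore()
store.insert("user:1", b'{"name": "Asha"}')   # the value is just a blob of bytes
store.update("user:1", b'{"name": "Asha K"}')
```

The store never looks inside the value. Interpreting the bytes, for example as JSON, is left to the application, which matches the "blob" description above.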
Key-Value Store

UNIT II- 75
Column-based Data Model

• Column family:
  • A column is the smallest instance of data.
  • It is a tuple containing a name, a value and a timestamp.
• Examples: Apache Cassandra, used by Facebook

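A column as a (name, value, timestamp) tuple can be sketched in Python as follows. The layout below (row key, then column name, then a list of versions) and the helpers `put` and `latest` are assumptions chosen for illustration; real column stores organize and persist this data quite differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # a simple counter here; real systems use clock time

# row key -> column name -> list of versions of that column
column_family = {}

def put(row_key, name, value, timestamp):
    row = column_family.setdefault(row_key, {})
    row.setdefault(name, []).append(Column(name, value, timestamp))

def latest(row_key, name):
    """Return the version of a column with the highest timestamp."""
    return max(column_family[row_key][name], key=lambda c: c.timestamp)

put("user1", "email", "old@example.com", timestamp=1)
put("user1", "email", "new@example.com", timestamp=2)
put("user2", "city", "Pune", timestamp=1)  # rows need not share columns
```

Keeping every timestamped version is what makes this model convenient for versioned data: `latest` picks the newest value, while older versions remain available.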
Column-based Data Model
This type of data store is good for

(1) Distributed data storage, especially versioned data

because of the time-stamps.

(2) Large-scale, batch-oriented data processing: sorting,

parsing, conversion, algorithmic crunching, etc.

(3) Exploratory and predictive analytics – Business

Intelligence.
Graph Database Systems

• Data model: nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles

Graph Databases

❖ Based on graph theory
❖ Vertical scaling
❖ No clustering
❖ Transactions exist
❖ ACID followed
❖ Examples: Neo4j, Amazon Neptune, OrientDB, Dgraph.

•In general, graph
databases are useful when
you are more interested in
relationships between data
than in the data itself: for
example, in representing
and traversing social
networks, generating
recommendations, or
conducting forensic
investigations (e.g. pattern
detection).
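As a rough illustration of relationship-centred queries, the following Python sketch builds a small social graph and recommends friends-of-friends. The people, the `FRIEND` edge label and the function `recommend` are made up for this example; a graph database would express the same traversal in its own query language.

```python
from collections import defaultdict

# Edges carry a label (the relationship type), as in a property graph.
edges = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("alice", "FRIEND", "dave"),
    ("dave", "FRIEND", "carol"),
    ("dave", "FRIEND", "erin"),
]

adjacency = defaultdict(set)
for src, label, dst in edges:
    if label == "FRIEND":          # friendship is treated as undirected
        adjacency[src].add(dst)
        adjacency[dst].add(src)

def recommend(person):
    """Suggest friends-of-friends, ranked by number of mutual friends."""
    counts = defaultdict(int)
    for friend in adjacency[person]:
        for candidate in adjacency[friend]:
            if candidate != person and candidate not in adjacency[person]:
                counts[candidate] += 1
    return sorted(counts, key=lambda c: (-counts[c], c))
```

For `"alice"`, `carol` is reached through two mutual friends and `erin` through one, so `carol` ranks first. The query only follows edges; it never inspects node contents.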
Document-oriented Database:

• Used to store data as JSON-like documents.
• Helps developers in storing data by using the same document-model format as used in the application code.
• Examples: MongoDB, CouchDB
• JSON: acronym for JavaScript Object Notation, an open-standard file format or data interchange format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types.
Document-oriented Database:

• Pairs each key with a complex data structure.
• Indexing: using B-Trees.
• Data is stored in the form of documents, which may contain multiple different key-value pairs or even nested documents.

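A document collection with nested documents can be mimicked in plain Python, which may help when reading the MongoDB material later. The sample documents and the helpers `get_path` and `find` are assumptions for this sketch; they imitate the idea of querying by a dotted field path but are not a database API.

```python
# A collection is modelled as a list of documents (plain dictionaries).
products = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16, "cpu": "x86"},
     "tags": ["electronics", "computers"]},
    {"_id": 2, "name": "Notebook", "pages": 200,
     "tags": ["stationery"]},       # different fields: no fixed schema
]

def get_path(doc, dotted_path):
    """Follow a dotted path such as 'specs.ram_gb' into nested documents."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(collection, dotted_path, expected):
    """Return documents whose value at dotted_path equals expected."""
    return [d for d in collection if get_path(d, dotted_path) == expected]
```

Calling `find(products, "specs.ram_gb", 16)` returns only the laptop document; the second document simply lacks the `specs` field, with no need for null-filled columns.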
Document databases are good for storing and managing
Big Data-size collections of literal documents, like text
documents, email messages, and XML documents, as
well as conceptual “documents” like de-normalized
(aggregate) representations of a database entity such as a
product or customer.

They are also good for storing “sparse” data in general,

that is to say irregular (semi-structured) data that would
require an extensive use of “nulls” in an RDBMS.
Advantages of NoSQL Database

• Enables good productivity in application development, as it is not required to store data in a structured format.
• Is a better option for managing and handling large data sets.
• Provides high scalability.
• Users can quickly access data from the database through key-value lookups.
• Designed for use with low-cost commodity hardware.
• Massive volumes of data (Big Data) are easily handled by NoSQL databases.
• Economy: can be easily installed on cheap commodity hardware clusters as transaction and data volumes increase. This means that you can process and store more data at much less cost.
• Dynamic schemas: NoSQL databases need no schemas to start working with data.

NoSQL Document-Based Data Model
MongoDB Database

UNIT II- 86
History of MongoDB
MongoDB Overview
What is MongoDB?

❖ A powerful, flexible and scalable general-purpose database. It is an agile database that allows schemas to change quickly as applications evolve.
❖ A NoSQL database.
JSON Format
JSON Cont…
JSON Features

● Lightweight, text-based open standard for human-readable data interchange.
● Derived from JavaScript.
● Key aspects for data transfer are simplicity, extensibility, interoperability, openness and human readability.
● Can be parsed by a JavaScript parser.
● Can represent simple and complex data.
● Supports Unicode.
● Can be used in AJAX.
● Uses key-value pairs.
● Is a collection of objects and arrays.
JSON Datatypes

● Strings

● Number

● Boolean,

● Objects and

● Arrays
JSON Format Example
{
  "book": [
    {
      "id": "01",
      "language": "Java",
      "edition": "third",
      "author": "Herbert Schildt"
    },
    {
      "id": "07",
      "language": "C++",
      "edition": "second",
      "author": "E.Balagurusamy"
    }
  ]
}
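As a quick check of the format, the book example can be loaded with Python's standard `json` module when each book is written as its own object inside the array. The variable names below are arbitrary; the sketch only shows that JSON objects map to dictionaries and JSON arrays map to lists.

```python
import json

text = """
{
  "book": [
    {"id": "01", "language": "Java", "edition": "third",
     "author": "Herbert Schildt"},
    {"id": "07", "language": "C++", "edition": "second",
     "author": "E.Balagurusamy"}
  ]
}
"""

data = json.loads(text)                 # JSON object -> dict, array -> list
languages = [b["language"] for b in data["book"]]
round_trip = json.dumps(data, sort_keys=True)  # dict -> JSON text again
```

`json.loads` raises `json.JSONDecodeError` on malformed input, such as a missing comma between members, so it is a convenient way to validate hand-written JSON.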
BSON
• “Binary JSON”

• Binary-encoded serialization of JSON-like docs

• a computer interchange format that is mainly used for data storage

and as a network transfer format in the MongoDB database.

• a simple binary form which is used to represent data structures and

associative arrays (often called documents or objects in MongoDB).
BSON Example

{
  "_id" : "37010",
  "city" : "ADAMS",
  "pop" : 2660,
  "state" : "TN",
  "councilman" : {
    "name" : "John Smith",
    "address" : "13 Scenic Way"
  }
}
Why MongoDB?
Why MongoDB? Cont…
Replica
Pros and Cons of MongoDB
SQL vs MongoDB

SQL Concepts | MongoDB Concepts
Database | Database
Table, View | Collection
Row | Document (BSON document)
Column | Field
Index | Index
Table join | Embedded documents and linking
Primary key: specify any unique column or column combination as primary key | Primary key: automatically set to the _id field
Aggregation (e.g. GROUP BY) | Aggregation pipeline
Schema design
⚫ RDBMS: join
MongoDB: Hierarchical Objects

• A MongoDB instance may have zero or more ‘databases’

• A database may have zero or more ‘collections’.

• A collection may have zero or more ‘documents’.

• A document may have one or more ‘fields’.

• MongoDB ‘Indexes’ function much like their RDBMS

counterparts.

UNIT-II
Replication
⚫ Replica Sets and Master-Slave
⚫ Replica sets are a functional superset of master/slave and are handled by much newer, more robust code.
⚫ Only one server is active for writes (the primary, or master) at a given time; this allows strongly consistent (atomic) operations. One can optionally send read operations to the secondaries when eventual consistency semantics are acceptable.
Why Replica Sets
⚫ Data Redundancy
⚫ Automated Failover
⚫ Read Scaling
⚫ Maintenance
⚫ Disaster Recovery(delayed secondary)
Sharding for horizontal scaling

• Sharding : a method for distributing data across multiple machines

• Sharded Cluster : A MongoDB sharded cluster consists of the following components:
– shard: Each shard contains a subset of the sharded data. Each shard can be deployed as
a replica set.
– mongos: The mongos acts as a query router, providing an interface between client
applications and the sharded cluster.
– config servers: Config servers store metadata and configuration settings for the cluster.
As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS).
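How the three components cooperate can be sketched with a toy range-based router in Python. This is a simplified model under stated assumptions: the chunk boundaries, the shard names `shardA` and `shardB`, the `user_id` shard key and the functions `route`, `insert` and `find_one` are all hypothetical, and MongoDB's actual chunk metadata and routing are considerably more involved.

```python
import bisect

# Hypothetical chunk table, standing in for config-server metadata:
# each chunk covers shard-key values from its lower bound (inclusive)
# up to the next chunk's lower bound (exclusive).
chunk_lower_bounds = [0, 1000, 2000]
chunk_to_shard = ["shardA", "shardB", "shardA"]

shards = {"shardA": {}, "shardB": {}}

def route(shard_key: int) -> str:
    """Router step: find the chunk for a key and return its shard name."""
    if shard_key < chunk_lower_bounds[0]:
        raise ValueError("shard key below the first chunk")
    index = bisect.bisect_right(chunk_lower_bounds, shard_key) - 1
    return chunk_to_shard[index]

def insert(doc: dict) -> None:
    shards[route(doc["user_id"])][doc["user_id"]] = doc

def find_one(user_id: int):
    return shards[route(user_id)].get(user_id)

for uid in (5, 1000, 1500, 2500):
    insert({"user_id": uid})
```

In this sketch the client only ever calls `insert` and `find_one`; the chunk table plays the role of the config servers and `route` plays the role of the query router, so the client never needs to know which shard holds a document.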

UNIT-II
Sharding
• Sharding is the process of distributing data across multiple
machines and it is MongoDB's approach to meet the demands of
data growth.

• As the size of the data increases, a single machine may not be
sufficient to store the data or to provide acceptable read and
write throughput.

• MongoDB uses sharding to support deployments with very large
data sets and high-throughput operations.

Sharded Cluster
• Sharded Cluster : A MongoDB sharded
cluster consists of the following components:
– Shard
– mongos
– config servers

Replication & Sharding conclusion
⚫ Sharding is the tool for scaling a system, and
replication is the tool for data safety, high availability,
and disaster recovery. The two work in tandem yet are
orthogonal concepts in the design.
Shard
– Shard: Each shard contains a subset of the sharded data. Each shard
can be deployed as a replica set.

– A replica set is a group of mongod instances that host the same
data set.
– In a replica set, one node is the primary, which receives all write
operations.
– All other instances, the secondaries, apply operations from the
primary so that they have the same data set.
– A replica set can have only one primary node.
mongod

• mongod is the primary daemon process for the
MongoDB system.

• It handles data requests, manages data access,
and performs background management
operations.

mongos
– mongos:
• The mongos acts as a query router, providing an interface
between client applications and the sharded cluster.

– config servers:
• Config servers store metadata and configuration settings for
the cluster.
• As of MongoDB 3.4, config servers must be deployed as a
replica set (CSRS, Config Servers as Replica Sets).

Config servers (contd)

• Config servers store the metadata for a sharded cluster.

• The metadata reflects the state and organization of all data
and components within the sharded cluster.

• The metadata includes the list of chunks on every shard.

• Each sharded cluster must have its own config servers.

Sharded and Unsharded collections

• A database can have a mixture of sharded and
unsharded collections.

• Unsharded collections are stored on a primary
shard.

• Sharded collections are partitioned and
distributed across the shards in the cluster.

Sharding

• Deploying multiple mongos routers supports high
availability and scalability.

• A common pattern is to place a mongos on each
application server.

Sharding Requirements

• Sharding requires at least two shards to distribute
sharded data.

• A single-shard sharded cluster may be useful if you
plan on enabling sharding in the near future but
do not need it at the time of deployment.

Sharding Requirements

• A shard contains a subset of sharded data for a
sharded cluster.

• Together, the cluster's shards hold the entire data
set for the cluster.

• As of MongoDB 3.6, shards must be deployed as a
replica set to provide redundancy and high
availability.

Advantages of Sharding
1. Reads / Writes

• MongoDB distributes the read and write workload across
the shards in the sharded cluster, allowing each shard to process a
subset of cluster operations.

• Both read and write workloads can be scaled horizontally across
the cluster by adding more shards.

• For queries that include the shard key or the prefix of a compound
shard key, mongos can target the query at a specific shard or set of
shards. These targeted operations are generally more efficient
than broadcasting to every shard in the cluster.
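
The difference between a targeted and a broadcast query can be sketched with a toy router. This is a simplified illustration, not how mongos is implemented: chunkTable stands in for the chunk metadata held by the config servers, and the shard names, the key ranges, and the choice of id as shard key are all assumptions.

```javascript
// Toy query router. chunkTable plays the role of the config-server metadata;
// shard names and key ranges are invented for this example.
const chunkTable = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" }
];

// Returns the shards that must be contacted for a query.
// The shard key is assumed to be the numeric field "id".
function routeQuery(query) {
  if (typeof query.id === "number") {
    // Targeted: exactly one chunk covers the key (min inclusive, max exclusive).
    const chunk = chunkTable.find(c => query.id >= c.min && query.id < c.max);
    return [chunk.shard];
  }
  // No shard key in the query: broadcast to every shard.
  return [...new Set(chunkTable.map(c => c.shard))];
}

const targeted = routeQuery({ id: 151 });
const broadcast = routeQuery({ city: "Pune" });
console.log(targeted, broadcast);
```

A query on id touches one shard, while a query on city must be sent to all three, which is why the choice of shard key matters for read scaling.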
Advantages of Sharding

2. Storage Capacity

• Sharding distributes data across the shards in the cluster,
allowing each shard to contain a subset of the total cluster
data.

• As the data set grows, additional shards increase the
storage capacity of the cluster.

Advantages of Sharding

3. High Availability
• A sharded cluster can continue to perform partial read /
write operations even if one or more shards are
unavailable.

• While the subset of data on the unavailable shards cannot
be accessed during the downtime, reads or writes
directed at the available shards can still succeed.

4. Worldwide Distribution
• Sharding is useful for applications distributed worldwide, where
communication links between the data centers would
otherwise be a bottleneck.
Disadvantages of Sharding:

• Increased complexity of the query language.

• Increased system complexity due to partitioning, load balancing,
coordination, ensuring integrity, etc.

• Increased operational complexity: adding / removing
indexes, adding / deleting fields (columns of a collection), etc.

MongoDB Processes and configuration

• mongod – database instance

• mongos –
– Sharding process
– Analogous to a database router
– Processes all requests
– Decides how many and which mongods should receive the query
– Collates the results and sends them back to the client

• mongo – an interactive shell (a client)
– Fully functional JavaScript environment for use with MongoDB
Terminology and Concepts
SQL Terms / Concepts | MongoDB Terms / Concepts
Database | Database
Table | Collection
Row | Document
Column | Field
Index | Index
Primary key | Primary key
How To Install MongoDB?
• To install MongoDB, first download MongoDB
Community Server version 4.2.12 (zip file) for your
operating system from:
https://www.mongodb.com/try/download/community

Select Zip
Connectivity Between Client and
Server

For MongoDB 4.2.12

1. Server Configuration:
First create an empty folder on any drive to store the
database.
Then start the MongoDB server with mongod from the
Command Prompt.
Syntax for starting the server:
<path_of_mongodb_bin>\mongod.exe --dbpath
<path_of_newly_created_folder>
2. Client Configuration:
• Client: <path_of_mongodb_bin>\mongo.exe
Connectivity Between Client and
Server

Example: If MongoDB is extracted to the G:\ drive (copy
the path up to the bin folder) and a new folder
(My_MongoDB) is created on the E:\ drive, then the server
can be started from the command prompt as follows:
G:\mongodb-win32-x86_64-2012plus-4.2.12\mongodb-win32-x86_64-2012plus-4.2.12\bin>mongod.exe --dbpath E:\My_MongoDB
Connectivity Between Client and
Server

Client Configuration:
G:\mongodb-win32-x86_64-2012plus-4.2.12\mongodb-win32-x86_64-2012plus-4.2.12\bin>mongo.exe
Executables

Role | MongoDB | MySQL | Oracle | Informix | DB2
Database Server | mongod | mysqld | oracle | IDS | DB2 Server
Database Client | mongo | mysql | sqlplus | DB-Access | DB2 Client
MongoDB CRUD Operations

• Create Operations
• Read Operations
• Update Operations
• Delete Operations


Basic commands

show dbs
Print a list of all databases on the server.


Basic commands

> use myfirstdb

• Switch the current database to myfirstdb. The mongo shell
variable db is set to the current database.

Basic commands

> show collections

• Print a list of all collections for current database.


Creating a Collection

• Syntax:
> db.createCollection("collection_name")
Creates a new collection with the given name.
Suppose we want to store student information;
we can create a Student_info collection as
follows:
> db.createCollection("Student_info")
Inserting Documents in a Collection

> db.collectionname.insert({key1: value, key2: value, ...})
Inserts a document or documents into a collection.
> db.Student_info.insert({id: "151", name: "Vasundhara", city: "Pune"})

Display documents

> db.collectionname.find()

Returns all documents from a collection, with all fields
of each document.
> db.Student_info.find()
{ "_id" : ObjectId("6051ea45f50ee3d1871c3398"), "id" : "151",
"name" : "Vasundhara", "city" : "Pune" }

Other operations
• Insert one document:
• db.collection_name.insert({})
• Insert multiple documents:
• db.collection_name.insert([{document1}, {document2}, {document3}])

• Display all documents:
• db.collection_name.find()
• db.collection_name.find().pretty() // to display in structured
format

CRUD operations

▪ Update →
Syntax: db.collection.update( <query>, <update>, <options> )
Example:
db.Student_info.update(
{ name: "Pankaj" },
{ $set: { age: 24 } }
)

Update Operation : Upsert

db.collection.update( <query>, <update>, { upsert: true } )

upsert: Optional. If set to true, creates a new document when no document
matches the query criteria. The default value is false, which does not
insert a new document when no match is found.

Example:
> db.Student_info.update({name:"Ruhi"},{$set:{age:23}},{upsert:true})

Update Operation : multi

db.collection.update( <query>, <update>, { multi: true } )

multi: Optional. If set to true, updates all documents that meet the query
criteria. If set to false, updates one document. The default value is
false.

Example: Assume there are 2 documents with the name Jyoti. To update the city
value for both, we can use the following command:

> db.Student_info.update({name:"Jyoti"},{$set:{city:"Bangalore"}},{multi:true})
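
The effect of the upsert and multi options can be imitated on a plain array. This is a sketch under stated assumptions, not MongoDB's implementation: toyUpdate is a hypothetical helper that handles only equality queries and the $set operator, and the sample documents are invented.

```javascript
// In-memory sketch of update(query, {$set: ...}, {upsert, multi}).
// toyUpdate is hypothetical; only equality queries and $set are handled.
function toyUpdate(docs, query, update, options = {}) {
  const matches = docs.filter(d =>
    Object.keys(query).every(k => d[k] === query[k])
  );
  if (matches.length === 0) {
    if (options.upsert) {
      // No match and upsert requested: insert a document built from query + $set.
      docs.push({ ...query, ...update.$set });
      return { matched: 0, upserted: 1 };
    }
    return { matched: 0, upserted: 0 };
  }
  // Without multi, only the first matching document is updated.
  const targets = options.multi ? matches : matches.slice(0, 1);
  for (const d of targets) Object.assign(d, update.$set);
  return { matched: targets.length, upserted: 0 };
}

const students = [
  { name: "Jyoti", city: "Pune" },
  { name: "Jyoti", city: "Mumbai" }
];
const multiResult = toyUpdate(
  students, { name: "Jyoti" }, { $set: { city: "Bangalore" } }, { multi: true }
);
const upsertResult = toyUpdate(
  students, { name: "Ruhi" }, { $set: { age: 23 } }, { upsert: true }
);
console.log(multiResult, upsertResult, students);
```

With multi both Jyoti documents change; with upsert the missing Ruhi document is created from the query fields plus the $set fields.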


Delete operation

▪ Delete →
Syntax: db.collection.remove( <query>, <justOne> )
▪ collection is the name of the collection (the ‘table’) from
which matching documents are removed. <justOne> is optional;
when true, at most one matching document is removed.
Example:
> db.Student_info.remove({name:"Pankaj"})
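
The effect of the justOne flag of remove can be imitated on a plain array. toyRemove is a hypothetical helper written for this sketch, it supports equality queries only, and the sample documents are invented.

```javascript
// In-memory sketch of remove(query, justOne); equality queries only.
// toyRemove is a hypothetical helper, not a MongoDB function.
function toyRemove(docs, query, justOne = false) {
  let removed = 0;
  for (let i = 0; i < docs.length; ) {
    const isMatch = Object.keys(query).every(k => docs[i][k] === query[k]);
    if (isMatch && !(justOne && removed === 1)) {
      docs.splice(i, 1);   // delete in place; do not advance the index
      removed++;
    } else {
      i++;
    }
  }
  return removed;
}

const roster = [
  { name: "Pankaj", age: 24 },
  { name: "Pankaj", age: 30 },
  { name: "Ruhi", age: 23 }
];
const removedOnce = toyRemove(roster, { name: "Pankaj" }, true); // only the first match
const removedRest = toyRemove(roster, { name: "Pankaj" });       // every remaining match
console.log(removedOnce, removedRest, roster);
```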

List of Other Commands
• Use of Regular Expressions:
• { <field>: { $regex: /pattern/, $options: '<options>' } }

• Pattern Matching:
> db.Student_info.find({name:{$regex:"^V.*"}}) // names starting with 'V'

> db.Student_info.find({name:{$regex:"a"}}) // names having the substring 'a'

> db.Student_info.find({name:{$regex:/i$/}}) // names ending with 'i'

> db.Student_info.find({name:{$regex:/^s.*m$/}}) // names starting with 's' and ending with 'm'

> db.Student_info.find({name:{$regex:new RegExp("^J.*i$","i")}}) // names starting
with J and ending with i, case-insensitive match
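
The patterns above can be tried outside the database with JavaScript's own RegExp. This is a sketch: the names array is made-up sample data, and MongoDB evaluates $regex with its own regular expression engine, but simple patterns like these behave the same way in both.

```javascript
// Trying the same patterns on a plain array of names (sample data is hypothetical).
const names = ["Vasundhara", "Jyoti", "Ruhi", "Pankaj", "sam"];

const startsWithV = names.filter(n => /^V.*/.test(n));   // starting with 'V'
const containsA = names.filter(n => /a/.test(n));        // having substring 'a'
const endsWithI = names.filter(n => /i$/.test(n));       // ending with 'i'
const sToM = names.filter(n => /^s.*m$/.test(n));        // starting with 's', ending with 'm'
// The "i" flag makes the match case-insensitive.
const jToI = names.filter(n => new RegExp("^J.*i$", "i").test(n));

console.log(startsWithV, containsA, endsWithI, sToM, jToI);
```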
List of Other Commands
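Querying Arrays:
Insert documents with array fields (pass both documents in one array):
> db.Student_info.insert([{name:"Pooja",marks:[56,69,75]},{name:"Ajay",age:20,marks:[80,90,65]}])
Note: if the two documents are passed as separate arguments instead of
one array, only the first is inserted and the shell reports
WriteResult({ "nInserted" : 1 }).

Find documents whose marks array contains all the listed values:
> db.Student_info.find( { marks: {$all: [40] }} )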
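
The meaning of $all (the array field must contain every listed value) can be sketched in plain JavaScript. matchesAll is a hypothetical helper written for this example; the two documents are the ones from the insert shown in the slide.

```javascript
// Sketch of $all semantics: the array field must contain every listed value.
// matchesAll is a hypothetical helper, not a MongoDB function.
function matchesAll(doc, field, values) {
  return Array.isArray(doc[field]) && values.every(v => doc[field].includes(v));
}

const docsWithMarks = [
  { name: "Pooja", marks: [56, 69, 75] },
  { name: "Ajay", age: 20, marks: [80, 90, 65] }
];

// Neither document contains 40, so this query matches nothing.
const with40 = docsWithMarks
  .filter(d => matchesAll(d, "marks", [40]))
  .map(d => d.name);
// Only Pooja's marks contain both 56 and 75.
const with56and75 = docsWithMarks
  .filter(d => matchesAll(d, "marks", [56, 75]))
  .map(d => d.name);

console.log(with40, with56and75);
```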
