
BDT Unit 02 - Part1

The CET4001B Big Data Technologies course aims to provide students with an understanding of Big Data concepts, NoSQL databases, and application design for distributed systems. It covers various database types, including centralized, distributed, relational, and NoSQL databases, along with their advantages and limitations. The course also emphasizes the use of Big Data visualization tools and the application of these technologies in real-world scenarios.

CET4001B Big Data Technologies
Department of Computer Engineering and Technology

1/16/2024 Big Data Analytics Lab 1


CET4001B Big Data Technologies

Teaching Scheme Credits: 02 + 01

Theory: 3 Hrs / Week Practical: 2Hrs/Week

Course Objectives:
• Understand the various aspects of Big Data.
• Learn the concepts of NoSQL for Big Data.
• Design an application for distributed systems on Big Data.
• Explore the various Big Data visualization tools.

Course Outcomes:
• Apply the insights of Big Data in business applications.
• Illustrate the application of MongoDB in real-world applications.
• Build Hadoop-based distributed systems for real-world problems.
• Apply and utilize Big Data visualization tools for real-world applications.


Unit-II: NoSQL databases for Big Data
• Types of databases
• Structured versus unstructured data
• NoSQL movement and concept of NoSQL database
• Comparative study of SQL and NoSQL
• Types and examples of NoSQL databases: key-value store, document store, columnar databases, graph databases
• Characteristics of NoSQL
• NoSQL data modelling
• Advantages of NoSQL
• CAP theorem
• BASE properties
• Sharding: characteristics, advantages, types
• NoSQL using MongoDB: MongoDB shell, data types, CRUD operations, querying, aggregation framework operators, indexing


Types of Databases

Centralized Database

A type of database that stores data in a single, centralized database system.

It lets users access the stored data from different locations through several applications, which include an authentication process so that users can access the data securely.

Example of a centralized database: a central library that maintains a central database of every library in a college/university.
Advantages of Centralized Database
• It decreases the risk in data management, i.e., manipulation of data will not affect the core data.
• Data consistency is maintained as it manages data in a central repository.
• It provides better data quality, which enables organizations to establish data standards.
• It is less costly because fewer vendors are required to handle the data sets.
Limitations of Centralized Database
• The size of the centralized database is large, which increases the response time for fetching the data.
• It is not easy to update such an extensive database system.
• If any server failure occurs, the entire data will be lost, which could be a huge loss.
Distributed Database System
• Data is stored at a number of sites; each site logically consists of a single processor.
• Processors at different sites are interconnected by a computer network, so these are not multiprocessors (those are parallel database systems).
• A distributed database is a database, not a collection of files: the data is logically related, as exhibited in the users’ access patterns (e.g. the relational data model).
• A D-DBMS is a full-fledged DBMS, not a remote file system and not a TP system.


Distributed Database

– Unlike a centralized database system, data is distributed among different database systems of an organization.
– These database systems are connected via communication links, which help the end users to access the data easily.
– Examples: Apache Cassandra, HBase, Ignite, etc.
Distributed Database Types

Homogeneous DDB:
• All sites execute on the same operating system,
• use the same application process, and
• carry the same hardware devices.

Heterogeneous DDB:
• Sites execute on different operating systems,
• under different application procedures, and
• carry different hardware devices.
Advantages

• Increased reliability and availability
– Reliability is defined as the probability that a system is running at a certain time, whereas availability is defined as the probability that the system is continuously available during a time interval.
• Easier expansion
– In a distributed environment, expanding the system by adding more data, increasing database sizes, or adding more processors is much easier.
• Improved performance
– We can achieve inter-query parallelism by executing multiple queries at different sites, and intra-query parallelism by breaking up a query into a number of sub-queries that execute in parallel. Both lead to improved performance.


Advantages of Distributed Database

Modular development is possible, i.e., the system can be expanded by including new computers and connecting them to the distributed system.

One server failure will not affect the entire data set.

A user doesn’t need to know where the data is located physically; the data is presented to the user as if it were located locally.

Data can be joined and updated from different tables which are located on different machines.

It is secure.
Disadvantages
• A distributed database is quite complex.
• It is more expensive because it is complex and hence difficult to maintain.
• As it is a distributed system, the database needs to be more secure, and each and every node should be secure as well.
• Data integrity is harder to maintain and data redundancy is harder to control.


Limitations of Distributed Databases
• Network traffic is increased in a distributed database.
• Different data formats are used in different systems.
• Managing distributed deadlock is a difficult task.
• While recovering a failed system, the DBMS has to make sure that the recovered system is consistent with other systems.
Relational Database
• Based on the relational data model, which stores data in the form of rows (tuples) and columns (attributes) that together form a table (relation).
• Uses SQL for storing, manipulating, as well as maintaining the data.
• E.F. Codd proposed the relational model in 1970.
• Each table carries a key that makes each row unique from the others.
• Examples: MySQL, Microsoft SQL Server, Oracle, etc.

Relational Database Representation
Cloud Database
• Data is stored in a virtual environment and executes over the cloud computing platform.
• Provides users with various cloud computing services (SaaS, PaaS, IaaS, etc.) for accessing the database.
• [[ Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) ]]
Cloud Database
• There are numerous cloud platforms; some popular options are:
– Amazon Web Services (AWS)
– Microsoft Azure
– Kamatera
– phoenixNAP
– ScienceSoft
– Google Cloud SQL, etc.
Object-oriented Database

Uses the object-based data model approach for storing data in the database system.

Data is represented and stored as objects, which are similar to the objects used in object-oriented programming languages.

Examples: Smalltalk in GemStone, LISP in Gbase.

Object-oriented Database Representation
Hierarchical Database

Stores data in the form of parent-children relationship nodes.

It organizes data in a tree-like structure.

Mandates that each child record has only one parent, whereas each parent record can have one or more child records.

In order to retrieve data from a hierarchical database, the whole tree needs to be traversed starting from the root node.

Examples: IBM Information Management System (IMS), RDM Mobile.
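The root-first retrieval described for hierarchical databases can be sketched in a few lines of Python. This is only an illustration: the `Record` class, the `find_path` helper, and the sample university data are hypothetical names chosen for the example, not part of IMS or any other product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """One node of the hierarchy: a name plus zero or more child records."""
    name: str
    children: List["Record"] = field(default_factory=list)

    def add_child(self, child: "Record") -> "Record":
        # Each child is attached to exactly one parent.
        self.children.append(child)
        return child


def find_path(root: Record, target: str) -> Optional[List[str]]:
    """Depth-first search from the root; returns the path of names to target."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        sub_path = find_path(child, target)
        if sub_path is not None:
            return [root.name] + sub_path
    return None


# Hypothetical sample hierarchy.
university = Record("University")
engineering = university.add_child(Record("Engineering"))
science = university.add_child(Record("Science"))
engineering.add_child(Record("Computer"))
science.add_child(Record("Physics"))

path = find_path(university, "Physics")
print(path)
```

Every lookup starts at the root and walks down, which is why retrieval cost grows with the depth and breadth of the tree.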
Network Database

– Typically follows the network data model.
– The representation of data is in the form of nodes connected via links between them.
– Unlike the hierarchical database, it allows each record to have multiple children and parent nodes to form a generalized graph structure.

• Some well-known database systems that use the network model include:
– Raima Database Manager
– Integrated DMS (IDMS)
– TurboIMAGE
NoSQL Database

Not Only SQL:
• A type of database that is used for storing a wide range of data sets.
• Non-relational.
• Presents a wide variety of database technologies in response to the demands.
• Schema-less.
• Relaxes one or more of the ACID properties (explained under the CAP theorem).
Motivation for NoSQL Database

Limitations of relational databases

Not suitable for distributed applications because:
• Joins are expensive
• Hard to scale horizontally
• Can’t handle unstructured data
• Expensive: product cost, hardware maintenance
• High availability not possible
• No partition tolerance
• Lower speed
Structured vs Unstructured

• Structured systems are those where the activity of processing and output is predetermined and highly organized.
• ATM transactions, airline reservations, manufacturing inventory control systems, and point of sale systems are all forms of structured systems.
• Unstructured systems are those that have little or no predetermined form or structure.
• Unstructured systems include email, reports, contracts, and other communications.
• The rules of unstructured systems are fewer and less complex.

Analyzing unstructured data requires processing huge amounts of data, which SQL databases are poorly suited to and were never designed for.

NoSQL databases have evolved to handle this huge data properly.


Data Structures
• Data comes in multiple forms, including structured and unstructured.
• 80–90% of future data growth is coming from non-structured data types.

Data Structures (contd.)
• Structured data: Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
• Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
• Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.
Types of data in Big Data Scenario
• Structured data:
■ Is highly organized and formatted so that it is easily searchable in relational databases.
■ Common relational database applications with structured data: airline reservation systems, inventory control, sales transactions, and ATM activity.

• Unstructured data:
■ Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze.
■ Unstructured data has internal structure but is not structured via pre-defined data models or schema.
■ It may be textual or non-textual, and human- or machine-generated.
■ It may also be stored within a non-relational database like NoSQL.
Types of data in Big Data Scenario

• Semi-structured data:
■ Maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies.
■ Both documents and databases can be semi-structured.
■ This type of data only represents about 5-10% of the structured/semi-structured/unstructured data pie.
■ Typical use is in OO models.
■ Examples: CSV, XML, JSON
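To make the three example formats concrete, the sketch below reads the same record from CSV, XML, and JSON text using only Python's standard library. The sample record (an `id` and a `language` field) is made up for illustration; the point is that tags, headers, or keys inside the data itself are what let a parser recover the fields.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three textual formats.
csv_text = "id,language\n01,Java\n"
xml_text = "<book><id>01</id><language>Java</language></book>"
json_text = '{"id": "01", "language": "Java"}'

# CSV: the header row names the fields.
from_csv = dict(next(csv.DictReader(io.StringIO(csv_text))))

# XML: element tags identify each data element.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

# JSON: keys identify each value.
from_json = json.loads(json_text)

print(from_csv, from_xml, from_json)
```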
Difference between Structured and Unstructured Data

Motivation for NoSQL Database
• Relational databases are not suitable for distributed computing.

Performance of RDBMS for various applications
Addressing system growth
• Every database has to be scaled to address the huge amount of data being generated each day.
• This makes data available at all times for users.
• When the memory of the database is drained, or when it cannot handle multiple requests, it is not scalable.
• Scalability: the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
• Vertical scaling and horizontal scaling
Scaling databases

• Elasticity: the degree to which a system can adapt to workload changes by provisioning and de-provisioning resources in an on-demand manner, such that at each point in time the available resources match the current demand as closely as possible.

• Types of scaling: vertical and horizontal.
Vertical Scaling

Vertical scaling: adopted when the database cannot handle the large amount of data.
• Example: Suppose you have a database server with 10 GB of memory and it is exhausted. Now, to handle more data, you buy an expensive server with 2 TB of memory.
• It involves adding more power, such as CPU and disk power, to enhance your storage process.
• Applicable to applications involving a limited range of users and minimal querying.
• Application: Relational databases mostly use vertical scaling.
Vertical Scaling

Advantages:
• Simple, since everything exists in a single server. No need to manage multiple instances.
• Performance gain, because you have faster RAM and memory power on each update.
• Same code. No change: you need not change your implementation or your code at all.

Limitations:
● Difficult to perform multiple queries simultaneously.
● Chances of downtime are high when the server exceeds maximum load.
● Expensive. Hardware resources are costly, after all.
Horizontal Scaling

Horizontal scaling: scaling of the server horizontally by adding more machines.
• It divides the data set and distributes the data over multiple servers, or shards.
• Each shard is an independent database.
• Instead of buying a single 2 TB server, we buy two hundred 10 GB servers.
• If your application can allow redundancy and involves fewer joins, then you can use horizontal scaling.
• Applications: NoSQL databases mostly use horizontal scaling. It is less suitable for RDBMS, as they rely on strict consistency and atomicity rules.
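A minimal sketch of how a data set can be divided over shards, assuming a simple hash-based placement rule. The `ShardedStore` class, the MD5-modulo rule, and the shard count of 4 are assumptions made for this illustration; real systems use their own partitioning schemes.

```python
import hashlib


class ShardedStore:
    """Toy horizontally scaled store: each shard is an independent dict."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key: str) -> int:
        # A stable hash of the key decides which shard owns the record.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key: str, value) -> None:
        self.shards[self.shard_for(key)][key] = value

    def get(self, key: str):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

# Each shard holds only a subset of the 1000 records.
sizes = [len(shard) for shard in store.shards]
print(sizes)

# Capacity arithmetic from the slide: 200 servers x 10 GB each.
total_gb = 200 * 10
print(total_gb)
```

A read or write for one key touches a single shard, which is what keeps per-server load low; a join across keys on different shards would need cross-server communication.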
Horizontal Scaling

Advantages:
● Cheap compared to vertical scaling.
● Lesser load, better performance.
● Chances of downtime are less.
● Resilience and fault tolerance.
● Suitable for distributed databases.

Limitations:
● Making joins is difficult, due to cross-server communication.
● Only eventual consistency is possible.
● It may not be best suited for bank transactions.
Horizontal and Vertical scaling
What is NoSQL?

NoSQL is a non-relational database management system.

Designed for distributed data stores where very large-scale data storage needs to be available
• e.g. Google or Facebook, which collect terabytes of data every day for their users

These data stores may not require fixed-table schemas, usually avoid join operations, and typically scale horizontally.

“Non-relational” may be a more accurate term than “NoSQL”, as some NoSQL DBs do support a subset of SQL.
Characteristics of NoSQL Databases

• NoSQL avoids:
– Overhead of ACID transactions
– Complexity of SQL queries
– Burden of schema design
– DBA presence
• Provides:
– Easy and frequent changes to DB
– Fast development
– Can handle large data volume (Big Data applications)
– Schema-less
When and when not to use NoSQL

NoSQL: Applications and Popularity

Schema-less Data Model
• No fixed schema to consider
• No implicit datatypes
• Most considerations done at the application layer, including transactions
• All aggregate data is gathered in documents.
CAP – NoSQL Data models

Three basic requirements which exist in a special relation when designing applications for a distributed architecture:

• Consistency: the data in the database remains consistent after the execution of an operation.
• Availability: the system is always on (service guarantee availability), no downtime.
• Partition Tolerance: the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.

Generally, it is not possible to fulfil all three requirements in a distributed system. The CAP theorem states that a distributed system can follow only two of the three requirements.

Distributed systems must be partition tolerant (P), so current NoSQL databases follow different combinations of C and A from the CAP theorem.
CAP Theorem

• Acronym for Consistency, Availability and Partition Tolerance
– Consistency: once data is written, all future read requests will contain that data.
– Availability: the database is always available and responsive.
– Partition Tolerance: if part of the database is unavailable, other parts are not affected.
CAP Theorem
CA: Consistent and Available
Example: a standalone MySQL server/node which has no replication. It provides consistency and availability till it goes down.
Applications: bank account balances and text messages, which require higher consistency. RDBMS are CA systems.
AP: Available and Partition Tolerant
Example: a distributed NoSQL database where replication to nodes happens asynchronously. The system will always respond, but not all the nodes will have the latest version of the data when queried.
Applications: e-commerce sites, which focus on high availability in case of partitions in a distributed environment by trading off consistency.
CP: Consistent and Partition Tolerant
Similar to CA systems, but the difference is its applicability to a distributed environment.
In MongoDB the primary node is replicated into secondary nodes. If the primary node fails, the system switches to a secondary node. During this switch, data is not made available to the user.
CAP – NoSQL Data models
• CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
• CP - Some data may not be accessible, but the rest is still consistent/accurate.
• AP - System is still available under partitioning, but some of the data returned may be inaccurate.
CAP Theorem

No distributed system is safe from network failures, thus network partitioning generally has to be tolerated.

In the presence of a partition, one is then left with two options: consistency or availability.

When choosing consistency over availability: the system will return an error or a time out if particular information cannot be guaranteed to be up to date due to network partitioning.

When choosing availability over consistency: the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.
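The two choices under a partition can be sketched with a toy two-node system. This is a deliberately simplified model, not how any real database is implemented: `TwoNodeSystem`, its `mode` flag, and the `partitioned` switch are hypothetical names, and writes are assumed to always reach the primary.

```python
class Replica:
    def __init__(self) -> None:
        self.data = {}


class TwoNodeSystem:
    """Toy primary/secondary pair that is either CP or AP during a partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.primary = Replica()
        self.secondary = Replica()
        self.partitioned = False

    def write(self, key, value) -> None:
        self.primary.data[key] = value
        # Replication to the secondary only works while the link is up.
        if not self.partitioned:
            self.secondary.data[key] = value

    def read_from_secondary(self, key):
        if self.partitioned and self.mode == "CP":
            # Consistency over availability: refuse rather than risk stale data.
            raise TimeoutError("secondary cannot confirm it has the latest value")
        # Availability over consistency: answer with whatever is stored locally.
        return self.secondary.data.get(key)


def demo(mode: str):
    system = TwoNodeSystem(mode)
    system.write("balance", 100)   # replicated to both nodes
    system.partitioned = True      # the network link fails
    system.write("balance", 50)    # reaches only the primary
    try:
        return system.read_from_secondary("balance")
    except TimeoutError as exc:
        return f"error: {exc}"


cp_result = demo("CP")
ap_result = demo("AP")
print(cp_result)
print(ap_result)
```

In CP mode the read fails instead of returning the outdated balance; in AP mode the read succeeds but returns the value from before the partition.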
CAP Theorem

In the absence of network failure (that is, when the distributed system is running normally) both availability and consistency can be satisfied.

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.
CAP Theorem

A network partition refers to network decomposition into relatively independent subnets for their separate optimization, as well as a network split due to the failure of network devices. In both cases the partition-tolerant behavior of subnets is expected.

Partition tolerance, or robustness, means that a given system continues to operate even with data loss or node failure.

A single node failure should not cause system failure.
CAP Theorem
• Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability.

• Systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.

• [[ Network partition refers to network decomposition into relatively independent subnets for their separate optimization as well as network split due to failure of network devices ]]
Limitations of CAP theorem
▪ When companies such as Google and Amazon were designing large-scale databases, 24/7 availability was key.
▪ A few minutes of downtime means lost revenue.
▪ When horizontally scaling databases to 1000s of machines, the likelihood of a node or a network failure increases tremendously.
▪ Therefore, in order to have strong guarantees on Availability and Partition Tolerance, they had to sacrifice “strict” Consistency (implied by the CAP theorem).
▪ The CAP theorem proves that it is impossible to guarantee strict Consistency and Availability while being able to tolerate network partitions.
▪ This resulted in databases with relaxed ACID guarantees.
Trading-Off Consistency

▪ Maintaining consistency should balance between the strictness of consistency versus availability/scalability.
▪ Good-enough consistency depends on your application.

Loose Consistency: easier to implement, and is efficient.
Strict Consistency: generally hard to implement, and is inefficient.
The BASE Properties

▪ To overcome the limitations of the CAP theorem, the BASE properties were introduced and implemented by such databases as follows:
▪ Basically Available: the system guarantees Availability
▪ Soft-State: the state of the system may change over time
▪ Eventual Consistency: the system will eventually become consistent
The BASE Properties

▪ Horizontally scalable databases use BASE properties.

Basically Available: the system guarantees Availability
▪ BASE databases spread data across many storage systems with a high degree of replication, which guarantees availability.
▪ In the unlikely event that a failure disrupts access to a segment of data, this does not necessarily result in a complete database outage.
Soft-State: the state of the system may change over time
▪ Values stored in the system may change because of the eventual consistency model.
Eventual Consistency: the system will eventually become consistent
▪ As data is added to the system, the system’s state is gradually replicated across all nodes. During the short period of time before all updated blocks are replicated, the state of the file system isn’t consistent.

By sacrificing Permanent Consistency in favor of Eventual Consistency, developers enable Horizontal Scalability.
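Eventual consistency can be illustrated with a toy store in which a write lands on one replica immediately and reaches the others later. The `EventuallyConsistentStore` class and its pending-update queue are assumptions made for this sketch; real systems replicate over a network with their own protocols.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy store: replica 0 accepts writes, the rest catch up asynchronously."""

    def __init__(self, num_replicas: int) -> None:
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pending = deque()  # updates not yet applied to other replicas

    def write(self, key, value) -> None:
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, replica: int, key):
        # Soft state: the answer depends on which replica is asked, and when.
        return self.replicas[replica].get(key)

    def replicate_one(self) -> bool:
        """Apply one queued update; returns False when nothing is pending."""
        if not self.pending:
            return False
        index, key, value = self.pending.popleft()
        self.replicas[index][key] = value
        return True

    def is_consistent(self) -> bool:
        return all(replica == self.replicas[0] for replica in self.replicas)


store = EventuallyConsistentStore(num_replicas=3)
store.write("x", 1)

stale_read = store.read(2, "x")              # replica 2 has not caught up yet
consistent_before = store.is_consistent()

while store.replicate_one():                 # background replication completes
    pass

fresh_read = store.read(2, "x")
consistent_after = store.is_consistent()
print(stale_read, consistent_before, fresh_read, consistent_after)
```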
BASE Transactions

Acronym contrived to be the opposite of ACID:
• Basically Available,
• Soft state,
• Eventually Consistent

Characteristics
• Weak consistency – stale data OK
• Availability first
• Best effort
• Simpler and faster
ACID - BASE

ACID:
• Atomicity
• Consistency
• Isolation
• Durability

BASE:
• Basically Available (CP)
• Soft-state
• Eventually consistent (AP)
Data Models and Types of NoSQL DBs

NoSQL Data Model

Types of NoSQL databases

Distributed Key-Value Systems - Lookup a single value for a key
• Amazon’s Dynamo

Document-based Systems - Access data by key or by search of “document” data.
• CouchDB
• MongoDB

Column-based Systems
• Google’s BigTable
• Facebook’s Cassandra

Graph-based Systems - Use a graph structure
• Google’s Pregel
• Neo4j

NoSQL Database Types
Key-Value Pair (KVP) Stores

Access data (values) by strings called keys.

Data has no required format – data may have any format.

Extremely simple interface
• Data model: (key, value) pairs
• Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)

Implementation: efficiency, scalability, fault-tolerance
• Records distributed to nodes based on key
• Replication
• Single-record transactions, “eventual consistency”

Example systems
• Amazon Dynamo

“Value” is stored as a “blob”
- Without caring or knowing what is inside
- Application is responsible for understanding the data

In simple terms, a NoSQL Key-Value store is a single table with two columns: one being the (Primary) Key, and the other being the Value.
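The four basic operations can be sketched as a thin wrapper around a dictionary. This is an in-memory illustration only: `KeyValueStore` and its method names are hypothetical, and `update` is assumed here to take the new value as a second argument. Values are kept as opaque bytes to mirror the "blob" idea, so the application does the decoding.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy key-value store: one 'table' mapping a string key to a bytes blob."""

    def __init__(self) -> None:
        self._table: Dict[str, bytes] = {}

    def insert(self, key: str, value: bytes) -> None:
        if key in self._table:
            raise KeyError(f"key already exists: {key}")
        self._table[key] = value

    def fetch(self, key: str) -> Optional[bytes]:
        return self._table.get(key)

    def update(self, key: str, value: bytes) -> None:
        if key not in self._table:
            raise KeyError(f"no such key: {key}")
        self._table[key] = value

    def delete(self, key: str) -> None:
        self._table.pop(key, None)


store = KeyValueStore()

# The store does not know or care that one value is JSON and the other is not.
store.insert("user:1", json.dumps({"name": "Asha", "city": "Pune"}).encode("utf-8"))
store.insert("session:9", b"raw-token-bytes")

# The application is responsible for interpreting the blob.
profile = json.loads(store.fetch("user:1").decode("utf-8"))
print(profile)

store.update("session:9", b"new-token")
store.delete("user:1")
```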

Each record may have a different


NoSQL Database Types: Key-value stores

Simplest type of database storage.

Stores a single item as a key (or attribute name) holding its value, together.

Like a hash.

Does not require a specific data format; the value can be in any format.

Data model: (key, value) pairs.

Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key)

Examples: Amazon DynamoDB, Redis, Riak.

Key-Value Store
Column-based Data Model
• Column family:
– A column is the smallest instance of data.
– It is a tuple containing a name, a value, and a timestamp.
• Examples: Apache Cassandra, used by Facebook
Column-based Data Model

This type of data store is good for:

(1) Distributed data storage, especially versioned data because of the time-stamps.

(2) Large-scale, batch-oriented data processing: sorting, parsing, conversion, algorithmic crunching, etc.

(3) Exploratory and predictive analytics – Business Intelligence.
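The link between timestamps and versioned data can be sketched as follows. The `Column` tuple follows the name/value/timestamp description of a column; the `ColumnFamily` class, its `as_of` parameter, and the integer timestamps are assumptions for this illustration and do not reflect Cassandra's actual API.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class Column(NamedTuple):
    """Smallest unit of data: a name, a value, and a timestamp."""
    name: str
    value: Any
    timestamp: int


class ColumnFamily:
    """Toy column family: each row key maps to a list of column versions."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[Column]] = {}

    def put(self, row_key: str, name: str, value: Any, timestamp: int) -> None:
        # Old versions are kept, so the history of a column stays readable.
        self.rows.setdefault(row_key, []).append(Column(name, value, timestamp))

    def get(self, row_key: str, name: str, as_of: Optional[int] = None) -> Any:
        versions = [
            column for column in self.rows.get(row_key, [])
            if column.name == name and (as_of is None or column.timestamp <= as_of)
        ]
        if not versions:
            return None
        # The version with the highest timestamp wins.
        return max(versions, key=lambda column: column.timestamp).value


users = ColumnFamily()
users.put("user1", "city", "Pune", timestamp=1)
users.put("user1", "city", "Mumbai", timestamp=5)

latest_city = users.get("user1", "city")
city_at_3 = users.get("user1", "city", as_of=3)
print(latest_city, city_at_3)
```

Rows need not share the same set of columns: a column that was never written for a row simply has no versions.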
Graph Database Systems

Data model: nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles

Graph Databases
❖ Based on graph theory
❖ Vertical scaling
❖ No clustering
❖ Transactions exist
❖ ACID followed
❖ Examples: Neo4j, Amazon Neptune, OrientDB, Dgraph.
Graph Databases

• In general, graph databases are useful when you are more interested in relationships between data than in the data itself: for example, in representing and traversing social networks, generating recommendations, or conducting forensic investigations (e.g. pattern detection).
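A recommendation query over relationships can be sketched with plain Python structures: nodes carrying properties, labelled edges, and a friend-of-friend traversal. The sample people, the `FRIEND` label, and the `recommend_friends` helper are made up for this example; a graph database would express the same traversal in its own query language.

```python
from collections import defaultdict

# Nodes have an ID and properties; edges have a label.
nodes = {
    1: {"name": "Asha"},
    2: {"name": "Ravi"},
    3: {"name": "Meera"},
    4: {"name": "John"},
}
edges = [
    (1, "FRIEND", 2),
    (2, "FRIEND", 3),
    (2, "FRIEND", 4),
    (1, "FRIEND", 4),
]

# Build an undirected adjacency index over FRIEND edges.
adjacency = defaultdict(set)
for source, label, target in edges:
    if label == "FRIEND":
        adjacency[source].add(target)
        adjacency[target].add(source)


def recommend_friends(person_id: int) -> list:
    """Suggest friends-of-friends who are not already direct friends."""
    direct = set(adjacency[person_id])
    candidates = set()
    for friend in direct:
        candidates |= adjacency[friend]
    suggestions = candidates - direct - {person_id}
    return sorted(nodes[node_id]["name"] for node_id in suggestions)


for_asha = recommend_friends(1)
for_meera = recommend_friends(3)
print(for_asha, for_meera)
```

The query never inspects the people's own attributes beyond the name; the answer comes entirely from how the nodes are connected.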
Document-oriented Database:

Used to store data as JSON-like documents.

Helps developers store data by using the same document-model format as used in the application code.

Examples: MongoDB, CouchDB

JSON: acronym for JavaScript Object Notation, an open-standard file format or data interchange format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types.
Document-oriented Database:

• Pairs each key with a complex data structure.
• Indexing: using B-Trees
• Data is stored in the format of documents, which may contain multiple different key-value pairs or even nested documents.
Document databases are good for storing and managing
Big Data-size collections of literal documents, like text
documents, email messages, and XML documents, as
well as conceptual “documents” like de-normalized
(aggregate) representations of a database entity such as a
product or customer.

They are also good for storing “sparse” data in general,


that is to say irregular (semi-structured) data that would
require an extensive use of “nulls” in an RDBMS.
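The "sparse data" point can be illustrated with a small in-memory collection where each document carries only the fields it needs. The `products` data and the `get_path` and `find` helpers are hypothetical names for this sketch; a document database provides its own query operators for nested fields.

```python
# Each document has only the fields that apply to it; no NULL padding is needed.
products = [
    {"_id": 1, "name": "Laptop", "specs": {"ram_gb": 16, "cpu": "i7"}, "tags": ["electronics"]},
    {"_id": 2, "name": "T-shirt", "size": "M"},
    {"_id": 3, "name": "Phone", "specs": {"ram_gb": 8}},
]


def get_path(document: dict, dotted_path: str):
    """Follow a dotted path such as 'specs.ram_gb'; None if any part is missing."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: list, dotted_path: str, predicate) -> list:
    """Return documents whose field at dotted_path exists and satisfies predicate."""
    matches = []
    for document in collection:
        value = get_path(document, dotted_path)
        if value is not None and predicate(value):
            matches.append(document)
    return matches


enough_ram = [doc["_id"] for doc in find(products, "specs.ram_gb", lambda ram: ram >= 8)]
medium_sized = [doc["name"] for doc in find(products, "size", lambda size: size == "M")]
print(enough_ram, medium_sized)
```

In a relational table the same three products would need `size`, `ram_gb`, and `cpu` columns, most of them holding nulls for most rows.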
Advantages of NoSQL Database

Enables good productivity in application development, as it is not required to store data in a structured format.

Is a better option for managing and handling large data sets.

Provides high scalability.

Users can quickly access data from the database through key-value.

Designed for use with low-cost commodity hardware.

Massive volumes of data (Big Data) are easily handled by NoSQL databases.

Economy: can be easily installed on cheap commodity hardware clusters as transaction and data volumes increase. This means that you can process and store more data at much less cost.

Dynamic schemas: NoSQL databases need no schemas to start working with data.
NoSQL Document-Based Data Model
MongoDB Database

History of MongoDB
MongoDB Overview
What is MongoDB?

❖ A powerful, flexible and scalable general-purpose database. It is an agile database that allows schemas to change quickly as applications evolve.
❖ A NoSQL database.
JSON Format
JSON Features

● Light-weight, text-based open standard for human-readable data interchange.
● Derived from JavaScript.
● Key aspects of data transfer are simplicity, extensibility, interoperability, openness and human readability.
● Can be parsed by a JavaScript parser
● Can represent simple and complex data
● Support for Unicode
● Can be used in AJAX
● Uses key-value pairs
● Is a collection of objects and arrays

JSON Datatypes

● String
● Number
● Boolean
● Object
● Array
JSON Format Example
{
  "book": [
    {
      "id": "01",
      "language": "Java",
      "edition": "third",
      "author": "Herbert Schildt"
    },
    {
      "id": "07",
      "language": "C++",
      "edition": "second",
      "author": "E.Balagurusamy"
    }
  ]
}
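As a quick check of the structure, the book example can be parsed with Python's standard `json` module. The text is repeated inside the snippet with the object braces and commas that valid JSON requires, so the snippet runs on its own; the variable names are illustrative.

```python
import json

# The book example as valid JSON: an object holding an array of two objects.
text = """
{
  "book": [
    {"id": "01", "language": "Java", "edition": "third", "author": "Herbert Schildt"},
    {"id": "07", "language": "C++", "edition": "second", "author": "E.Balagurusamy"}
  ]
}
"""

data = json.loads(text)

# JSON objects become dicts and JSON arrays become lists.
languages = [book["language"] for book in data["book"]]
print(languages)

# Serializing again yields text that parses back to the same structure.
round_trip = json.loads(json.dumps(data))
```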
BSON
• “Binary JSON”
• Binary-encoded serialization of JSON-like docs
• A computer interchange format that is mainly used for data storage and as a network transfer format in the MongoDB database.
• A simple binary form which is used to represent data structures and associative arrays (often called documents or objects in MongoDB).
BSON Example

{
  "_id": "37010",
  "city": "ADAMS",
  "pop": 2660,
  "state": "TN",
  "councilman": {
    "name": "John Smith",
    "address": "13 Scenic Way"
  }
}
Why MongoDB ?
Why MongoDB ? Cont…
Replica
Pros and Cons of MongoDB
SQL vs MongoDB

SQL Concepts | MongoDB Concepts
Database | Database
Table, View | Collection
Row | Document (BSON document)
Column | Field
Index | Index
Table join | Embedded documents and linking
Primary key (specify any unique column or column combination as primary key) | Primary key (automatically set to the _id field)
Aggregation (e.g. GROUP BY) | Aggregation pipeline
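The "table join" versus "embedded documents" row can be sketched by building the same data both ways. The SQL side uses Python's built-in `sqlite3` module; the document side is a plain dict standing in for a MongoDB document. The `customer` and `orders` tables and their contents are made up for this illustration.

```python
import sqlite3

# Relational side: two tables linked by a foreign key, combined with a join.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        item TEXT
    );
    INSERT INTO customer VALUES (1, 'Asha');
    INSERT INTO orders VALUES (10, 1, 'Laptop');
    INSERT INTO orders VALUES (11, 1, 'Mouse');
    """
)
rows = conn.execute(
    "SELECT c.id, c.name, o.item "
    "FROM customer c JOIN orders o ON o.customer_id = c.id "
    "ORDER BY o.id"
).fetchall()
conn.close()

# Document side: the orders are embedded inside the customer document,
# so reading the customer needs no join. The primary key maps to _id.
document = {
    "_id": rows[0][0],
    "name": rows[0][1],
    "orders": [{"item": item} for _, _, item in rows],
}
print(rows)
print(document)
```

Embedding trades some duplication for locality: everything needed to display the customer lives in one document.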
Schema design
⚫ RDBMS: join
MongoDB: Hierarchical Objects

• A MongoDB instance may have zero or more ‘databases’.
• A database may have zero or more ‘collections’.
• A collection may have zero or more ‘documents’.
• A document may have one or more ‘fields’.
• MongoDB ‘indexes’ function much like their RDBMS counterparts.
Replication
⚫ Replica Sets and Master-Slave
⚫ Replica sets are a functional superset of master/slave and are handled by much newer, more robust code.
⚫ Only one server is active for writes (the primary, or master) at a given time; this is to allow strongly consistent (atomic) operations. One can optionally send read operations to the secondaries when eventual consistency semantics are acceptable.
Why Replica Sets
⚫ Data Redundancy
⚫ Automated Failover
⚫ Read Scaling
⚫ Maintenance
⚫ Disaster Recovery(delayed secondary)
Sharding for horizontal scaling

• Sharding : a method for distributing data across multiple machines


• Sharded Cluster : A MongoDB sharded cluster consists of the following components:
– shard: Each shard contains a subset of the sharded data. Each shard can be deployed as
a replica set.
– mongos: The mongos acts as a query router, providing an interface between client
applications and the sharded cluster.
– config servers: Config servers store metadata and configuration settings for the cluster.
As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS).

Sharding
• Sharding is the process of distributing data across multiple
machines and it is MongoDB's approach to meet the demands of
data growth.

• As the size of the data increases, a single machine may not be sufficient to store the data or to provide acceptable read and write throughput.

• MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.

Sharded Cluster
• Sharded Cluster : A MongoDB sharded
cluster consists of the following components:
– Shard
– mongos
– config servers

Actual Sharding
Replication & Sharding conclusion
⚫ Sharding is the tool for scaling a system, and replication is the tool for data safety, high availability, and disaster recovery. The two work in tandem yet are orthogonal concepts in the design.
Shard
– Shard: Each shard contains a subset of the sharded data. Each shard can be deployed as a replica set.
– A replica set is a group of mongod instances that host the same data set.
– In a replica set, one node is the primary, which receives all write operations.
– All other data-bearing members (the secondaries) apply operations from the primary so that they hold the same data set.
– A replica set can have only one primary at a time.
mongod

• mongod is the primary daemon process for the MongoDB system.

• It handles data requests, manages data access, and performs background management operations.


mongos
– mongos:
• The mongos acts as a query router, providing an interface between client applications and the sharded cluster.

– config servers:
• Config servers store metadata and configuration settings for the cluster.
• As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS, Config Servers as Replica Set).


Config servers (contd)

• Config servers store the metadata for a sharded cluster.

• The metadata reflects the state and organization of all data and components within the sharded cluster.

• The metadata includes the list of chunks on every shard.

• Each sharded cluster must have its own config servers.


Sharded and Unsharded collections

• A database can have a mixture of sharded and unsharded collections.

• Unsharded collections are stored on a primary shard.

• Sharded collections are partitioned and distributed across the shards in the cluster.


Sharding

• Deploying multiple mongos routers supports high availability and scalability.

• A common pattern is to place a mongos on each application server.


Sharding Requirements

• Sharding requires at least two shards to distribute sharded data.

• Single-shard sharded clusters may be useful if you plan on enabling sharding in the near future but do not need it at the time of deployment.


Sharding Requirements

• A shard contains a subset of sharded data for a sharded cluster.

• Together, the cluster's shards hold the entire data set for the cluster.

• As of MongoDB 3.6, shards must be deployed as a replica set to provide redundancy and high availability.


Advantages of Sharding
1. Reads / Writes

• MongoDB distributes the read and write workload across the shards in the sharded cluster, allowing each shard to process a subset of cluster operations.

• Both read and write workloads can be scaled horizontally across the cluster by adding more shards.

• For queries that include the shard key or the prefix of a compound
shard key, mongos can target the query at a specific shard or set of
shards. These targeted operations are generally more efficient
than broadcasting to every shard in the cluster.
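The difference between a targeted and a broadcast query can be sketched in plain JavaScript. This only illustrates the routing idea; it is not how mongos is implemented. The shard names, the choice of "pop" as shard key, and the range boundaries are all assumptions.

```javascript
// Hypothetical chunk map: the shard key "pop" is split into three ranges.
const SHARD_KEY = "pop";
const chunkMap = [
  { shard: "shardA", min: -Infinity, max: 1000 },
  { shard: "shardB", min: 1000, max: 5000 },
  { shard: "shardC", min: 5000, max: Infinity },
];

// Decide which shards must receive an equality query.
function targetShards(query) {
  if (!(SHARD_KEY in query)) {
    // No shard key in the query: every shard has to be asked (broadcast).
    return chunkMap.map((chunk) => chunk.shard);
  }
  const value = query[SHARD_KEY];
  // Shard key present: only the shard owning the matching range is asked.
  return chunkMap
    .filter((chunk) => value >= chunk.min && value < chunk.max)
    .map((chunk) => chunk.shard);
}

console.log(targetShards({ pop: 2660 }));     // targeted: a single shard
console.log(targetShards({ city: "ADAMS" })); // broadcast: all three shards
```

The query on "pop" touches one shard, while the query on "city" has to be sent to all of them, which is why including the shard key generally makes a query cheaper.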
Advantages of Sharding

2. Storage Capacity

• Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data.

• As the data set grows, additional shards increase the storage capacity of the cluster.


Advantages of Sharding

3. High Availability
• A sharded cluster can continue to perform partial read / write operations even if one or more shards are unavailable.

• While the subset of data on the unavailable shards cannot be accessed during the downtime, reads or writes directed at the available shards can still succeed.

4. Useful for worldwide distribution of applications where communication links between the data centers would otherwise be a bottleneck.
Disadvantages of Sharding:

• Increased complexity of the query language.

• Increased complexity due to partitioning, load balancing, coordination, ensuring integrity, etc.

• Increased operational complexity: adding / removing indexes, adding / deleting fields (the columns of a collection), etc.


MongoDB Processes and configuration


• mongod – database instance

• mongos – sharding router process
– Analogous to a database router.
– Processes all requests from clients.
– Decides how many and which mongod instances should receive the query.
– Collates the results and sends them back to the client.

• mongo – an interactive shell (a client): a fully functional JavaScript environment for use with MongoDB.
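The "send the query to the relevant mongod instances, then collate the results" behaviour of mongos can be sketched as a scatter-gather step in plain JavaScript. The shard contents and the function name routeFind are hypothetical, and each shard is modelled as a simple array.

```javascript
// Each shard is modelled as an array of documents (hypothetical sample data).
const shards = {
  shardA: [{ name: "Ajay", city: "Pune" }],
  shardB: [
    { name: "Jyoti", city: "Pune" },
    { name: "Ruhi", city: "Mumbai" },
  ],
};

// Scatter the predicate to the chosen shards, then gather the partial results.
function routeFind(predicate, shardNames = Object.keys(shards)) {
  const partialResults = shardNames.map((name) => shards[name].filter(predicate));
  return [].concat(...partialResults); // collate into a single result list
}

const punePeople = routeFind((doc) => doc.city === "Pune");
console.log(punePeople.map((doc) => doc.name)); // names gathered from both shards
```

The client sees a single result list and does not need to know which shard each document came from.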
Terminology and Concepts

SQL Terms / Concepts    MongoDB Terms / Concepts
Database                Database
Table                   Collection
Row                     Document
Column                  Field
Index                   Index
Primary key             Primary key
How To Install MongoDB?
• To install MongoDB, first download the MongoDB Community Server version 4.2.12 (zip file) for your operating system from:
https://www.mongodb.com/try/download/community

• Select "zip" as the package type.
Connectivity Between Client and Server

For MongoDB 4.2.12

1. Server Configuration:
• Start the MongoDB server with mongod from the Command Prompt.
• First create an empty folder on any drive to store the database.
• Syntax for starting the server:
<path of MongoDB bin folder>\mongod.exe --dbpath <path of newly created folder>

2. Client Configuration:
• Client: <path of MongoDB bin folder>\mongo.exe

Example: If MongoDB is extracted on the G:\ drive and a new folder (My_MongoDB) is created on the E:\ drive, the server can be started from the Command Prompt as follows:
G:\mongodb-win32-x86_64-2012plus-4.2.12\mongodb-win32-x86_64-2012plus-4.2.12\bin>mongod.exe --dbpath E:\My_MongoDB

The client is then started from a second Command Prompt window:
G:\mongodb-win32-x86_64-2012plus-4.2.12\mongodb-win32-x86_64-2012plus-4.2.12\bin>mongo.exe
Executables

                  MongoDB   MySQL    Oracle    Informix    DB2
Database Server   mongod    mysqld   oracle    IDS         DB2 Server
Database Client   mongo     mysql    sqlplus   DB-Access   DB2 Client
MongoDB CRUD Operations

• Create Operations
• Read Operations
• Update Operations
• Delete Operations

Basic commands

> show dbs
• Print a list of all databases on the server.

> use myfirstdb
• Switch the current database to myfirstdb. The mongo shell variable db is set to the current database.

> show collections
• Print a list of all collections in the current database.


Creating a Collection

• Syntax:
> db.createCollection("collection_name")
• Suppose we want to store student information. We can create a Student_info collection as follows:
> db.createCollection("Student_info")
• This creates a new collection named "Student_info".
Inserting Documents in a Collection

> db.collection_name.insert({key1: value1, key2: value2, ...})
• Inserts a document or documents into a collection.
> db.Student_info.insert({id: "151", name: "Vasundhara", city: "Pune"})


Display documents

> db.collection_name.find()
• Returns all documents from a collection, with all fields of each document.
> db.Student_info.find()
{ "_id" : ObjectId("6051ea45f50ee3d1871c3398"), "id" : "151", "name" : "Vasundhara", "city" : "Pune" }
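The output of find() contains an "_id" field that was never supplied in the insert. A plain JavaScript sketch of that behaviour follows. It runs without a MongoDB server; insertSketch and findSketch are hypothetical helper names, a counter stands in for the ObjectId that MongoDB generates, and the second student is made-up sample data.

```javascript
// An array stands in for the Student_info collection.
const Student_info = [];
let nextId = 1;

// Store a copy of the document, generating an _id when the caller gives none.
// (MongoDB generates an ObjectId; a simple counter is used here instead.)
function insertSketch(collection, doc) {
  const stored = Object.assign({ _id: "generated-" + nextId++ }, doc);
  collection.push(stored);
  return stored;
}

// With an empty query every document matches, like find() with no arguments.
function findSketch(collection, query = {}) {
  return collection.filter((doc) =>
    Object.keys(query).every((key) => doc[key] === query[key])
  );
}

insertSketch(Student_info, { id: "151", name: "Vasundhara", city: "Pune" });
insertSketch(Student_info, { id: "152", name: "Pankaj", city: "Mumbai" });

const everyone = findSketch(Student_info);
const inPune = findSketch(Student_info, { city: "Pune" });
console.log(everyone.length, inPune[0].name, inPune[0]._id);
```

The user-supplied "id" field and the generated "_id" field are independent: only "_id" plays the role of the primary key.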


Other operations
• Insert one document:
> db.collection_name.insert({document})
• Insert multiple documents:
> db.collection_name.insert([{document1}, {document2}, {document3}])

• Display all documents:
> db.collection_name.find()
> db.collection_name.find().pretty() // display in a structured format


CRUD operations

▪ Update
Syntax: db.collection.update(<query>, <update>, <options>)
Example:
> db.Student_info.update(
{ name: "Pankaj" },
{ $set: { age: 24 } }
)


Update Operation : upsert

db.collection.update(<query>, <update>, { upsert: true })

• upsert is optional. If set to true, a new document is created when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found.

Example:
> db.Student_info.update({name: "Ruhi"}, {$set: {age: 23}}, {upsert: true})
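The effect of the upsert option can be sketched in plain JavaScript on an in-memory array. This is a simplified model, not the MongoDB implementation: upsertSketch is a hypothetical name, only equality queries and $set-style updates are handled, and no _id is generated.

```javascript
// Update the first matching document; with upsert, insert one if none matches.
function upsertSketch(collection, query, setFields, options = {}) {
  const matches = (doc) =>
    Object.keys(query).every((key) => doc[key] === query[key]);
  const target = collection.find(matches);

  if (target) {
    Object.assign(target, setFields); // behaves like $set on the match
    return { matched: 1, upserted: 0 };
  }
  if (options.upsert) {
    // No match: build a new document from the query fields plus the $set fields.
    collection.push(Object.assign({}, query, setFields));
    return { matched: 0, upserted: 1 };
  }
  return { matched: 0, upserted: 0 };
}

const students = [{ name: "Pankaj", age: 24 }];

const withoutUpsert = upsertSketch(students, { name: "Ruhi" }, { age: 23 });
const withUpsert = upsertSketch(students, { name: "Ruhi" }, { age: 23 }, { upsert: true });

console.log(withoutUpsert, withUpsert, students.length);
```

Without the option the collection is left unchanged; with it, a document for "Ruhi" is created from the query and the update together.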


Update Operation : multi

db.collection.update(<query>, <update>, { multi: true })

• multi is optional. If set to true, all documents that meet the query criteria are updated. If set to false, only one document is updated. The default value is false.

Example: Assume there are two documents with the name Jyoti. To update the city value for both, use the following command:

> db.Student_info.update({name: "Jyoti"}, {$set: {city: "Bangalore"}}, {multi: true})
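The difference between the default behaviour and multi: true can be sketched in plain JavaScript. As before, this is a simplified in-memory model with a hypothetical function name (updateSketch), and the starting cities are made-up sample data.

```javascript
// Apply $set-style fields to the first match, or to every match when multi is true.
function updateSketch(collection, query, setFields, options = {}) {
  const matches = (doc) =>
    Object.keys(query).every((key) => doc[key] === query[key]);
  let modified = 0;
  for (const doc of collection) {
    if (!matches(doc)) continue;
    Object.assign(doc, setFields);
    modified += 1;
    if (!options.multi) break; // default: stop after the first match
  }
  return modified;
}

// Two separate copies of the same sample data, one per call.
const makeStudents = () => [
  { name: "Jyoti", city: "Pune" },
  { name: "Jyoti", city: "Nagpur" },
  { name: "Ruhi", city: "Mumbai" },
];

const single = makeStudents();
const all = makeStudents();

const singleCount = updateSketch(single, { name: "Jyoti" }, { city: "Bangalore" });
const multiCount = updateSketch(all, { name: "Jyoti" }, { city: "Bangalore" }, { multi: true });

console.log(singleCount, multiCount); // prints: 1 2
```

In the first copy only one "Jyoti" document moves to Bangalore; in the second copy both do.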

Delete operation

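▪ Delete
Syntax: db.collection.remove(<query>, <justOne>)
▪ collection is the name of the collection (the counterpart of a SQL table) from which documents are removed.
▪ justOne is optional; if set to true, at most one matching document is removed.
Example:
> db.Student_info.remove({name: "Pankaj"})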


List of Other Commands
• Use of Regular Expressions:
{ <field>: { $regex: /pattern/, $options: '<options>' } }

• Pattern Matching:
> db.Student_info.find({name: {$regex: "^V.*"}}) // names starting with 'V'
> db.Student_info.find({name: {$regex: "a"}}) // names having the substring 'a'
> db.Student_info.find({name: {$regex: /i$/}}) // names ending with 'i'
> db.Student_info.find({name: {$regex: /^s.*m$/}}) // names starting with 's' and ending with 'm'
> db.Student_info.find({name: {$regex: new RegExp("^J.*i$", "i")}}) // names starting with 'J' and ending with 'i', case-insensitive match
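The same patterns can be tried in plain JavaScript without a database, since the mongo shell is itself a JavaScript environment. The list of names below is sample data assumed for illustration ("jaini" is included only to show the case-insensitive match). MongoDB evaluates $regex with its own regular expression engine, but for simple patterns like these the results are expected to agree.

```javascript
// Sample names standing in for the "name" field of Student_info documents.
const names = ["Vasundhara", "Pankaj", "Jyoti", "Ruhi", "jaini", "Pooja"];

const startsWithV = names.filter((n) => /^V.*/.test(n)); // anchored at the start
const containsA = names.filter((n) => /a/.test(n));      // substring match
const endsWithI = names.filter((n) => /i$/.test(n));     // anchored at the end

// The "i" flag makes the match case-insensitive, so "jaini" also qualifies.
const jToI = names.filter((n) => new RegExp("^J.*i$", "i").test(n));

console.log(startsWithV); // [ 'Vasundhara' ]
console.log(containsA);   // [ 'Vasundhara', 'Pankaj', 'jaini', 'Pooja' ]
console.log(endsWithI);   // [ 'Jyoti', 'Ruhi', 'jaini' ]
console.log(jToI);        // [ 'Jyoti', 'jaini' ]
```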
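List of Other Commands
Querying Arrays:
• Insert documents that contain an array field. To insert several documents in one call, pass them as an array:
> db.Student_info.insert([{name: "Pooja", marks: [56, 69, 75]}, {name: "Ajay", age: 20, marks: [80, 90, 65]}])
• If the documents are passed as separate arguments instead, only the first one is inserted and the shell reports:
WriteResult({ "nInserted" : 1 })

• Find the documents whose marks array contains all the listed values:
> db.Student_info.find({marks: {$all: [40]}})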
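The meaning of $all can be sketched in plain JavaScript: a document matches when its array contains every listed value, in any order. The helper name matchesAll is hypothetical, and the sample documents are the two inserted above.

```javascript
// True when doc[field] is an array that contains every required value.
function matchesAll(doc, field, required) {
  const values = doc[field];
  return Array.isArray(values) && required.every((r) => values.includes(r));
}

const students = [
  { name: "Pooja", marks: [56, 69, 75] },
  { name: "Ajay", age: 20, marks: [80, 90, 65] },
];

// Order does not matter: 65 and 80 are both present in Ajay's marks.
const has65and80 = students
  .filter((s) => matchesAll(s, "marks", [65, 80]))
  .map((s) => s.name);

// No student has a mark of 40, so nothing matches.
const has40 = students.filter((s) => matchesAll(s, "marks", [40]));

console.log(has65and80, has40.length); // prints: [ 'Ajay' ] 0
```

With this sample data a query for [40] returns nothing, since neither marks array contains 40.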
