Unit V NoSQL Databases

NoSQL database ppt sppu 3rd year engineering

Uploaded by

Supriya Salke
Copyright
© © All Rights Reserved
Government College of Engineering and Research , Avasari

Department of Computer Engineering

DATABASE MANAGEMENT SYSTEM
TE 2019 Course

Prof.K. B. Sadafale
Assistant Professor
Syllabus
Unit I: Introduction to Database Management Systems and ER Model

UNIT II: SQL and PL/SQL

UNIT III: Relational Database Design

UNIT IV: Database Transaction Management

UNIT V: NoSQL Databases

UNIT VI: Advances in Databases


UNIT V: NoSQL Databases
Introduction to Distributed Database System, Advantages, Disadvantages, CAP Theorem.

Types of Data: Structured, Unstructured Data and Semi-Structured Data.

NoSQL Database: Introduction, Need, Features.

Types of NoSQL Databases: Key-value store, document store, graph, wide column stores, BASE Properties, Data Consistency model, ACID Vs BASE, Comparative study of RDBMS and NoSQL.

MongoDB (with syntax and usage): CRUD Operations, Indexing, Aggregation, MapReduce, Replication, Sharding.
Unit V
NoSQL Databases
Introduction to Distributed Databases
Distributed Systems
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
Distributed processing

The operations that occur when an application distributes its tasks among different computers in a network.

For example, a database application typically distributes front-end presentation tasks to client computers and allows a back-end database server to manage shared access to a database.

Consequently, a distributed database application processing system is more commonly referred to as a client/server database application system.
A “Distributed Database” is a logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network.
It is a set of databases in a distributed system that can appear to applications as a single data source.
A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A “Distributed DBMS” (DDBMS) is a software system that permits the management of the distributed database and makes the distribution transparent to users.
Differentiate between local and global transactions

A local transaction accesses data in the single site at which
the transaction was initiated.

A global transaction either accesses data in a site different
from the one at which the transaction was initiated or
accesses data in several different sites.
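The distinction can be sketched as a small classifier. This is only an illustration: the function name classifyTransaction and the site names are hypothetical, and it assumes the list of accessed sites is not empty.

```javascript
// Hypothetical helper: the function and the site names are illustrative only.
function classifyTransaction(originSite, accessedSites) {
  const sites = new Set(accessedSites);
  // Local: every accessed data item lives at the site that started the transaction.
  const isLocal = sites.size === 1 && sites.has(originSite);
  return isLocal ? "local" : "global";
}

console.log(classifyTransaction("site1", ["site1"]));          // local
console.log(classifyTransaction("site1", ["site2"]));          // global
console.log(classifyTransaction("site1", ["site1", "site3"])); // global
```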
Figure: Centralized DBMS on a network (Site 1 to Site 5 connected through a communication network)
Figure: Distributed DBMS environment (Site 1 to Site 5 connected through a communication network)
Why use a DDBMS?

Advantages:
• Reflects organizational structure
• Improved shareability and local autonomy
• Improved availability
• Improved reliability
• Improved performance
• Economics
• Modular growth

Disadvantages:
• Complexity
• Cost
• Security
• Integrity control more difficult
• Lack of standards
• Lack of experience
• Database design more complex
Architecture of Distributed Databases
A distributed database system allows applications to access
data from local and remote databases.

In a homogeneous distributed database system, every database is the same DBMS product, e.g. every database is an Oracle database.

In a heterogeneous distributed database system, the databases are different DBMS products, e.g. at least one of the databases is a non-Oracle database.

Distributed databases use a client/server architecture to process information requests.
Homogeneous Distributed Database Systems

Same software/schema on all sites, data may be partitioned among sites

Goal: provide a view of a single database, hiding details of distribution
A homogeneous distributed database system is a network of two or more databases of the same DBMS product that reside on one or more machines.
The figure illustrates a distributed system that connects three databases: hq, mfg, and sales.
An application can simultaneously access or modify the data in several databases in a single distributed environment.
Figure : Homogeneous Distributed Database
Figure : Homogeneous Database
Heterogeneous Distributed Database Systems
In a heterogeneous distributed database, different sites may use different schemas and different database management system software.
The sites may not be aware of one another, and they may provide only limited facilities for cooperation in transaction processing.
Distributed Data Storage
Consider a relation r that is to be stored in the database.
There are two approaches to storing this relation in a distributed database:
• Replication
System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance.
• Fragmentation
Relation is partitioned into several fragments stored in distinct sites.
• Replication and fragmentation can be combined
Relation is partitioned into several fragments; the system maintains several identical replicas of each such fragment.
Data Replication
If relation r is replicated, a copy of r is stored in two or more sites.
In the most extreme case we have full replication, in which a copy is stored in every site in the system.
Advantages of Replication

Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.

Parallelism: queries on r may be processed by
several nodes in parallel.

Reduced data transfer: relation r is available
locally at each site containing a replica of r.
Disadvantages of Replication

Increased cost of updates: each replica of relation r must be updated.

Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.
Data Fragmentation
• If relation r is fragmented, r is divided into a number of fragments r1, r2, r3, …, rn.
• These fragments contain sufficient information to allow reconstruction of the original relation r.
• Data can be distributed by storing individual tables at different sites.
• Data can also be distributed by decomposing a table and storing portions at different sites, which is called fragmentation.
• There are two different schemes for fragmenting a relation:
• Horizontal Fragmentation
• Vertical Fragmentation
Horizontal Fragmentation
Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments.
Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
Consider the following schema:
Account_schema = (account_number, branch_name, balance)
In horizontal fragmentation, a relation r is partitioned into a number of subsets r1, r2, r3, …, rn.
Each tuple of relation r must belong to at least one of the fragments, so that the original relation can be reconstructed if needed.
The account relation can be divided into several fragments, each of which consists of the tuples of accounts belonging to a particular branch.
Suppose the banking system has only two branches, Hillside and Valleyview.
Then there are two fragments, each defined by a selection on the account relation:

account1 = σ branch_name = "Hillside" (account)

account2 = σ branch_name = "Valleyview" (account)

The original relation is reconstructed as the union of the fragments: account = account1 ∪ account2.
Horizontal Fragmentation Example

• A bank account schema has a relation
Account-schema = (branch-name, account-number, balance).
• It fragments the relation by location and stores each fragment locally: rows with branch-name = "Hillside" are stored at the Hillside site in one fragment.
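Horizontal fragmentation and its reconstruction can be sketched in a few lines of JavaScript. This is a minimal in-memory illustration, not how a DDBMS stores fragments: the sample rows and the helper names select and union are assumptions made for the example; only the schema and the branch names come from the slides.

```javascript
// Sample rows are invented for illustration; only the schema comes from the slides.
const account = [
  { account_number: "A-305", branch_name: "Hillside", balance: 500 },
  { account_number: "A-226", branch_name: "Hillside", balance: 336 },
  { account_number: "A-177", branch_name: "Valleyview", balance: 205 },
];

// Selection: keep the tuples that satisfy the predicate.
const select = (relation, predicate) => relation.filter(predicate);

// One fragment per branch, each of which would be stored at its own site.
const account1 = select(account, (t) => t.branch_name === "Hillside");
const account2 = select(account, (t) => t.branch_name === "Valleyview");

// Reconstruction: union of the fragments (duplicates removed by key).
function union(...fragments) {
  const byKey = new Map();
  for (const fragment of fragments) {
    for (const tuple of fragment) byKey.set(tuple.account_number, tuple);
  }
  return [...byKey.values()];
}

const rebuilt = union(account1, account2);
console.log(account1.length, account2.length, rebuilt.length); // 2 1 3
```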
Vertical Fragmentation
Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1, R2, R3, …, Rn of the schema R so that
R = R1 ∪ R2 ∪ R3 ∪ … ∪ Rn
Each fragment ri of r is defined by
ri = πRi(r)
We can reconstruct relation r from the fragments by taking the natural join

r = r1 ⋈ r2 ⋈ r3 ⋈ … ⋈ rn

For this join to rebuild r exactly, each Ri normally includes the primary key of R (or a tuple identifier).
Vertical Fragmentation Example

• An employee-info schema has a relation
employee-info schema = (designation, name, Employee-id, salary).
• It fragments the relation to put the information in two tables for security reasons.
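A minimal sketch of vertical fragmentation on the employee-info relation follows. The sample rows, the choice of which attributes go into which fragment, and the helper names project and joinOnKey are assumptions made for illustration; the key attribute is kept in both fragments so that the join can match tuples again.

```javascript
// Sample rows are invented; attribute names follow the employee-info schema.
const employeeInfo = [
  { employee_id: 1, name: "Asha", designation: "Engineer", salary: 50000 },
  { employee_id: 2, name: "Ravi", designation: "Manager", salary: 70000 },
];

// Projection: keep only the listed attributes of each tuple.
const project = (relation, attrs) =>
  relation.map((t) => Object.fromEntries(attrs.map((a) => [a, t[a]])));

// Both fragments keep employee_id; salary is isolated in its own fragment.
const r1 = project(employeeInfo, ["employee_id", "name", "designation"]);
const r2 = project(employeeInfo, ["employee_id", "salary"]);

// Natural join on the shared attribute `key`.
function joinOnKey(left, right, key) {
  const index = new Map(right.map((t) => [t[key], t]));
  return left
    .filter((t) => index.has(t[key]))
    .map((t) => ({ ...t, ...index.get(t[key]) }));
}

const rebuilt = joinOnKey(r1, r2, "employee_id");
console.log(rebuilt);
```

A user who may read only r1 never sees the salary column, which is the security motive mentioned in the example.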
Transparency
The user of a distributed database system should not be
required to know either where the data are physically located
or how the data can be accessed at the specific local site.
This characteristic is called data transparency.

Fragmentation transparency
Replication transparency
Location transparency
Fragmentation transparency:
Users are not required to know how a relation has
been fragmented.

Replication transparency
Users view each data object as logically unique.
The distributed system may replicate an object to
increase either system performance or data
availability.

Location transparency
Users are not required to know the physical location of the data.
Types of Data
Big Data includes huge volume, high velocity, and an extensible variety of data. The data are of 3 types: structured data, semi-structured data, and unstructured data.
Structured data –
Structured data is data whose elements are addressable for effective analysis.
It has been organized into a formatted repository that is typically a database.
It concerns all data which can be stored in an SQL database in a table with rows and columns.
They have relational keys and can easily be mapped into pre-designed fields.
Today, structured data is the most commonly processed kind of data and the simplest to manage. Example: Relational data
Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but keeping it semi-structured saves space. Example: XML data

Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model, so it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
NoSQL

“Not Only SQL”


SQL JOIN

An SQL JOIN clause combines records from two or more tables in a database.

It creates a set that can be saved as a table or used as it is.

A JOIN is a means for combining fields from two tables by using values common to each.

ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS.

As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.

JOIN is one of the most complex SQL operations.
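To make the cost of a JOIN concrete, here is a minimal nested-loop inner join over two in-memory tables. The tables, column names and the function innerJoin are hypothetical, and a real RDBMS would normally pick a cheaper join algorithm; the sketch only shows what "combining rows by common values" means.

```javascript
// Hypothetical tables; the column names are illustrative.
const teachers = [
  { teacher_id: 1, name: "Meera", dept_id: 10 },
  { teacher_id: 2, name: "Kiran", dept_id: 20 },
  { teacher_id: 3, name: "Sunil", dept_id: 99 }, // no matching department
];
const departments = [
  { dept_id: 10, dept_name: "Computer" },
  { dept_id: 20, dept_name: "Mechanical" },
];

// Nested-loop inner join: compare every row of `left` with every row of
// `right` and keep the pairs whose join columns are equal.
function innerJoin(left, right, column) {
  const result = [];
  for (const l of left) {
    for (const r of right) {
      if (l[column] === r[column]) result.push({ ...l, ...r });
    }
  }
  return result;
}

const joined = innerJoin(teachers, departments, "dept_id");
console.log(joined.map((row) => `${row.name}: ${row.dept_name}`));
// [ 'Meera: Computer', 'Kiran: Mechanical' ]
```

The work grows with the product of the two table sizes, which is one reason NoSQL stores that spread data over many machines tend to avoid joins.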


Horizontal and Vertical Scaling - Scaling Resources Up

• Horizontal scaling means that you scale by adding more machines to your pool of resources.
• Vertical scaling means that you scale by adding more power (CPU, RAM, etc.) to your existing machine.

• In the database world, horizontal scaling is often based on partitioning of the data, i.e. splitting up the data so that each machine contains only part of the data.
• In vertical scaling the data resides on a single node, and scaling is done through the use of multiple cores, etc., i.e. spreading the load between the CPU and RAM resources of that single machine.
Good examples of horizontal scaling are the cloud data stores, e.g. DynamoDB, Cassandra, MongoDB.

A good example of vertical scaling is MySQL: Amazon RDS (the cloud version of MySQL) provides an easy way to scale vertically by switching from small to bigger machines.
ACID

ACID is an acronym for a set of properties that you would like to hold when modifying a database via a transaction, i.e. a group of related changes to the database.

Atomicity

Consistency

Isolation

Durability
Atomicity means that you can guarantee that all of a transaction happens, or none of it does.

Consistency means that a transaction takes the database from one valid state to another, so every integrity constraint still holds after it commits.

Isolation means that one transaction cannot read data from another transaction that is not yet completed.

Durability means that once a transaction is complete, it is guaranteed that all of the changes have been recorded to a durable medium (such as a hard disk).
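Atomicity can be illustrated with a small all-or-nothing transfer. This is a sketch under simplifying assumptions: the "database" is a plain in-memory object, the account names and the helpers runAtomically and transfer are hypothetical, and real systems use logging rather than copying the whole state.

```javascript
// In-memory stand-in for a database; account names are hypothetical.
const balances = { A: 100, B: 50 };

// Run `work` against a private copy and publish it only if no step throws.
function runAtomically(state, work) {
  const draft = { ...state };
  work(draft);                 // may throw part-way through
  Object.assign(state, draft); // commit: all changes become visible together
}

// Builds the two related changes that make up one transfer.
function transfer(from, to, amount) {
  return (draft) => {
    draft[from] -= amount;
    if (draft[from] < 0) throw new Error("insufficient funds");
    draft[to] += amount;
  };
}

runAtomically(balances, transfer("A", "B", 30)); // commits
try {
  runAtomically(balances, transfer("A", "B", 500)); // aborts, nothing changes
} catch (err) {
  console.log("rolled back:", err.message);
}
console.log(balances); // { A: 70, B: 80 }
```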
What is NoSQL?

“NoSQL is the term used to designate database management systems that differ from traditional relational database management systems in some way.

These data stores may not require fixed-table schemas, usually avoid join operations, and typically scale horizontally.”

“Non-relational” may be a more accurate term than “NoSQL”, as some NoSQL DBs do support a subset of SQL.
NoSQL is a class of non-relational database management systems, different from traditional relational database management systems in some significant ways.

It is designed for distributed data stores with very large-scale data storage needs (for example, Google or Facebook, which collect terabits of data every day for their users).

This type of data store may not require a fixed schema, avoids join operations, and typically scales horizontally.

NoSQL databases are sometimes referred to as cloud databases, non-relational databases, or Big Data databases.
Today, data is becoming easier to access and capture through third parties such as Facebook, Google+ and others.

Personal user information, social graphs, geo-location data, user-generated content and machine logging data are just a few examples where the data has been increasing exponentially.

Using these services properly requires processing huge amounts of data, a workload that traditional SQL databases were not designed for and handle poorly.

NoSQL databases have evolved to handle this huge volume of data properly.
RDBMS has a problem with Unstructured or Semi-Structured Data
Example
A typical traditional, structured, table-based relational database

Fixed number of fields for each record – highly structured


CAP Theorem

You need to understand the CAP theorem when you talk about
NoSQL databases, or, in fact, when designing any distributed
system.

The CAP theorem states that there are three basic requirements
which exist in a special relation when designing applications for a
distributed architecture.
Consistency - This means that the data in the database remains consistent after the execution of an operation.
For example, after an update operation, all clients see the same data.

Availability - This means that the system is always on (guaranteed service availability), with no downtime.

Partition Tolerance - This means that the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
Generally, it is not possible to fulfill all three requirements in a distributed system.

CAP states that a distributed system can guarantee only two of the three requirements at the same time.

In practice a distributed system must be partition tolerant (P), so when a partition occurs we have to choose between Consistency and Availability.

Current NoSQL databases follow different combinations of C and A from the CAP theorem.
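The choice between C and A during a partition can be sketched with a toy two-replica cluster. This is a deliberately simplified model, not any real database's protocol: the classes Replica and Cluster and the mode labels "CP" and "AP" are assumptions used only to show the trade-off.

```javascript
// Two replicas of one value; "CP" and "AP" label the trade-off each mode shows.
class Replica {
  constructor() { this.value = "v0"; }
}

class Cluster {
  constructor(mode) {
    this.mode = mode;         // "CP" or "AP"
    this.partitioned = false; // true when the replicas cannot talk to each other
    this.replicas = [new Replica(), new Replica()];
  }

  // A client writes through replica `i`.
  write(i, value) {
    if (this.partitioned) {
      if (this.mode === "CP") return "rejected"; // give up availability
      this.replicas[i].value = value;            // AP: accept on one side only
      return "accepted";
    }
    for (const r of this.replicas) r.value = value; // healthy: update all copies
    return "accepted";
  }

  read(i) { return this.replicas[i].value; }
}

for (const mode of ["CP", "AP"]) {
  const cluster = new Cluster(mode);
  cluster.partitioned = true;
  const outcome = cluster.write(0, "v1");
  console.log(mode, outcome, cluster.read(0), cluster.read(1));
}
// CP rejected v0 v0   (consistent, but the write was refused)
// AP accepted v1 v0   (available, but the replicas disagree)
```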
In general, NoSQL databases have become the first alternative to relational databases, with scalability, availability, and fault tolerance being key deciding factors.

A very flexible and schema-less data model, horizontal scalability, distributed architectures, and the use of languages and interfaces that are “not only” SQL typically characterize this technology.
Types of NoSQL
There are four general types of NoSQL databases, each with its own specific attributes:

Key-Value store – we start with this type of database because these are some of the least complex NoSQL options.
These databases are designed for storing data in a schema-less way.
In a key-value store, all of the data consists of an indexed key and a value, hence the name.
Examples of this type of database include: Cassandra, DynamoDB, Azure Table Storage (ATS), Riak, BerkeleyDB.

Column store – (also known as wide-column stores) these databases are designed for storing data tables as sections of columns of data, rather than as rows of data.
While this simple description sounds like the inverse of a standard database, wide-column stores offer very high performance and a highly scalable architecture.
Examples include: HBase, BigTable and HyperTable
Column-based Stores (Wide-Column Stores)

What is a column-based store? - Data tables are stored as sections of columns of data, rather than as rows of data.
Document database – expands on the basic idea of key-value stores: the values are “documents” that contain more complex data, and each document is assigned a unique key, which is used to retrieve the document.
These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.
Examples include: MongoDB and CouchDB.

Graph database – Based on graph theory, these databases are designed for data whose relations are well represented as a graph: elements that are interconnected, with an undetermined number of relations between them.
Examples include: Neo4J and Polyglot.
Graph Database Systems


Data model: nodes and edges


Nodes may have properties (including ID)


Edges may have labels or roles
Graph Database

Applies graph theory to the storage of information about the relationships between entries.

A graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data.

By definition, a graph database is any storage system that provides index-free adjacency.

- This means that every element contains a direct pointer to its adjacent element and no index lookups are necessary.

In general, graph databases are useful when you are more interested in relationships between data than in the data itself: for example, in representing and traversing social networks, generating recommendations, or conducting forensic investigations (e.g. pattern detection).
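A traversal such as "friends of friends" can be sketched with an adjacency list, where each node holds direct references to its neighbours. The social graph and the function nodesAtDistance are hypothetical, and a Map is only a stand-in for the direct pointers that index-free adjacency refers to.

```javascript
// Hypothetical social graph: each node lists its directly connected neighbours.
const graph = new Map([
  ["alice", ["bob", "carol"]],
  ["bob", ["alice", "dave"]],
  ["carol", ["alice", "dave"]],
  ["dave", ["bob", "carol", "erin"]],
  ["erin", ["dave"]],
]);

// Breadth-first traversal that returns the nodes exactly `depth` hops away.
function nodesAtDistance(start, depth) {
  let frontier = [start];
  const seen = new Set([start]);
  for (let hop = 0; hop < depth; hop++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbour of graph.get(node) || []) {
        if (!seen.has(neighbour)) {
          seen.add(neighbour);
          next.push(neighbour);
        }
      }
    }
    frontier = next;
  }
  return frontier;
}

// Friends of friends of alice (candidates for a recommendation).
console.log(nodesAtDistance("alice", 2)); // [ 'dave' ]
```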
Types of NoSQL Databases
A database is a collection of structured data or information
which is stored in a computer system and can be accessed
easily.
A database is usually managed by a Database Management
System (DBMS).
A NoSQL database is a non-relational database that stores data in a non-tabular form.
NoSQL stands for Not only SQL.
The main types are document, key-value, wide-column, and graph.
Types of NoSQL Database:

Document-based databases

Key-value stores

Column-oriented databases

Graph-based databases
Document-Based Database
The document-based database is a nonrelational database. Instead of storing the data in rows and columns (tables), it uses documents to store the data in the database. A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use these data in the applications. In a document database, particular elements can be indexed for faster querying.
Collections are groups of documents that have similar contents. Documents in a collection are not required to share the same schema, because document databases have a flexible schema.
Key features of document databases:
Flexible schema: Documents in the database have a flexible schema. It means the documents in the database need not have the same schema.
Faster creation and maintenance: the creation of documents is easy, and minimal maintenance is required once we create the document.
No foreign keys: There is no enforced relationship between two documents, so documents can be independent of one another. So, there is no requirement for a foreign key in a document database.
Open formats: To build a document we use XML, JSON, and others.
Key-Value Stores
A key-value store is a nonrelational database. The simplest form
of a NoSQL database is a key-value store. Every data element in
the database is stored in key-value pairs. The data can be
retrieved by using a unique key allotted to each element in the
database. The values can be simple data types like strings and
numbers or complex objects.
A key-value store is like a relational database with only two
columns which is the key and the value.
Key features of the key-value store:
Simplicity.
Scalability.
Speed.
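The "two-column table" idea can be sketched with JavaScript's built-in Map. The class KeyValueStore and the keys used are illustrative only; real key-value stores add persistence, replication and partitioning on top of this interface.

```javascript
// Minimal in-memory key-value store built on Map; the class name is illustrative.
class KeyValueStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }
  get(key) { return this.data.get(key); }       // undefined when the key is absent
  delete(key) { return this.data.delete(key); } // true if the key existed
}

const store = new KeyValueStore();
store.put("user:1", "Asha");                                 // simple value
store.put("cart:1", { items: ["pen", "book"], total: 120 }); // complex object

console.log(store.get("user:1"));       // Asha
console.log(store.get("cart:1").total); // 120
console.log(store.delete("user:1"));    // true
console.log(store.get("user:1"));       // undefined
```

Every operation goes through the key, which is what keeps the model simple and fast; there is no way to ask for "all carts with total above 100" without scanning the values.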
Column Oriented Databases
A column-oriented database is a non-relational database that stores the data in columns instead of rows. That means when we want to run analytics on a small number of columns, we can read those columns directly without consuming memory with the unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the data with greater speed. A columnar database is used to store a large amount of data.
Key features of column-oriented databases:
Scalability.
Compression.
Very responsive.
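The difference between the two layouts can be sketched with plain arrays. The records are invented for illustration, and real column stores add compression and on-disk formats that this sketch leaves out.

```javascript
// The same three records in two layouts; the data is invented for illustration.
const rows = [
  { id: 1, name: "Asha", sal: 25000 },
  { id: 2, name: "Ravi", sal: 40000 },
  { id: 3, name: "Meera", sal: 50000 },
];

// Column layout: one array per column, aligned by position.
const columns = {
  id: rows.map((r) => r.id),
  name: rows.map((r) => r.name),
  sal: rows.map((r) => r.sal),
};

// Analytics on one column touches only that column's array,
// not the id or name values of each record.
const totalSal = columns.sal.reduce((sum, v) => sum + v, 0);
console.log(totalSal); // 115000
```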
Graph-Based databases
Graph-based databases focus on the relationships between the elements. They store the data in the form of nodes in the database. The connections between the nodes are called links or relationships.
Key features of graph databases:
In a graph-based database, it is easy to identify the relationship between the data by using the links.
Queries return results in real time.
The speed depends upon the number of relationships among the database elements.
BASE: Basically Available, Soft state, Eventual consistency
Basically available means the database favours availability in the CAP sense: it keeps responding, even if some responses are stale
Soft state means that even without an input, the system state may change
Eventual consistency means that the system will become consistent over time: if no new updates arrive, all replicas converge to the same value
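Eventual consistency can be sketched with three replicas that are synchronised in the background. This is a toy model that assumes a last-write-wins rule based on a logical timestamp; the names write and synchronise are hypothetical, and real systems use more careful conflict resolution.

```javascript
// Each replica stores a value with a logical timestamp; names are illustrative.
const replicas = [
  { value: "old", ts: 1 },
  { value: "old", ts: 1 },
  { value: "old", ts: 1 },
];

// A write is acknowledged after reaching a single replica.
function write(replicaIndex, value, ts) {
  replicas[replicaIndex] = { value, ts };
}

// Background synchronisation: every replica adopts the newest version seen
// (last-write-wins, an assumption of this sketch).
function synchronise() {
  const newest = replicas.reduce((a, b) => (b.ts > a.ts ? b : a));
  for (let i = 0; i < replicas.length; i++) replicas[i] = { ...newest };
}

write(0, "new", 2);
console.log(replicas.map((r) => r.value)); // [ 'new', 'old', 'old' ]
synchronise();
console.log(replicas.map((r) => r.value)); // [ 'new', 'new', 'new' ]
```

Between the write and the synchronisation, a read from replica 1 or 2 returns the stale value, which is the "soft state" window that BASE systems accept.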
Difference between ACID and BASE
Criteria | ACID | BASE
Simplicity | Simple | Complex
Focus | Commits | Best attempt
Maintenance | High | Low
Consistency of data | Strong | Weak/Loose
Concurrency scheme | Nested transactions | Close to answer
Scaling | Vertical | Horizontal
Implementation | Easy to implement | Difficult to implement
Upgrade | Harder to upgrade | Easy to upgrade
Type of database | Robust | Simple
Type of code | Simple | Harder
Time required for completion | Less time | More time
Examples | Oracle, MySQL, SQL Server, etc. | DynamoDB, Cassandra, CouchDB, SimpleDB, etc.
RDBMS versus NoSQL

RDBMS
- Structured and organized data
- Structured query language (SQL)
- Data and its relationships are stored in separate tables.
- Data Manipulation Language, Data Definition Language
- Tight Consistency
- ACID Transaction

NoSQL
- Stands for Not Only SQL
- No declarative query language
- No predefined schema
- Variants - Key-Value Pair Store, Column Store, Document Store, Graph
Store
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP Theorem
- Prioritizes high performance, high availability and scalability
Key features of NoSQL:

1. Horizontal scaling of “simple operations”
– key lookups, reads and writes of one record or a small number of records, simple selections

2. Replicate/distribute data over many servers

3. Weaker concurrency model than ACID

4. Efficient use of distributed indexes and RAM

5. Flexible schema
Different Types of NoSQL Systems

• Distributed Key-Value Systems - Look up a single value for a key
– Amazon’s Dynamo

• Document-based Systems - Access data by key or by search of “document” data.
– CouchDB
– MongoDB

• Column-based Systems
– Google’s BigTable
– Facebook’s Cassandra

• Graph-based Systems - Use a graph structure
– Google’s Pregel
– Neo4j
Introduction To MongoDB
MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability.
MongoDB works on the concepts of collection and document.
It is a NoSQL database.
Database
Database is a physical container for collections. Each database gets its own set of files on the
file system.
A single MongoDB server typically has multiple databases.
Collection
Collection is a group of MongoDB documents.
It is the equivalent of an RDBMS table.
A collection exists within a single database.
Collections do not enforce a schema.
Documents within a collection can have different fields.
Typically, all documents in a collection are of similar or related purpose.
Document
A document is a set of key-value pairs.
Documents have dynamic schema.
Dynamic schema means that documents in the same collection do not need to have the same
set of fields or structure, and common fields in a collection's documents may hold different
types of data.
Advantages of MongoDB over RDBMS
Schema less: MongoDB is a document database in which one collection holds different documents.
The number of fields, content and size of the document can differ from one document to another.
No complex joins
MongoDB supports dynamic queries on documents using a document-based query language that's nearly as powerful as SQL
Ease of scale-out: MongoDB is easy to scale
Why use MongoDB?
Document Oriented Storage: Data is stored in the form of JSON (JavaScript Object Notation) style documents
Index on any attribute
Replication & High Availability
Rich Queries
Fast In-Place Updates
Professional Support By MongoDB
Where to use MongoDB?
Big Data
Content Management and Delivery
Mobile and Social Infrastructure
User Data Management
Data Hub

MongoDB provides rich semantics for reading and manipulating data.
CRUD stands for Create, Read, Update, and Delete.
These terms are the foundation for all interactions with the database.
Figure: A collection of MongoDB documents.
Figure: The stages of a MongoDB query with a query criteria and a sort modifier.
SQL Vs MongoDB
SQL Concepts | MongoDB Concepts
Database | Database
Table | Collection
Row | Document
Column | Field
Index | Index
Table Join | Embedded documents & Linking
Primary key (specify any unique column or column combination as primary key) | Primary key (in MongoDB, the primary key is automatically set to the _id field)
Aggregation (e.g. group by) | Aggregation pipeline


A MongoDB instance may have zero or more
databases

A database may have zero or more ‘collections’.

A collection may have zero or more ‘documents’.

A document may have one or more ‘fields’.


MongoDB Create Database
The use command

use database_name

E.g.: use kbs
Then create a collection in kbs.
After that, show dbs lists kbs (a database appears in the list only once it contains at least one document).

To check your currently selected database, use the command

db

If you want to check your databases list, then use the command

show dbs
In MongoDB, you don't need to create a collection explicitly. MongoDB creates the collection automatically when you insert a document.
>db.tutorialspoint.insert({"name" : "tutorialspoint"})
>show collections
mycol
mycollection
system.indexes
tutorialspoint

First, switch to the database in which you want to create a collection (use Database_Name).

If you want to check your collections list, then use the command

show collections
MongoDB Drop Database

To drop the database

db.dropDatabase()

If you want to delete a database, remember to first switch to the database that you want to delete (use Database_Name) and then execute the db.dropDatabase() command.
MongoDB Datatypes
RDBMS Where Clause Equivalents in
MongoDB
AND in MongoDB

In the find() method, if you pass multiple keys separated by ',', MongoDB treats them as an AND condition.

Basic syntax of AND is shown below:

>db.mycol.find({key1:value1, key2:value2})
OR in MongoDB
To query documents based on the OR condition, you need to use the $or keyword.

Basic syntax of OR is shown below:

>db.mycol.find({$or: [{key1: value1}, {key2:value2}]})
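The AND and OR semantics can be tried without a server using a small in-memory matcher. This is a sketch, not MongoDB's query engine: the functions matches and find and the sample documents are hypothetical, and only equality conditions and a top-level $or are supported.

```javascript
// Hypothetical documents standing in for the `mycol` collection.
const mycol = [
  { title: "MongoDB Overview", by_user: "kbs", likes: 100 },
  { title: "NoSQL Overview", by_user: "kbs", likes: 10 },
  { title: "SQL Basics", by_user: "guest", likes: 100 },
];

// Every key in the query must match (AND); a $or key matches when at
// least one of its sub-queries matches.
function matches(doc, query) {
  return Object.entries(query).every(([key, expected]) =>
    key === "$or"
      ? expected.some((sub) => matches(doc, sub))
      : doc[key] === expected
  );
}

const find = (collection, query) => collection.filter((d) => matches(d, query));

// AND: both conditions must hold.
console.log(find(mycol, { by_user: "kbs", likes: 100 }).map((d) => d.title));
// [ 'MongoDB Overview' ]

// OR: at least one condition must hold.
console.log(
  find(mycol, { $or: [{ by_user: "guest" }, { likes: 10 }] }).map((d) => d.title)
);
// [ 'NoSQL Overview', 'SQL Basics' ]
```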


MongoDB-Update Document

The update() method updates values in the existing document.

>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
Example:

>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})

By default MongoDB will update only a single document; to update multiple documents you need to set the parameter 'multi' to true.

>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}},{multi:true})
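The effect of the multi flag can be illustrated with an in-memory stand-in for update(). The function below is a hypothetical sketch that supports only equality filters and the $set operator; it mirrors the default of stopping after the first match.

```javascript
// Fresh sample data for each call; the documents are invented for illustration.
const makeCollection = () => [
  { _id: 1, title: "MongoDB Overview" },
  { _id: 2, title: "MongoDB Overview" },
  { _id: 3, title: "NoSQL Overview" },
];

// Equality-only filter and $set-only change; returns the number of
// documents that were modified.
function update(collection, filter, change, options = {}) {
  let modified = 0;
  for (const doc of collection) {
    const hit = Object.entries(filter).every(([k, v]) => doc[k] === v);
    if (!hit) continue;
    Object.assign(doc, change.$set);
    modified++;
    if (!options.multi) break; // default: stop after the first match
  }
  return modified;
}

const filter = { title: "MongoDB Overview" };
const change = { $set: { title: "New MongoDB Tutorial" } };

console.log(update(makeCollection(), filter, change));                  // 1
console.log(update(makeCollection(), filter, change, { multi: true })); // 2
```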
MongoDB-Delete Document

MongoDB's remove() method is used to remove documents from the collection.
The remove() method accepts two parameters: the deletion criteria and the justOne flag.
1. deletion criteria: the criteria according to which documents will be removed.
2. justOne: (Optional) if set to true or 1, then remove only one document.

Basic syntax of remove() method is as follows:

>db.COLLECTION_NAME.remove(DELETION_CRITERIA)

Remove only one document

If there are multiple records and you want to delete only the first record, then set the justOne parameter in the remove() method

>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)

Remove All documents

To delete all documents from the collection, pass an empty criteria document. This is the equivalent of SQL's truncate command.
>db.mycol.remove({})
>db.mycol.find()
MongoDB-Sort Documents
To sort documents in MongoDB, you need to use the sort() method.

The sort() method accepts a document containing a list of fields along with their sorting order.

To specify the sorting order, 1 and -1 are used.

1 is used for ascending order while -1 is used for descending order.

Basic syntax of sort() method is as follows:

>db.COLLECTION_NAME.find().sort({KEY:1})
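The meaning of 1 and -1 in the sort document can be sketched with an in-memory comparator. The function sortBy and the sample documents are hypothetical; the sketch only reproduces the ordering rule, not MongoDB's implementation.

```javascript
// Sample documents invented for illustration.
const docs = [
  { title: "B", likes: 10 },
  { title: "A", likes: 30 },
  { title: "C", likes: 20 },
];

// Sort a copy by the fields in `spec`, in order; 1 = ascending, -1 = descending.
function sortBy(collection, spec) {
  return [...collection].sort((a, b) => {
    for (const [key, direction] of Object.entries(spec)) {
      if (a[key] < b[key]) return -direction;
      if (a[key] > b[key]) return direction;
    }
    return 0; // equal on every listed field
  });
}

console.log(sortBy(docs, { likes: 1 }).map((d) => d.title));  // [ 'B', 'C', 'A' ]
console.log(sortBy(docs, { likes: -1 }).map((d) => d.title)); // [ 'A', 'C', 'B' ]
console.log(sortBy(docs, { title: 1 }).map((d) => d.title));  // [ 'A', 'B', 'C' ]
```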
MongoDB Indexing
Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form.

Indexes support the efficient resolution of queries.

Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement.

The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

The ensureIndex() Method

To create an index you need to use the ensureIndex() method of MongoDB. (ensureIndex() is the older name; current MongoDB releases use createIndex() with the same key syntax.)

>db.COLLECTION_NAME.ensureIndex({KEY:1})
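Why an index helps can be sketched by counting how many documents each approach examines. This is a simplified illustration: the functions scan, buildIndex and lookup are hypothetical, and the index here is a hash map rather than the ordered structure described above.

```javascript
// Sample documents invented for illustration.
const docs = [
  { _id: 1, title: "MongoDB Overview", by_user: "kbs" },
  { _id: 2, title: "NoSQL Overview", by_user: "guest" },
  { _id: 3, title: "Indexing", by_user: "kbs" },
];

// Without an index: every document is examined.
function scan(collection, key, value) {
  let examined = 0;
  const result = collection.filter((d) => {
    examined++;
    return d[key] === value;
  });
  return { result, examined };
}

// Index: field value -> positions of the documents holding that value.
function buildIndex(collection, key) {
  const index = new Map();
  collection.forEach((doc, position) => {
    if (!index.has(doc[key])) index.set(doc[key], []);
    index.get(doc[key]).push(position);
  });
  return index;
}

// With an index: only the matching documents are touched.
function lookup(collection, index, value) {
  const positions = index.get(value) || [];
  return { result: positions.map((p) => collection[p]), examined: positions.length };
}

const byUser = buildIndex(docs, "by_user");
console.log(scan(docs, "by_user", "guest").examined);  // 3
console.log(lookup(docs, byUser, "guest").examined);   // 1
```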
MongoDB Aggregation
The aggregate() Method

For the aggregation in MongoDB you should use aggregate()


method.

Basic syntax of aggregate() method is as follows

>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
MongoDB Aggregate Functions
$sum:
Sums the given value over all documents in each group.
db.mycol.aggregate([{$group : {_id: "$by_user", num_tutorial : {$sum:"$likes"}}}])
$avg:
Calculates the average of the given values over all documents in each group.
db.mycol.aggregate([{$group : {_id: "$by_user", num_tutorial : {$avg:"$likes"}}}])
$min:
Gets the minimum of the corresponding values over all documents in each group.
db.mycol.aggregate([{$group : {_id: "$by_user", num_tutorial : {$min:"$likes"}}}])
$max:
Gets the maximum of the corresponding values over all documents in each group.
db.mycol.aggregate([{$group : {_id: "$by_user", num_tutorial : {$max:"$likes"}}}])
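What a $group stage computes can be sketched in plain JavaScript, without a database. The sample documents and the helper groupLikesByUser are assumptions made up for illustration; only the field names by_user and likes come from the examples above.

```javascript
// Hypothetical documents shaped like the mycol examples.
const posts = [
  { title: "MongoDB Overview", by_user: "alice", likes: 100 },
  { title: "NoSQL Overview", by_user: "alice", likes: 10 },
  { title: "Neo4j Overview", by_user: "bob", likes: 750 },
];

// Group documents by keyField, then summarise valueField per group.
function groupLikesByUser(docs, keyField, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc[valueField]);
  }
  return [...groups].map(([key, values]) => {
    const sum = values.reduce((a, b) => a + b, 0);
    return {
      _id: key,
      sum,                        // like $sum
      avg: sum / values.length,   // like $avg
      min: Math.min(...values),   // like $min
      max: Math.max(...values),   // like $max
    };
  });
}

const likeSummary = groupLikesByUser(posts, "by_user", "likes");
console.log(likeSummary);
```

Each output entry corresponds to one distinct by_user value, which is why the accumulators describe a group rather than the whole collection.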
MongoDB Sharding

Sharding is the process of storing data records across multiple
machines, and it is MongoDB's approach to meeting the demands of
data growth.

As the size of the data increases, a single machine may not be
sufficient to store the data or to provide acceptable read and write
throughput. Sharding solves the problem with horizontal scaling.

With sharding, you add more machines to support data growth and
the demands of read and write operations.
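The idea of spreading documents over several machines by a shard key can be sketched as follows. This is a deliberately simplified illustration: the hash function, the modulo placement, and the names hashKey, shardFor, and distribute are all assumptions, not MongoDB's actual chunk-based mechanism.

```javascript
// Simplified stand-in for hashing a shard key (not MongoDB's hash function).
function hashKey(key) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Choose a shard number in the range 0 .. shardCount - 1.
function shardFor(key, shardCount) {
  return hashKey(key) % shardCount;
}

// Place every document on the shard selected by its shard key value.
function distribute(docs, shardKey, shardCount) {
  const shards = Array.from({ length: shardCount }, () => []);
  for (const doc of docs) {
    shards[shardFor(doc[shardKey], shardCount)].push(doc);
  }
  return shards;
}

// Hypothetical data: 100 documents keyed by a made-up cust_id.
const customerDocs = Array.from({ length: 100 }, (_, i) => ({
  cust_id: "cust" + i,
}));
const shards = distribute(customerDocs, "cust_id", 3);
console.log(shards.map((s) => s.length));
```

Because the same key always hashes to the same shard, a query on the shard key only needs to contact one machine, while the data as a whole is shared among all of them.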
Why Sharding ?

In replication, all writes go to the master (primary) node

Latency-sensitive queries still go to the master

A single replica set is limited to 12 nodes (the limit in older
MongoDB releases; it was raised to 50 members in version 3.0)

Memory can't be large enough when the active dataset is big

Local disk is not big enough

Vertical scaling is too expensive


SQL & MongoDB Commands

SQL SELECT statements and the equivalent MongoDB find() statements:

SQL:     SELECT * FROM Teacher_info;
MongoDB: db.Teacher_info.find()

SQL:     SELECT * FROM Teacher_info WHERE sal = 25000;
MongoDB: db.Teacher_info.find( { sal: 25000 } )

SQL:     SELECT Teacher_id FROM Teacher_info WHERE Teacher_id = "pic001";
MongoDB: db.Teacher_info.find( { Teacher_id: "pic001" }, { Teacher_id: 1, _id: 0 } )
SQL & MongoDB Commands

SQL:     SELECT * FROM Teacher_info WHERE status != "A";
MongoDB: db.Teacher_info.find( { status: { $ne: "A" } } )

SQL:     SELECT * FROM Teacher_info WHERE status = "A" AND sal = 20000;
MongoDB: db.Teacher_info.find( { status: "A", sal: 20000 } )

SQL:     SELECT * FROM Teacher_info WHERE status = "A" OR sal = 50000;
MongoDB: db.Teacher_info.find( { $or: [ { status: "A" }, { sal: 50000 } ] } )

SQL:     SELECT * FROM Teacher_info WHERE sal > 40000;
MongoDB: db.Teacher_info.find( { sal: { $gt: 40000 } } )

SQL:     SELECT * FROM Teacher_info WHERE sal < 30000;
MongoDB: db.Teacher_info.find( { sal: { $lt: 30000 } } )
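How these query documents select rows can be sketched with a small matcher in plain JavaScript. The function names matchesQuery and findDocs and the sample data are assumptions for illustration; the sketch supports only equality, $ne, $gt, $lt, and $or, which are the operators used above.

```javascript
// Returns true when doc satisfies every condition in the query document.
function matchesQuery(doc, query) {
  return Object.entries(query).every(([key, condition]) => {
    if (key === "$or") {
      // $or takes an array of sub-queries; one match is enough.
      return condition.some((sub) => matchesQuery(doc, sub));
    }
    const value = doc[key];
    if (condition !== null && typeof condition === "object") {
      // Operator document such as { $gt: 40000 }.
      return Object.entries(condition).every(([op, operand]) => {
        switch (op) {
          case "$ne": return value !== operand;
          case "$gt": return value > operand;
          case "$lt": return value < operand;
          default: throw new Error("Unsupported operator: " + op);
        }
      });
    }
    return value === condition; // plain equality, e.g. { status: "A" }
  });
}

function findDocs(docs, query = {}) {
  return docs.filter((doc) => matchesQuery(doc, query));
}

// Hypothetical Teacher_info rows.
const teacherInfo = [
  { Teacher_id: "pic001", status: "A", sal: 20000 },
  { Teacher_id: "pic002", status: "B", sal: 50000 },
  { Teacher_id: "pic003", status: "A", sal: 45000 },
];

const ids = (docs) => docs.map((d) => d.Teacher_id);
console.log(ids(findDocs(teacherInfo, { status: { $ne: "A" } })));
console.log(ids(findDocs(teacherInfo, { status: "A", sal: 20000 })));
console.log(ids(findDocs(teacherInfo, { $or: [{ status: "A" }, { sal: 50000 }] })));
console.log(ids(findDocs(teacherInfo, { sal: { $lt: 30000 } })));
```

Several fields in one query document behave like SQL AND, while $or must be written out explicitly as an array of alternatives.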
SQL & MongoDB Commands

SQL:     SELECT * FROM Teacher_info WHERE status = "A" ORDER BY sal ASC;
MongoDB: db.Teacher_info.find( { status: "A" } ).sort( { sal: 1 } )

SQL:     SELECT * FROM Teacher_info WHERE status = "A" ORDER BY sal DESC;
MongoDB: db.Teacher_info.find( { status: "A" } ).sort( { sal: -1 } )

SQL:     SELECT COUNT(*) FROM Teacher_info;
MongoDB: db.Teacher_info.count()
         or
         db.Teacher_info.find().count()

SQL:     SELECT DISTINCT(Dept_name) FROM Teacher_info;
MongoDB: db.Teacher_info.distinct( "Dept_name" )
Update Records

SQL:     UPDATE Teacher_info SET Dept_name = "ETC" WHERE sal > 25000;
MongoDB: db.Teacher_info.update( { sal: { $gt: 25000 } }, { $set: { Dept_name: "ETC" } }, { multi: true } )

SQL:     UPDATE Teacher_info SET sal = sal + 10000 WHERE status = "A";
MongoDB: db.Teacher_info.update( { status: "A" }, { $inc: { sal: 10000 } }, { multi: true } )
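The effect of $set, $inc, and the multi option can be sketched on an in-memory array. The helper updateDocs, its predicate argument, and the sample rows are assumptions for illustration; the real update() takes a query document rather than a JavaScript predicate.

```javascript
// Applies $set and $inc to matching documents, in place.
function updateDocs(docs, predicate, update, options = {}) {
  let modified = 0;
  for (const doc of docs) {
    if (!predicate(doc)) continue;
    for (const [field, value] of Object.entries(update.$set || {})) {
      doc[field] = value;                      // $set overwrites the field
    }
    for (const [field, amount] of Object.entries(update.$inc || {})) {
      doc[field] = (doc[field] || 0) + amount; // $inc adds to the field
    }
    modified += 1;
    if (!options.multi) break; // without multi, only the first match changes
  }
  return modified;
}

// Hypothetical Teacher_info rows.
const staff = [
  { Teacher_id: "pic001", Dept_name: "COMP", sal: 20000, status: "A" },
  { Teacher_id: "pic002", Dept_name: "IT", sal: 30000, status: "A" },
  { Teacher_id: "pic003", Dept_name: "COMP", sal: 40000, status: "B" },
];

const movedToEtc = updateDocs(
  staff,
  (d) => d.sal > 25000,
  { $set: { Dept_name: "ETC" } },
  { multi: true }
);
const raised = updateDocs(
  staff,
  (d) => d.status === "A",
  { $inc: { sal: 10000 } },
  { multi: true }
);
console.log(movedToEtc, raised, staff);
```

Leaving out { multi: true } mirrors the shell's default, where update() changes only the first matching document.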
Delete Records

SQL:     DELETE FROM Teacher_info WHERE Teacher_id = "pic001";
MongoDB: db.Teacher_info.remove( { Teacher_id: "pic001" } )

SQL:     DELETE FROM Teacher_info;
MongoDB: db.Teacher_info.remove({})
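The difference between removing one match, all matches, and everything can be sketched on an in-memory array. The helper removeMatching and the sample rows are assumptions for illustration; the sketch supports equality criteria only and returns a new array instead of modifying a collection.

```javascript
// Mimics the effect of remove(criteria, justOne) on a plain array.
function removeMatching(docs, criteria, justOne = false) {
  // A document matches when every field in criteria is strictly equal.
  const isMatch = (doc) =>
    Object.keys(criteria).every((key) => doc[key] === criteria[key]);

  const kept = [];
  let removed = 0;
  for (const doc of docs) {
    const shouldRemove = isMatch(doc) && !(justOne && removed >= 1);
    if (shouldRemove) {
      removed += 1;
    } else {
      kept.push(doc);
    }
  }
  return { kept, removed };
}

// Hypothetical Teacher_info rows.
const teacherDocs = [
  { Teacher_id: "pic001", Dept_name: "COMP" },
  { Teacher_id: "pic002", Dept_name: "COMP" },
  { Teacher_id: "pic003", Dept_name: "IT" },
];

const removeOne = removeMatching(teacherDocs, { Dept_name: "COMP" }, true);
const removeMany = removeMatching(teacherDocs, { Dept_name: "COMP" });
const removeAll = removeMatching(teacherDocs, {});
console.log(removeOne.removed, removeMany.removed, removeAll.removed); // 1 2 3
```

An empty criteria document matches every document, which is why remove({}) empties the collection.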


Find and findOne
The basic difference between findOne() and find():

findOne(): if the query matches, the first matching document is
returned; otherwise null is returned.

find(): no matter how many documents match, a cursor is returned,
never null.
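The two return conventions can be sketched in plain JavaScript. The names findOneSketch and findSketch are assumptions, and a generator is used here only as a stand-in for a cursor.

```javascript
// Returns the first matching document, or null when nothing matches.
function findOneSketch(docs, predicate) {
  for (const doc of docs) {
    if (predicate(doc)) return doc;
  }
  return null;
}

// A generator plays the role of the cursor: the object always exists,
// even when it yields no documents.
function* findSketch(docs, predicate) {
  for (const doc of docs) {
    if (predicate(doc)) yield doc;
  }
}

const people = [{ name: "abc" }, { name: "xyz" }, { name: "xyz" }];
const isXyz = (doc) => doc.name === "xyz";
const isNobody = (doc) => doc.name === "nobody";

console.log(findOneSketch(people, isNobody));        // null
console.log([...findSketch(people, isNobody)]);      // []
console.log([...findSketch(people, isXyz)].length);  // 2
```

Code that calls findOne() therefore has to check for null, while code that calls find() iterates over a possibly empty result.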
Limits, Skips, and Sorts
The most common query options are limiting the number of results returned,
skipping a number of results, and sorting.

Limits: limiting the number of results returned.

To set a limit, chain the limit function onto your call to find.
For example, to return only three results, use this:

> db.c.find().limit(3)

If there are fewer than three documents matching your query in the collection,
only the number of matching documents will be returned; limit sets an upper
limit, not a lower limit.
Skip: skip works similarly to limit:
> db.c.find().skip(3)

This will skip the first three matching documents and return the rest of the
matches. If there are three or fewer matching documents in your collection,
it will not return any documents.
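On an already matched result list, skip and limit behave like array slicing. The helpers skipLimit and page below are assumptions for illustration; the paging formula is a common idiom rather than something the slides prescribe, and limit(0), which MongoDB treats as "no limit", is not modelled.

```javascript
// skip drops documents from the front, limit caps how many are returned.
function skipLimit(results, skip = 0, limit = Infinity) {
  return results.slice(skip, skip + limit);
}

// Paging: page n (starting at 1) skips the (n - 1) pages before it.
function page(results, pageNumber, pageSize) {
  return skipLimit(results, (pageNumber - 1) * pageSize, pageSize);
}

const matched = ["d1", "d2", "d3", "d4", "d5"];

console.log(skipLimit(matched, 0, 3));      // [ 'd1', 'd2', 'd3' ]
console.log(skipLimit(matched, 3));         // [ 'd4', 'd5' ]
console.log(skipLimit(["d1", "d2"], 0, 3)); // [ 'd1', 'd2' ]  (upper limit only)
console.log(skipLimit(["d1", "d2"], 3));    // []
console.log(page(matched, 2, 2));           // [ 'd3', 'd4' ]
```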
Sort
To sort documents in MongoDB, use the sort() method.

sort() accepts a document containing a list of fields along with their
sort order.

The sort order is specified with 1 and -1: 1 means ascending order and
-1 means descending order.

The basic syntax of the sort() method is as follows:

>db.COLLECTION_NAME.find().sort({KEY:1})
> db.student.find();
{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("5982e5bdf6bd1b6d87588a1b"), "name" : "xyz", "pin" : 6787, "addrss" : "pune" }
{ "_id" : ObjectId("5982e5c6f6bd1b6d87588a1c"), "name" : "xyz", "pin" : 6787, "addrss" : "mumbai" }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }

Limits
> db.student.find().limit(3);
{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("5982e5bdf6bd1b6d87588a1b"), "name" : "xyz", "pin" : 6787, "addrss" : "pune" }

Skips
> db.student.find().skip(3);
{ "_id" : ObjectId("5982e5c6f6bd1b6d87588a1c"), "name" : "xyz", "pin" : 6787, "addrss" : "mumbai" }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }
Sort
> db.student.find().sort({name:1});
{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }
{ "_id" : ObjectId("5982e5bdf6bd1b6d87588a1b"), "name" : "xyz", "pin" : 6787, "addrss" : "pune" }
{ "_id" : ObjectId("5982e5c6f6bd1b6d87588a1c"), "name" : "xyz", "pin" : 6787, "addrss" : "mumbai" }

> db.student.find().sort({name:-1});
{ "_id" : ObjectId("5982e5bdf6bd1b6d87588a1b"), "name" : "xyz", "pin" : 6787, "addrss" : "pune" }
{ "_id" : ObjectId("5982e5c6f6bd1b6d87588a1c"), "name" : "xyz", "pin" : 6787, "addrss" : "mumbai" }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }

> db.student.find().sort({name:-1,addrss:1});
{ "_id" : ObjectId("5982e5c6f6bd1b6d87588a1c"), "name" : "xyz", "pin" : 6787, "addrss" : "mumbai" }
{ "_id" : ObjectId("5982e5bdf6bd1b6d87588a1b"), "name" : "xyz", "pin" : 6787, "addrss" : "pune" }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }
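How a sort document with several keys orders the results can be sketched with a comparator in plain JavaScript. The helpers compareValues and sortDocs are assumptions for illustration, the _id fields are left out for brevity, and the sketch assumes that a missing field sorts before any present value.

```javascript
// The student documents from the shell session, without _id.
const students = [
  { name: "abc", pin: 1213331 },
  { name: "pqr", pin: 78776787 },
  { name: "xyz", pin: 6787, addrss: "pune" },
  { name: "xyz", pin: 6787, addrss: "mumbai" },
  { name: "umesh", pin: 5654787, addrss: "Nagpure" },
];

// Assumption for this sketch: a missing field sorts before any present value.
function compareValues(a, b) {
  if (a === undefined && b === undefined) return 0;
  if (a === undefined) return -1;
  if (b === undefined) return 1;
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

// spec is a sort document such as { name: -1, addrss: 1 }.
function sortDocs(docs, spec) {
  const keys = Object.entries(spec);
  return [...docs].sort((x, y) => {
    for (const [field, direction] of keys) {
      const result = compareValues(x[field], y[field]) * direction;
      if (result !== 0) return result; // earlier keys take priority
    }
    return 0; // equal on every key
  });
}

const sortedStudents = sortDocs(students, { name: -1, addrss: 1 });
console.log(sortedStudents.map((d) => d.name + ":" + (d.addrss || "-")));
// [ 'xyz:mumbai', 'xyz:pune', 'umesh:Nagpure', 'pqr:-', 'abc:-' ]
```

The second key is consulted only when two documents tie on the first, which is why only the two "xyz" documents change places compared with sort({name:-1}).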
Combination of limits-Skips-Sort
> db.student.find().limit(3).sort({"name":1});

{ "_id" : ObjectId("59798f8ff04853461a205087"), "name" : "abc", "pin" : 1213331 }
{ "_id" : ObjectId("5982e590f6bd1b6d87588a1a"), "name" : "pqr", "pin" : 78776787 }
{ "_id" : ObjectId("5982e5dcf6bd1b6d87588a1d"), "name" : "umesh", "pin" : 5654787, "addrss" : "Nagpure" }
Introduction To Map Reduce
MapReduce is a programming model and an associated
implementation for processing and generating large data sets with
a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that
performs filtering and sorting (such as sorting students by first
name into queues, one queue for each name) and a Reduce()
procedure that performs a summary operation (such as counting the
number of students in each queue, yielding name frequencies).

Map-reduce is a data processing paradigm for condensing large
volumes of data into useful aggregated results.

For map-reduce operations, MongoDB provides the mapReduce
database command.
Consider the following map-reduce operation
In this map-reduce operation, MongoDB applies
the map phase to each input document (i.e. the documents in
the collection that match the query condition).
The map function emits key-value pairs.
For those keys that have multiple values, MongoDB applies
the reduce phase, which collects and condenses the
aggregated data.
MongoDB then stores the results in a collection.
Optionally, the output of the reduce function may pass
through a finalize function to further condense or process the
results of the aggregation.
mapReduce can return the results of a map-reduce operation
as a document, or may write the results to collections.
How Map Reduce works
Input map to the mapper:
item1 -> bread, item2 -> cucumber, item3 -> green pepper, item4 ->
tomato, item5 -> lettuce, item6 -> onion
Output map of the mapper:
item1 -> sliced bread, item2 -> sliced cucumber, item3 -> chopped
green pepper, item4 -> sliced tomato, item5 -> chopped lettuce,
item6 -> sliced onion
Output map of the reducer:
veggie sub
The map function is a JavaScript function that associates or “maps” a value
with a key and emits the key and value pair during a map-reduce operation.

Example:
Problem: A king wants to count the total population of his country.

Solution 1: He sends one person to count the population. The assigned
person visits every city serially and returns with the total population
of the country.

Solution 2: The king sends one person to each city. Each person counts the
population of one city, and after they return to the kingdom the per-city
counts are reduced to a single count (by adding the population of each city).

Here all persons count their cities at the same time.

The first solution is serial and the second is parallel.

Assuming there are 10 cities in the kingdom, if it takes one day to count the
population of each city, then counting the total population takes 10 days in
solution 1 and only one day in solution 2.
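The king's second solution can be sketched as a map step followed by a reduce step. The census data below is made up for illustration, and the code runs the "counters" one after another; it only shows the shape of the computation, not real parallel execution.

```javascript
// Hypothetical census: each city is a list of household sizes.
const kingdom = {
  cityA: [4, 3, 5],
  cityB: [2, 6],
  cityC: [1, 1, 3, 4],
};

// Map step: one counter per city reports (city, population).
const perCity = Object.entries(kingdom).map(([city, households]) => ({
  city,
  population: households.reduce((sum, n) => sum + n, 0),
}));

// Reduce step: combine the per-city counts into one total.
const totalPopulation = perCity.reduce((sum, c) => sum + c.population, 0);

// Elapsed time under the assumption of one day per city.
const daysSerial = perCity.length; // one person visits every city in turn
const daysParallel = 1;            // one person per city, all at the same time

console.log(perCity);
console.log(totalPopulation, daysSerial, daysParallel); // 29 3 1
```

The per-city counts are independent of each other, which is what allows the map step to be spread over many workers.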
In MapReduce we have to write three functions:

1. Map function (e.g. sending one person to each city to count its
population).
2. Reduce function (e.g. reducing the per-city population counts to a
single value).
3. MapReduce function (it creates a new collection that contains the
total population).
Map Reduce
db.createCollection("orders")

db.orders.insert({cust_id: "abc123", ord_date: new Date("Oct 04, 2012"), price: 25})
db.orders.insert({cust_id: "xyz123", ord_date: new Date("Oct 05, 2012"), price: 50})
db.orders.insert({cust_id: "xyz456", ord_date: new Date("Oct 06, 2012"), price: 60})
db.orders.insert({cust_id: "xyz789", ord_date: new Date("Oct 07, 2012"), price: 70})
db.orders.insert({cust_id: "xyz789", ord_date: new Date("Oct 07, 2012"), price: 90})
db.orders.insert({cust_id: "xyz789", ord_date: new Date("Oct 07, 2012"), price: 90})

Each document in this collection contains a customer id, an order date,
and a price.
Now we can write a map-reduce operation that returns the total price
per customer.
Step 1: Map

var mapFunction1 = function() { emit(this.cust_id, this.price);};

Define the map function to process each input document:

In the function, this refers to the document that the map-reduce
operation is processing.

The function maps the price to the cust_id for each document and
emits the cust_id and price pair.
Step 2: Reduce
var reduceFunction1 = function(keyCustId, valuesPrices)
{ return Array.sum(valuesPrices); };

Define the corresponding reduce function with two arguments,
keyCustId and valuesPrices:

valuesPrices is an array whose elements are the price values emitted
by the map function and grouped by keyCustId.

The function reduces the valuesPrices array to the sum of its elements.
Step 3: Map Reduce
db.orders.mapReduce(
mapFunction1,
reduceFunction1,
{ out: "map_example" }
)

Perform the map-reduce on all documents in the orders collection
using the mapFunction1 map function and the reduceFunction1
reduce function.
This operation outputs the results to a collection named
map_example.
If the map_example collection already exists, the operation will
replace its contents with the results of this map-reduce operation.
Return the Total Price Per Customer
var mapFunction1 = function() {
emit(this.cust_id, this.price);
};

var reduceFunction1 = function(keyCustId, valuesPrices) {
return Array.sum(valuesPrices);
};

db.orders.mapReduce(
mapFunction1,
reduceFunction1,
{ out: "map_example" }
)

db.map_example.find();
Create a collection named orders.
Insert the following documents into the orders collection:
{ "cust_id" : "abc123", "ord_date" : new Date("Oct 04,2012"), "price" : 25 }
{ "cust_id" : "xyz123", "ord_date" : new Date("Oct 05,2012"), "price" : 50 }
{ "cust_id" : "xyz456", "ord_date" : new Date("Oct 06,2012"), "price" : 60 }
{ "cust_id" : "xyz789", "ord_date" : new Date("Oct 07,2012"), "price" : 70 }
{ "cust_id" : "xyz789", "ord_date" : new Date("Oct 06,2012"), "price" : 90 }
{ "cust_id" : "xyz789", "ord_date" : new Date("Oct 06,2012"), "price" : 90 }
{ "cust_id" : "abc123", "ord_date" : new Date("Oct 06,2012"), "price" : 90 }

Create the map function as follows:

var mapFunction1 = function() {
emit(this.cust_id, this.price); };

Create the reduce function as follows:

var reduceFunction1 = function(keyCustId, valuesPrices) {
return Array.sum(valuesPrices); };
Run the map-reduce operation as follows:
db.orders.mapReduce(
mapFunction1,reduceFunction1,
{ out: "map_example" } )

The shell prints a summary of the operation. Seven documents were read
and seven pairs emitted; reduce ran for the two customers that have more
than one order, and four result documents were written:
{
"result" : "map_example",
"timeMillis" : 75,
"counts" : {
"input" : 7,
"emit" : 7,
"reduce" : 2,
"output" : 4
},
"ok" : 1
}
Find the documents in map_example as follows; the result gives the
total price per customer:

db.map_example.find()
{ "_id" : "abc123", "value" : 115 }
{ "_id" : "xyz123", "value" : 50 }
{ "_id" : "xyz456", "value" : 60 }
{ "_id" : "xyz789", "value" : 250 }
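The whole operation can be traced without a database using a small in-memory sketch. The function mapReduceSketch is an assumption made for illustration, not MongoDB's implementation: here emit is passed to the map function as a parameter (in the shell it is a global), and reduce is called at most once per key, whereas MongoDB may call it several times for the same key.

```javascript
// The seven order documents, without ord_date.
const orders = [
  { cust_id: "abc123", price: 25 },
  { cust_id: "xyz123", price: 50 },
  { cust_id: "xyz456", price: 60 },
  { cust_id: "xyz789", price: 70 },
  { cust_id: "xyz789", price: 90 },
  { cust_id: "xyz789", price: 90 },
  { cust_id: "abc123", price: 90 },
];

function mapReduceSketch(docs, mapFn, reduceFn) {
  const emitted = new Map(); // key -> array of emitted values
  const counts = { input: 0, emit: 0, reduce: 0, output: 0 };
  const emit = (key, value) => {
    counts.emit += 1;
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  for (const doc of docs) {
    counts.input += 1;
    mapFn.call(doc, emit); // `this` is the current document
  }
  const results = [];
  for (const [key, values] of emitted) {
    let value = values[0];
    if (values.length > 1) { // reduce runs only for keys with several values
      counts.reduce += 1;
      value = reduceFn(key, values);
    }
    results.push({ _id: key, value });
  }
  counts.output = results.length;
  return { results, counts };
}

const mapPrice = function (emit) {
  emit(this.cust_id, this.price);
};
const reducePrices = (keyCustId, valuesPrices) =>
  valuesPrices.reduce((sum, price) => sum + price, 0);

const outcome = mapReduceSketch(orders, mapPrice, reducePrices);
console.log(outcome.counts);  // { input: 7, emit: 7, reduce: 2, output: 4 }
console.log(outcome.results);
```

The totals and the counts agree with the shell output above: 115 for abc123, 50 for xyz123, 60 for xyz456, and 250 for xyz789.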
End of Unit No : 5
