
RDBMS

Hadoop @ SunBeam Infotech

▪ Structured and organized data
▪ Structured query language (SQL)
▪ DML, DQL, DDL, DTL, DCL
▪ Data and its relationships are stored in separate tables.
▪ Tight consistency
▪ Based on Codd's rules
▪ ACID transactions
▫ Atomic
▫ Consistent
▫ Isolated
▫ Durable

NoSQL

▪ Stands for Not Only SQL
▪ No declarative query language
▪ No predefined schema; unstructured and unpredictable data
▪ Eventual consistency rather than ACID properties
▪ Based on the CAP theorem
▪ Prioritizes high performance, high availability and scalability
▪ BASE transactions
▫ Basically Available
▫ Soft state
▫ Eventual consistency

Scaling

▪ Scalability is the ability of a system to expand to meet your business needs. E.g. scaling a web app means allowing more people to use your application.
▪ Vertical scaling: add resources within the same logical unit to increase capacity. E.g. add CPUs to an existing server, increase memory in the system or expand storage by adding hard drives.
▪ Horizontal scaling: add more nodes to a system. E.g. adding a new computer to a distributed software application. Based on the principle of distributed computing.
▪ NoSQL databases are designed for horizontal scaling, so they offer reliability, fault tolerance, speed and better performance (at lower cost).
CAP Theorem

▪ Consistency - Data is consistent after an operation. After an update operation, all clients see the same data.
▪ Availability - System is always on (i.e. service guarantee), no downtime.
▪ Partition Tolerance - System continues to function even when communication among the servers is unreliable.

NoSQL Scenarios

▪ When to use NoSQL?
▫ Large amount of data (TBs)
▫ Many read/write ops
▫ Economical scaling
▫ Flexible schema
▪ Examples: social media, recordings, geospatial analysis, information processing
▪ When not to use NoSQL?
▫ Need ACID transactions
▫ Fixed multiple relations
▫ Need joins
▫ Need high consistency
▪ Examples: financial transactions, business operations

NoSQL Advantages/Problems

▪ Advantages:
▫ High scalability
▫ Distributed computing
▫ Lower cost
▫ Flexible schema / semi-structured data
▫ No complex relationships
▪ Disadvantages:
▫ No standardization
▫ Limited query support (work in progress)
▫ Eventual consistency

NoSQL Categories

▪ Key-value databases - e.g. Redis, DynamoDB, Riak, ...
▫ Based on Amazon's Dynamo database.
▫ For handling huge data of any type.
▫ Keys are unique and values can be of any type i.e. JSON, BLOB, etc.
▫ Implemented as a big distributed hash-table for fast searching.
▪ Column-oriented databases - e.g. HBase, Cassandra, BigTable, …
▫ Values of columns are stored contiguously.
▫ Better performance while accessing few columns and aggregations.
▫ Good for data-warehousing, business intelligence, CRM, ...
NoSQL Categories

▪ Graph databases - e.g. Neo4J, Titan, …
▫ A graph is a collection of vertices and edges (lines connecting vertices).
▫ Vertices keep data, while edges represent relationships.
▫ Each node knows its adjacent nodes. Very good performance when you want to access all relations of an entity (irrespective of the size of data).
▪ Document oriented databases - e.g. MongoDb, CouchDb, …
▫ A document contains data as key-value pairs, as JSON or XML.
▫ Document schema is flexible; documents are added to a collection for processing.
▫ RDBMS tables → Collections
▫ RDBMS rows → Documents
▫ RDBMS columns → Key-value pairs in document

MongoDb

▪ Developed by 10gen in 2007. Publicly available in 2009.
▪ Open-source database (github.com) - controlled by 10gen.
▪ Document oriented database → stores JSON documents, e.g.:

{ "name": "Nilesh Ghule", "age": 34, "salary": 30000.00,
  "permanent": true, hiredate: ISODate("2004-05-31"),
  "skills": [ "Java", "OS", "Hadoop", "C", "C++", "Java EE", "ARM" ],
  "contact" : { "email" : "[email protected]",
                "mobile" : "9527331338" }
}

MongoDb: Data Types

▪ In MongoDb, JSON data is stored in binary form, i.e. BSON.

  Type      BSON type #   Examples
  null      10
  boolean   8             true, false
  number    1 / 16 / 18   123, 456.78, NumberInt("24"), NumberLong("28")
  string    2             "...."
  date      9             new Date(), ISODate("yyyy-mm-ddThh:mm:ss")
  array     4             [ …, …, …, … ]
  object    3             { … }

MongoDb Server & Client

▪ The MongoDb server (mongod) is developed in C, C++ and JS.
▪ MongoDb data is accessed via multiple client tools:
▫ mongo : client shell (JS).
▫ mongofiles : stores larger files in GridFS.
▫ mongoimport / mongoexport : tools for data import / export.
▫ mongodump / mongorestore : tools for backup / restore.
▪ MongoDb data can be accessed in applications through client drivers available for all major programming languages e.g. Java, Python, Ruby, PHP, Perl, …
▪ The mongo shell follows JS syntax and allows executing JS scripts.
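The import/export and backup tools are driven from the command line. A minimal sketch, assuming a test database and a contacts.json file with one JSON document per line (names illustrative):

# import / export a single collection
mongoimport --db test --collection contacts --file contacts.json
mongoexport --db test --collection contacts --out contacts-backup.json

# backup / restore a whole database
mongodump --db test --out /backup/dump
mongorestore --db test /backup/dump/test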
Mongo - INSERT

▪ show databases;
▪ use database;
▪ db.contacts.insert({name: "nilesh", mobile: "9527331338"});
▪ db.contacts.insertMany([
    {name: "nilesh", mobile: "9527331338"},
    {name: "nitin", mobile: "9881208115"}
  ]);
▪ Maximum document size is 16 MB.
▪ For each object a unique id is generated by the client (if _id is not provided).
▫ 12-byte unique id :: [timestamp(4) | machine(3) | pid(2) | counter(3)]

Mongo - QUERY

▪ db.contacts.find(); → returns a cursor on which the following ops are allowed:
▫ hasNext(), next(), skip(n), limit(n), count(), toArray(), forEach(fn), pretty()
▪ The shell fetches 20 records at a time; type "it" for more records.
▪ db.contacts.find( { name: "nilesh" } );
▪ db.contacts.find( { name: "nilesh" }, { _id:0, name:1 } );
▪ Relational operators: $eq, $ne, $gt, $lt, $gte, $lte, $in, $nin
▪ Logical operators: $and, $or, $nor, $not
▪ Element operators: $exists, $type
▪ Evaluation operators: $regex, $where, $mod
▪ Array operators: $size, $elemMatch, $all, $slice
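A few query sketches combining the cursor methods and operators above, against the contacts collection (values illustrative):

// names from a list, projecting name only
db.contacts.find( { name: { $in: ["nilesh", "nitin"] } }, { _id: 0, name: 1 } );

// docs that have a mobile field and whose name starts with "ni"
db.contacts.find( { $and: [ { mobile: { $exists: true } }, { name: { $regex: /^ni/ } } ] } );

// pagination: skip the first 10 matches, return the next 5
db.contacts.find().skip(10).limit(5).pretty();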

Mongo - DELETE

▪ db.contacts.remove(criteria);
▪ db.contacts.deleteOne(criteria);
▪ db.contacts.deleteMany(criteria);
▪ db.contacts.deleteMany({}); → deletes all docs, but not the collection
▪ db.contacts.drop(); → deletes all docs & the collection as well : efficient

Mongo - UPDATE

▪ db.contacts.update(criteria, newObj);
▪ Update operators: $set, $inc, $push, $each, $slice, $pull, $addToSet (to decrement, use $inc with a negative value)
▪ In-place updates (e.g. $inc, …) are faster than setting a new object. If the new object's size does not match the older object's, data files get fragmented.
▪ Example: db.contacts.update( { name: "peter" },
    { $push : { mobile: { $each : ["111", "222" ], $slice : -3 } } } );
▪ db.contacts.update( { name: "t" }, { $set : { "phone" : "123" } }, true );
▫ If no doc matches the given criteria, a new one is created before the update (upsert).
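A sketch of in-place updates, assuming the contacts documents carry a numeric age field and a skills array (both illustrative):

// $inc modifies the value in place - no document rewrite
db.contacts.update( { name: "nilesh" }, { $inc: { age: 1 } } );
db.contacts.update( { name: "nilesh" }, { $inc: { age: -1 } } );  // decrement

// $addToSet pushes only if the value is not already present
db.contacts.update( { name: "nilesh" }, { $addToSet: { skills: "MongoDB" } } );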
Data Modeling

▪ Embedded Data Models
▫ { name : { fname : "Nilesh", lname : "Ghule" }, age : 34 }
▫ Suitable for one-to-one or one-to-many relationships.
▫ Faster read operations: related data is fetched in a single db operation.
▫ Atomic update of a document.
▫ Document growth reduces write performance and may lead to fragmentation.
▪ Normalized Data Models
▫ contacts → { _id: 11, email : "[email protected]", mobile: "9527331338" }
▫ persons → { _id: 1, name : "Nilesh Ghule", contact: 11 }
▫ Preferred for complex many-to-many relationships.
▫ Reduces data duplication.
▫ Can use DBRef() to store a document reference:
◾ contacts → { _id: 11, email : "[email protected]", mobile: "9527331338" }
◾ persons → { _id: 1, name: "Nilesh Ghule", contact: { $ref : "contacts", $id: 11, $db : "test" } }
▫ DBRefs are not supported in all client drivers.
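Querying the two models differs accordingly; a sketch, assuming both document shapes live in the persons/contacts collections above:

// embedded model: dot notation reaches into the sub-document
db.persons.find( { "name.fname": "Nilesh" } );

// normalized model: two round trips - fetch the person, then its contact
var p = db.persons.findOne( { name: "Nilesh Ghule" } );
db.contacts.findOne( { _id: p.contact } );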

Mongo - MapReduce

▪ For processing large volumes of data into useful aggregate results.
▪ MongoDb applies the map phase to each input document. The map function emits key-value pairs.
▪ For keys with multiple values, MongoDb applies the reduce phase.
▪ All map-reduce functions are written in JS and executed in the mongod server process.
▪ Map-reduce works on a single collection's data.
▪ The output of map-reduce can be written into some collection.
▪ The input & output collections can be sharded.
▪ The aggregation framework provides better performance. MR is used for functionality not available in the aggregation framework.

Mongo - Aggregation Pipeline

▪ db.collection.aggregate( [ { stage1 }, { stage2 }, ... ] );
▪ $project → select columns (existing or computed)
▪ $match → where clause (criteria)
▪ $group → group by
▫ { $group: { _id: <expr>, <field1>: { <accum1> : <expr1> }, ... } }
▫ The possible accumulators are: $sum, $avg, ...
▪ $unwind → extract array elements from an array field
▪ $lookup → left outer join
▫ { $lookup: { from: other_col, localField: cur_col_field, foreignField: other_col_field, as: arr_field_alias } }
▪ $out → put the result of the pipeline in another collection (last stage)
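A pipeline sketch, assuming a hypothetical emps collection with dept and salary fields ($sort is a standard stage not listed above):

db.emps.aggregate( [
  { $match: { salary: { $gt: 10000 } } },            // where clause
  { $group: { _id: "$dept",                          // group by dept
              total: { $sum: "$salary" },
              avgSal: { $avg: "$salary" } } },
  { $sort: { total: -1 } },                          // order by total, descending
  { $out: "dept_salaries" }                          // write result to a collection
] );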
Mongo - Indexes

▪ db.books.find( { "subject" : "C" } ).explain(true);
▪ explain() → explains the query execution plan.
▪ The above query by default does a full collection scan, hence it is slower.
▪ db.books.createIndex( { "subject" : 1 } );
▪ Searching on indexed fields reduces query execution time.
▪ Options can be provided (2nd arg): { unique : true }
▫ Duplicate values are not allowed in that field.
▪ By default the "_id" field is indexed in MongoDb (unique index).
▪ db.books.getIndexes();
▪ db.books.dropIndex({ "subject" : 1 });

Mongo - GeoSpatial Queries

▪ MongoDb supports three types of geospatial queries:
▫ 2d index: traditional long-lat; used in older MongoDb (2.2 and earlier).
▫ 2dsphere index: data can be stored as GeoJSON.
▫ geoHaystack: queries on a very small area; not much used.
▪ GeoJSON stores geometry and coordinates: http://geojsonlint.com/
▫ { type: "<geometry>", coordinates: [ long, lat ] };
▫ { type: "Point", coordinates: [ 73.86704859999998, 18.4898445 ] };
▪ Possible geometry types are: Point, LineString, MultiLineString, Polygon
▪ Allowed queries: inclusion - $geoWithin, intersection - $geoIntersects, proximity - $near
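A sketch of verifying that an index is actually used, via the standard executionStats explain mode:

db.books.createIndex( { "subject" : 1 } );
db.books.find( { "subject" : "C" } ).explain("executionStats");
// in the winningPlan, look for IXSCAN (index scan) instead of COLLSCAN
// (full collection scan), and compare totalDocsExamined before/after indexing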

Mongo - GeoSpatial Queries

▪ For faster execution create a geo-index:
▫ db.busstops.createIndex( { location : "2dsphere" } );
▪ Example proximity query:

db.busstops.find( { location : { $near : {
    $geometry : { type : "Point" ,
                  coordinates : [73.86704859999998, 18.4898445] },
    $maxDistance : 200
} } } );

Mongo - Capped Collections

▪ Capped collections are fixed-size collections for high-throughput insert and retrieve operations.
▪ They maintain the order of insertion without any indexing overhead.
▪ The oldest documents are auto-removed to make room for new records. The size of the collection must be specified at creation.
▪ Update operations should be done with an index for better performance. If an update operation changes the document size, the operation fails.
▪ Records cannot be deleted from capped collections, but the collection can be dropped.
▪ Capped collections cannot be sharded.
▪ db.createCollection("logs", { capped: true, size: 4096 }); → if size is below 4096, 4096 is used. Higher sizes are rounded up to a multiple of 256.
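A capped-collection sketch (collection name and fields illustrative; max is an optional document-count limit):

db.createCollection("logs", { capped: true, size: 4096, max: 100 });
db.logs.isCapped();                                // → true
db.logs.insert( { ts: new Date(), msg: "service started" } );
db.logs.find().sort( { $natural: -1 } ).limit(1);  // newest entry, by insertion order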
Mongo - WiredTiger Storage

▪ A storage engine manages data in memory and on disk.
▪ From MongoDB 3.2 onwards the default storage engine is WiredTiger; in earlier versions it was MMAPv1.
▪ WiredTiger storage engine:
▫ Uses document-level optimistic locking for better performance.
▫ Per operation a snapshot is created from consistent data in memory.
▫ The snapshot is written on disk, known as a checkpoint → used for recovery.
▫ Checkpoints are created every 60 seconds or per 2 GB of journal data.
▫ The old checkpoint is released when the new checkpoint has been written on disk and updated in the system tables.
▫ To recover changes made after a checkpoint, enable journaling.
▪ WT uses a write-ahead transaction log (journal) to ensure durability.
▪ It creates one journal record for each client-initiated write operation.
▪ The journal persists all data modifications between checkpoints.
▪ Journal records are buffered in memory and synced to disk every 50 ms.
▪ WiredTiger stores all collections & journals in compressed form.
▪ Recovery process with journaling:
▫ Get the last checkpoint id from the data files.
▫ Search the journal file for records matching the last checkpoint.
▫ Apply the operations in the journal since the last checkpoint.
▪ WiredTiger uses an internal cache sized as the larger of 256 MB and 50% of (RAM - 1 GB), along with the file system cache.
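The engine and its cache size can be chosen when starting the server; a sketch (path and size illustrative):

# explicit engine selection and a 1 GB WiredTiger cache
mongod --dbpath /data/db --storageEngine wiredTiger --wiredTigerCacheSizeGB 1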

Mongo - GridFS

▪ GridFS is a specification for storing/retrieving files exceeding 16 MB.
▪ GridFS stores a file by dividing it into chunks of 255 KB. When queried back, the driver collects the chunks as requested; the query can be a range query. Due to chunking, parts of a file can be accessed without loading the whole file into memory.
▪ It uses two collections for storing files, i.e. fs.chunks and fs.files.
▫ fs.chunks: _id, files_id, n, data
▫ fs.files: _id, length, chunkSize, uploadDate, md5, filename, contentType
▪ It is also useful to keep files and metadata synced and deployed automatically across a geographically distributed replica set.
▪ GridFS should not be used when there is a need to update the contents of an entire file atomically.
▪ It can be accessed using the mongofiles tool or a compliant client driver.
▪ Files can be searched using:
▫ db.fs.files.find( { filename: myFileName } );
▫ db.fs.chunks.find( { files_id: myFileID } ).sort( { n: 1 } );
▫ GridFS automatically creates indexes for faster search.
▪ mongofiles:
▫ mongofiles.exe -d test put nilesh.jpg
▫ mongofiles.exe -d test get nilesh.jpg
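Chunking can be verified from the shell; a sketch assuming nilesh.jpg was stored with mongofiles as above:

var f = db.fs.files.findOne( { filename: "nilesh.jpg" } );
f.length;                                   // total file size in bytes
f.chunkSize;                                // 255 KB (261120 bytes) by default
db.fs.chunks.count( { files_id: f._id } );  // ≈ length / chunkSize chunks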
Mongo - Replication

▪ A replica set is a group of mongod instances that maintain the same data set.
▪ Only one member is deemed the primary node, while the other nodes are deemed secondary nodes.
▪ The secondaries replicate the primary's oplog.
▪ If the primary is unavailable, an eligible secondary will become primary.
▪ Replica-set members communicate with each other via heartbeats.
▪ Secondaries apply operations from the primary asynchronously.
▪ When a secondary cannot communicate with the primary for more than 10 seconds, the secondaries hold an election to elect a new primary. This automatic failover process takes about a minute.
▪ An arbiter (which does not store data) can be added to the system (when there is an even number of data-bearing members) to maintain a quorum in elections.
▪ By default clients read from the primary, but a read preference for secondaries can be set. Reading from a secondary may not reflect the state of the primary; a read from the primary may also return data before it is durable.
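A minimal replica-set setup sketch (hostnames, ports and paths illustrative):

// start three mongod instances with the same replica set name, e.g.
//   mongod --replSet rs0 --port 27017 --dbpath /data/r1   (likewise r2/r3)
rs.initiate( {
  _id: "rs0",
  members: [ { _id: 0, host: "localhost:27017" },
             { _id: 1, host: "localhost:27018" },
             { _id: 2, host: "localhost:27019" } ]
} );
rs.status();      // check PRIMARY/SECONDARY states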

Mongo - Sharding

▪ Sharding is a method for distributing large data across multiple machines.
▪ This is MongoDb's approach to horizontal scaling / scaling out.
▪ shard: part of a collection on each server (replica set).
▪ mongos: query router between client & cluster.
▪ config servers: hold metadata & config settings of the cluster.
▪ Collections can be sharded across the servers based on shard keys.
▪ Shard keys:
▫ Consist of an immutable field/fields present in each document.
▫ Only one shard key, to be chosen when sharding the collection. The shard key cannot be changed later.
▫ The collection must have an index starting on the shard key.
▫ The choice of shard key affects performance.
▪ Advantages:
▫ Read/write load sharing
▫ High storage capacity
▫ High availability
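Sharding a collection from the mongos shell; a sketch (database, collection and key illustrative):

sh.enableSharding("test");                           // enable sharding for the db
db.contacts.createIndex( { name: 1 } );              // index starting on the shard key
sh.shardCollection("test.contacts", { name: 1 });    // ranged sharding on name
sh.status();                                         // shards & chunk distribution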
Mongo - Sharding

▪ Sharding strategies:
▫ Hashed sharding
◾ MongoDb computes a hash of the shard key field's value.
◾ Each chunk is assigned a range of docs based on the hashed key.
◾ Even data distribution across the shards; however, range-based queries will target multiple shards.
▫ Ranged sharding
◾ Divides data into ranges based on shard key values.
◾ mongos can target only those shards on which the queried range is available.
◾ The efficiency of sharding depends on choosing a proper shard key.
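The two strategies differ only in the key specification; a sketch (collections and fields hypothetical):

sh.shardCollection("test.logs", { _id: "hashed" });  // hashed sharding
sh.shardCollection("test.orders", { custId: 1 });    // ranged sharding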
Thank you

Nilesh Ghule
<[email protected]>
