BDT Unit 02 - Part1
Technologies
Department of Computer Engineering and Technology
Course Objectives:
• Understand the various aspects of Big Data.
• Learn the concepts of NoSQL for Big Data.
• Design an application for distributed systems on Big Data.
• Explore the various Big Data visualization tools.
Course Outcomes:
• Apply the insights of Big Data in business applications.
• Illustrate the application of MongoDB in real-world applications.
• Build Hadoop-based distributed systems for real-world problems.
• Apply and utilize Big Data visualization tools for real-world applications.
UNIT II- 4
Centralized Database
Advantages of Centralized Database
• It has decreased the risk of data management, i.e., manipulation of data will not affect the core data.
• Consistency is maintained, as it manages data in a central repository.
• It enables organizations to establish data standards.
• It is less costly because fewer vendors are required to handle the data sets.
Limitations of Centralized Database
• The size of the centralized database is large, which increases the response time for fetching the data.
• It is not easy to update such an extensive database system.
• If any server failure occurs, the entire data will be lost, which could be a huge loss.
Distributed Database System
• Data is stored at a number of sites; each site logically consists of a single processor.
• Processors at different sites are interconnected by a computer network; there are no multiprocessors (those are parallel database systems).
• A distributed database is a database, not a collection of files: data is logically related, as exhibited in the users' access patterns (relational data model).
• D-DBMS is a full-fledged DBMS: not a remote file system, not a TP system.
• Unlike a centralized database system, data is distributed among different database systems of an organization.
• These database systems are connected via communication links, which help the end-users to access the data easily.
• Examples: Apache Cassandra, HBase, Ignite, etc.
Distributed Database Types
Homogeneous DDB:
• Execute on the same operating system
• Use the same application process
• Carry the same hardware devices
Heterogeneous DDB:
• Execute on different operating systems
• Run under different application procedures
• Carry different hardware devices
Advantages
One server failure will not affect the entire data set.
Data can be joined and updated from different tables which are located on
different machines.
It is secure.
Disadvantages
• The distributed database is quite complex.
• This database is more expensive, as it is complex and hence difficult to maintain.
• Because it is a distributed system, the database needs stronger security, and each and every node must be secured as well.
• Data integrity and data redundancy will not be maintained.
Relational Database
Relational Database Representation
Cloud Database
Object-oriented Database
Hierarchical Database: It organizes data in a tree-like structure.
Network Database
NoSQL Database
• A type of database that is used for storing a wide range of data sets.
• Non-relational.
• Presents a wide variety of database technologies in response to the demands.
• Schema-less.
• Relaxes one or more of the ACID properties (explained with the CAP theorem).
Motivation for NoSQL Database
Structured vs Unstructured
• Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
Data Structures (contd.)
• Unstructured data:
■ Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze.
■ Unstructured data has internal structure but is not structured via pre-defined data models or schema.
■ It may be textual or non-textual, and human- or machine-generated.
■ It may also be stored within a non-relational database like NoSQL.
Types of data in Big Data Scenario
• Semi-structured data:
■ Maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies.
■ Both documents and databases can be semi-structured.
■ This type of data only represents about 5-10% of the structured/semi-structured/unstructured data pie.
■ Typical use is in OO models.
■ Examples: CSV, XML, JSON
Difference between Structured and Unstructured Data
Motivation for NoSQL Database
• Relational databases are not suitable for distributed computing.
• Performance of RDBMS for various applications
Addressing system growth
• Every database has to be scaled to address the huge amount of data being generated each day.
• This makes data available at all times for users.
• When the memory of the database is drained, or when it cannot handle multiple requests, it is not scalable.
• Scalability: the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
• Vertical Scaling and Horizontal Scaling
Scaling databases
• Types of Scaling: Vertical Scaling, Horizontal Scaling
Horizontal Scaling
Advantages:
● Cheap compared to vertical scaling.
● Lesser load, better performance.
● Chances of downtime are less.
● Resilience and fault tolerance.
● Suitable for distributed databases.
Limitations:
● Making joins is difficult, due to cross-server communication.
● Only eventual consistency is possible.
● It may not be best suited for bank transactions.
Horizontal and Vertical scaling
What is NoSQL?
When and when not to use NoSQL
NoSQL: Applications and Popularity
Schema-less Data Model
• No fixed schema to consider
• No implicit datatypes
• Most considerations done at application layer, including transactions
• All aggregate data is gathered in documents.
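The schema-less idea above can be illustrated with a small sketch. It is not MongoDB code: the list `student_info`, the helper `insert`, and the rule inside `validate_student` are all assumptions made for illustration. It only shows that documents with different fields can sit in one collection, and that any checking is left to the application layer.

```python
# Hypothetical in-memory "collection": a plain list of dicts standing in for documents.
student_info = []


def insert(collection, document):
    """Store a copy of the document as-is; the 'database' checks no schema."""
    collection.append(dict(document))


def validate_student(document):
    """Application-layer rule (an assumption for this sketch): a student needs a non-empty name."""
    name = document.get("name")
    return isinstance(name, str) and name != ""


# Two documents with different sets of fields live in the same collection.
insert(student_info, {"id": "151", "name": "Vasundhara", "city": "Pune"})
insert(student_info, {"id": "152", "name": "Pankaj", "age": 24, "hobbies": ["chess"]})

# Each document carries its own field names; there is no shared column list.
field_sets = [sorted(doc.keys()) for doc in student_info]
all_valid = all(validate_student(doc) for doc in student_info)
```

Because nothing in the store rejects a malformed document, a check such as `validate_student` has to be called by the application before inserting.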
CAP – NoSQL Data models
• Consistency: the data in the database remains consistent after the execution of an operation.
• Availability: the system is always on (service guarantee availability), no downtime.
• Partition Tolerance: the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
CAP Theorem
CA: Consistent and Available
Example: a standalone MySQL server/node which has no replication. It provides consistency and availability till it goes down.
Applications: bank account balances and text messages, which require higher consistency. RDBMS are CA systems.
AP: Available and Partition Tolerant
Example: a distributed NoSQL database where replication to nodes happens asynchronously. The system will always respond, but not all the nodes will have the latest version of the data when queried.
Applications: e-commerce sites, which focus on high availability in case of partitions in a distributed environment by trading off consistency.
CP: Consistent and Partition Tolerant
Similar to CA systems, but the difference is its applicability to a distributed environment.
Example: in MongoDB the primary node is replicated into secondary nodes. If the primary node fails, the system switches to a secondary node. During this switch, data is not made available to the user.
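The difference between AP and CP behaviour during a partition can be sketched with a toy two-replica cluster. This is a simplified model, not how any real database is implemented; `Replica`, `TinyCluster`, and `read_during_partition` are hypothetical names introduced only for this example.

```python
class Replica:
    """One copy of a single value."""

    def __init__(self):
        self.value = None


class TinyCluster:
    """Toy cluster with a primary and a secondary; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.primary = Replica()
        self.secondary = Replica()
        self.partitioned = False

    def write(self, value):
        # Writes land on the primary; they reach the secondary only while connected.
        self.primary.value = value
        if not self.partitioned:
            self.secondary.value = value

    def read_from_secondary(self):
        # CP: refuse to answer rather than risk returning stale data.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        # AP: always answer, even if the value may be out of date.
        return self.secondary.value


def read_during_partition(mode):
    """Write v1, cut the link, write v2, then read from the secondary side."""
    cluster = TinyCluster(mode)
    cluster.write("v1")
    cluster.partitioned = True
    cluster.write("v2")
    try:
        return cluster.read_from_secondary()
    except RuntimeError:
        return "unavailable"


ap_result = read_during_partition("AP")  # answers, but with the older value
cp_result = read_during_partition("CP")  # refuses to answer
```

In this sketch the AP cluster stays available and returns the stale `v1`, while the CP cluster gives up availability to avoid returning anything inconsistent.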
CAP – NoSQL Data models
• CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
• CP - Some data may not be accessible, but the rest is still consistent/accurate.
• AP - System is still available under partitioning, but some of the data returned may be inaccurate.
CAP Theorem
In the absence of network failure, that is, when the distributed system is running normally, both availability and consistency can be satisfied.
CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.
CAP Theorem
A network partition refers to network decomposition into relatively independent subnets for their separate optimization, as well as a network split due to the failure of network devices. In both cases the partition-tolerant behavior of subnets is expected.
Partition tolerance, or robustness, means that a given system continues to operate even with data loss or node failure.
CAP Theorem
• Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability.
The BASE Properties
ACID characteristics (CP):
• Atomicity
• Consistency
• Isolation
• Durability
BASE characteristics (AP):
• Basically Available
• Soft-state
• Eventually consistent
Data Models and Types of NoSQL DBs
NoSQL Data Model
Types of NoSQL databases
Distributed Key-Value Systems - Look up a single value for a key
• Amazon's Dynamo
Document-based Systems - Access data by key or by search of "document" data
• CouchDB
• MongoDB
Column-based Systems
• Google's BigTable
• Facebook's Cassandra
Key-Value Pair (KVP) Stores
Access data (values) by strings called keys.
Example systems
• Amazon Dynamo
"Value" is stored as a "blob", without caring or knowing what is inside.
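A minimal sketch of the key-value idea, assuming an in-memory dictionary in place of a real distributed store. The class `KeyValueStore` and the key format `student:151` are hypothetical. The point is that the store only sees opaque bytes; interpreting them is the client's job.

```python
import json


class KeyValueStore:
    """Toy key-value store: string keys, opaque byte values."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it is just a blob.
        self._data[key] = value

    def get(self, key: str):
        # Returns the stored blob, or None when the key is absent.
        return self._data.get(key)


store = KeyValueStore()
profile = {"name": "Vasundhara", "city": "Pune"}

# The client chooses the encoding (JSON here) and the store keeps the raw bytes.
store.put("student:151", json.dumps(profile).encode("utf-8"))

raw = store.get("student:151")
decoded = json.loads(raw.decode("utf-8"))
```

Lookups are only possible by key: there is no way to ask this store for "all students in Pune" without reading and decoding every blob.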
Key-Value Store
Column-based Data Model
• Column Family:
• Column is the smallest instance of data.
• It is a tuple containing a name, value, and timestamp.
• Examples: Apache Cassandra, used by Facebook
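The column described above (name, value, timestamp) can be modelled as a small tuple. The sketch below also assumes a last-write-wins rule, where the timestamp decides which of two conflicting writes is kept; `Column`, `merge`, and `write` are hypothetical helper names, and the timestamps are arbitrary integers.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Smallest unit of data in this sketch: a name, a value, and a timestamp."""
    name: str
    value: str
    timestamp: int


def merge(existing: Column, incoming: Column) -> Column:
    """Keep whichever column carries the newer timestamp (last write wins)."""
    return incoming if incoming.timestamp > existing.timestamp else existing


def write(row: dict, column: Column) -> None:
    """Store a column in a row, resolving conflicts by timestamp."""
    current = row.get(column.name)
    row[column.name] = column if current is None else merge(current, column)


# One row of a column family: column name -> Column.
row = {}
write(row, Column("city", "Pune", 100))
write(row, Column("city", "Mumbai", 90))  # older write arriving late is ignored
write(row, Column("email", "v@example.com", 105))
```

Rows in the same column family do not need the same set of columns; each row simply holds whichever columns were written to it.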
Column-based Data Model
This type of data store is good for
Systems
Nodes may have properties (including ID)
❖ Based on graph theory
❖ Vertical Scaling
❖ No clustering
❖ Transactions exist
❖ ACID followed
❖ Examples: Neo4j, Amazon Neptune, OrientDB, Dgraph.
Graph Databases
• In general, graph databases are useful when you are more interested in relationships between data than in the data itself: for example, in representing and traversing social networks, generating recommendations, or conducting forensic investigations (e.g. pattern detection).
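As a rough illustration of relationship-centred queries, the sketch below recommends "friends of friends" in a tiny social graph. It uses a plain adjacency dictionary rather than a graph database, and the names in `friends` and the function `recommend` are invented for this example.

```python
from collections import Counter

# Undirected friendship graph as an adjacency dictionary (illustrative data).
friends = {
    "Asha": {"Ravi", "Meera"},
    "Ravi": {"Asha", "Kiran"},
    "Meera": {"Asha", "Kiran", "Dev"},
    "Kiran": {"Ravi", "Meera"},
    "Dev": {"Meera"},
}


def recommend(graph, person):
    """Suggest people two hops away, ranked by number of mutual friends."""
    direct = graph[person]
    mutual_counts = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the person themselves and anyone already a direct friend.
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for a stable order.
    return sorted(mutual_counts, key=lambda name: (-mutual_counts[name], name))


suggestions = recommend(friends, "Asha")
```

The query never looks at any property of the people; it only follows edges, which is the kind of workload the slide describes.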
Document databases are used to store data as JSON-like documents.
Document databases are good for storing and managing Big Data-size collections of literal documents, like text documents, email messages, and XML documents, as well as conceptual "documents" like de-normalized (aggregate) representations of a database entity such as a product or customer.
Massive volumes of data (Big Data) are easily handled by NoSQL databases.
NoSQL Document-Based Data Model
MongoDB Database
History of MongoDB
MongoDB Overview
What is MongoDB?
● Strings
● Number
● Boolean,
● Objects and
● Arrays
JSON Format Example
{
  "book": [
    {
      "id": "01",
      "language": "Java",
      "edition": "third"
    },
    {
      "id": "07",
      "language": "C++",
      "edition": "second",
      "author": "E.Balagurusamy"
    }
  ]
}
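To see how the JSON value types listed above (strings, numbers, booleans, objects, arrays) appear once parsed, the following sketch loads a small document with Python's standard `json` module. The fields `in_print` and `pages` are not part of the slide's example; they are added here only so that a boolean and a number appear.

```python
import json

# A well-formed JSON document; "in_print" and "pages" are illustrative additions.
text = """
{
  "book": [
    {"id": "01", "language": "Java", "edition": "third", "in_print": true},
    {"id": "07", "language": "C++", "edition": "second", "pages": 540}
  ]
}
"""

data = json.loads(text)

# Map each JSON type to the Python type it is parsed into.
python_types = {
    "object": type(data).__name__,
    "array": type(data["book"]).__name__,
    "string": type(data["book"][0]["id"]).__name__,
    "boolean": type(data["book"][0]["in_print"]).__name__,
    "number": type(data["book"][1]["pages"]).__name__,
}

# Nested values are reached by walking objects (by key) and arrays (by index).
second_language = data["book"][1]["language"]
```

Note that a parser rejects input with a missing brace or a trailing comma, so a document has to be well-formed before it can be loaded.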
BSON
• "Binary JSON"
{
  "_id" : "37010",
  "city" : "ADAMS",
  "pop" : 2660,
  "state" : "TN",
  "councilman" : {
    "name" : "John Smith",
    "address" : "13 Scenic Way"
  }
}
Why MongoDB?
Pros and Cons of MongoDB
SQL Vs MongoDB
SQL                                          | MongoDB
Database                                     | Database
Table, View                                  | Collection
Row                                          | Document (BSON Document)
Column                                       | Field
Index                                        | Index
Table Join                                   | Embedded documents & Linking
Primary key (any unique column or            | Primary key (automatically set to
column combination can be specified)         | the _id field)
Aggregation (e.g. group by)                  | Aggregation pipeline
Schema design
⚫ RDBMS: join
⚫ MongoDB: Hierarchical Objects
Replication
⚫ Replica Sets and Master-Slave
⚫ Replica sets are a functional superset of master/slave and are handled by much newer, more robust code.
⚫ Only one server is active for writes (the primary, or master) at a given time; this is to allow strongly consistent (atomic) operations. One can optionally send read operations to the secondaries when eventual consistency semantics are acceptable.
Why Replica Sets
⚫ Data Redundancy
⚫ Automated Failover
⚫ Read Scaling
⚫ Maintenance
⚫ Disaster Recovery (delayed secondary)
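The replica-set behaviour described above (a single writable primary, secondaries that may lag, automated failover) can be sketched with a toy model. This is not MongoDB's election protocol; `Node`, `ReplicaSet`, and the rule "promote the live node with the longest log" are simplifying assumptions for illustration.

```python
class Node:
    """One member of the replica set, holding an ordered log of writes."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.alive = True


class ReplicaSet:
    """Toy replica set: the first node starts as primary."""

    def __init__(self, names):
        self.nodes = [Node(name) for name in names]
        self.primary = self.nodes[0]

    def write(self, entry):
        # Only the primary accepts writes.
        if not self.primary.alive:
            raise RuntimeError("no primary available")
        self.primary.log.append(entry)

    def replicate(self):
        # Copy the primary's log to every live secondary (done asynchronously in practice).
        for node in self.nodes:
            if node is not self.primary and node.alive:
                node.log = list(self.primary.log)

    def fail_over(self):
        # Simplified election: mark the primary dead, promote the most up-to-date live node.
        self.primary.alive = False
        candidates = [node for node in self.nodes if node.alive]
        self.primary = max(candidates, key=lambda node: len(node.log))


rs = ReplicaSet(["A", "B", "C"])
rs.write("x=1")
rs.replicate()
rs.write("x=2")
stale_read = rs.nodes[1].log[-1]  # secondary B has not yet received x=2
rs.replicate()
rs.fail_over()                    # A goes down; a secondary takes over
rs.write("x=3")                   # writes continue on the new primary
```

The stale read from a secondary is the eventual-consistency trade-off mentioned in the slide, and the successful write after `fail_over` illustrates automated failover.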
Sharding for horizontal scaling
• Sharding is the process of distributing data across multiple machines, and it is MongoDB's approach to meeting the demands of data growth.
Replication & Sharding conclusion
⚫ Sharding is the tool for scaling a system, and replication is the tool for data safety, high availability, and disaster recovery. The two work in tandem yet are orthogonal concepts in the design.
Shard
– Shard: Each shard contains a subset of the sharded data. Each shard can be deployed as a replica set.
mongod
– Config servers:
– Config servers store metadata and configuration settings for the cluster.
– As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS, Config Servers as Replica Sets).
• For queries that include the shard key or the prefix of a compound shard key, mongos can target the query at a specific shard or set of shards. These targeted operations are generally more efficient than broadcasting to every shard in the cluster.
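The difference between a targeted and a broadcast query can be sketched with a toy router. This is only a model of the idea: the shard names, the shard key `student_id`, and the MD5-based placement are assumptions for this example and do not reproduce how mongos or MongoDB's hashed sharding is implemented.

```python
import hashlib

# Hypothetical shard names and shard key for this sketch.
SHARDS = ["shard0", "shard1", "shard2"]
SHARD_KEY = "student_id"


def shard_for(key_value):
    """Map a shard-key value to one shard using a stable hash."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def route(query):
    """Return the list of shards that must be contacted for this query."""
    if SHARD_KEY in query:
        # Targeted: the shard key pins the query to exactly one shard.
        return [shard_for(query[SHARD_KEY])]
    # Broadcast: without the shard key, every shard has to be asked.
    return list(SHARDS)


targeted = route({"student_id": "151"})
broadcast = route({"city": "Pune"})
```

A targeted query touches one shard while the broadcast touches all three, which is why choosing a shard key that appears in most queries matters.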
Advantages of Sharding
2. Storage Capacity
3. High Availability
• A sharded cluster can continue to perform partial read/write operations even if one or more shards are unavailable.
• mongos
– Sharding processes
– Analogous to a database router.
– Processes all requests.
– Decides how many and which mongods should receive the query.
– mongos collates the results and sends them back to the client.
Terminology and Concepts
SQL Terms/Concepts | MongoDB Terms/Concepts
Database           | Database
Table              | Collection
Row                | Document
Column             | Field
Index              | Index
Primary key        | Primary key
How To Install MongoDB?
• To install MongoDB, first download the MongoDB Community Server version 4.2.12 (zip file) for your operating system from:
https://www.mongodb.com/try/download/community
Select Zip
Connectivity Between Client and Server
Client Configuration:
G:\mongodb-win32-x86_64-2012plus-4.2.12\mongodb-win32-x86_64-2012plus-4.2.12\bin>mongo.exe
Executables
                | MongoDB | MySQL  | Oracle  | Informix  | DB2
Database Server | mongod  | mysqld | oracle  | IDS       | DB2 Server
Database Client | mongo   | mysql  | sqlplus | DB-Access | DB2 Client
MongoDB CRUD Operations
• Create Operations
• Read Operations
• Update Operations
• Delete Operations
show dbs
Print a list of all databases on the server.
• Syntax:
>db.createCollection("collection_name")
Consider that we want to store student information, so we can create a Student_info collection as follows:
>db.createCollection("Student_info")
>db.collectionname.insert({key1:value,key2:value,...})
Inserts a document or documents into a collection.
>db.Student_info.insert({id:"151", name:"Vasundhara", city:"Pune"})
>db.collectionname.find()
▪ Update
Syntax: db.collection.update( <query>, <update>, <options> )
Example:
db.Student_info.update(
{ name:"Pankaj" },
{ $set: { age:24 } }
)
Example (upsert: insert the document if no match is found):
>db.Student_info.update({name:"Ruhi"},{$set:{age:23}},{upsert:true})
Example: Assume there are 2 documents with name as Jyoti. To update the city value for both, we can use the following command:
>db.Student_info.update({name:"Jyoti"},{$set:{city:"Bangalore"}},{multi:true})
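The effect of the `upsert` and `multi` options can be mimicked with a small in-memory sketch. It is not the MongoDB driver: `matches` and `update` are hypothetical helpers that handle only equality queries and `$set`, and a real upsert would also assign an `_id`, which is omitted here.

```python
def matches(document, query):
    """True when every key in the query equals the document's value (equality only)."""
    return all(document.get(key) == value for key, value in query.items())


def update(collection, query, change, upsert=False, multi=False):
    """Apply change["$set"] to matching documents; return how many were touched."""
    touched = 0
    for document in collection:
        if matches(document, query):
            document.update(change["$set"])
            touched += 1
            if not multi:
                break  # default behaviour: stop after the first match
    if touched == 0 and upsert:
        # No match: insert a new document built from the query and the $set fields.
        collection.append({**query, **change["$set"]})
        touched = 1
    return touched


students = [
    {"name": "Jyoti", "city": "Pune"},
    {"name": "Jyoti", "city": "Delhi"},
    {"name": "Pankaj"},
]

single = update(students, {"name": "Pankaj"}, {"$set": {"age": 24}})
upserted = update(students, {"name": "Ruhi"}, {"$set": {"age": 23}}, upsert=True)
many = update(students, {"name": "Jyoti"}, {"$set": {"city": "Bangalore"}}, multi=True)
```

Without `multi=True` only the first document named Jyoti would change, and without `upsert=True` the update for Ruhi would leave the collection untouched.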
Delete operation
▪ Delete
db.collection.remove( <query>, <justOne> )
▪ collection specifies the collection (the 'table') that holds the documents.
Example:
> db.Student_info.remove({name:"Pankaj"})
• Pattern Matching:
db.Student_info.find( { name: { $regex: "^V.*" } } ) // names starting with 'V'
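The same pattern can be tried outside the database with Python's `re` module, which helps when checking what a regular expression will match before using it in a query. The sample names below are illustrative, and the case-insensitive variant corresponds roughly to passing the `i` option alongside `$regex`.

```python
import re

# Illustrative documents; only the "name" field matters here.
students = [
    {"name": "Vasundhara"},
    {"name": "Pankaj"},
    {"name": "Vikram"},
    {"name": "vani"},
]

# "^V" anchors the match at the start of the string; ".*" then matches the rest.
starts_with_v = re.compile(r"^V.*")
case_sensitive = [s["name"] for s in students if starts_with_v.search(s["name"])]

# Case-insensitive variant: also matches names beginning with a lowercase "v".
starts_with_v_any_case = re.compile(r"^V.*", re.IGNORECASE)
case_insensitive = [s["name"] for s in students if starts_with_v_any_case.search(s["name"])]
```

Since the pattern is anchored with `^`, the trailing `.*` is optional; `^V` alone selects the same names.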