
Chapter14_BigData&NoSQLDatabases

Database class

Uploaded by

chamso Abou
Copyright © All Rights Reserved

Big Data and NoSQL Databases

CSC3326
Learning Objectives

• Understand Big Data and its 3Vs


• Get introduced to NoSQL databases
• Understand the difference between Relational and NoSQL databases
• Distinguish the different types of NoSQL databases
• Explain how a document database such as MongoDB stores and
manipulates data
Big Data
• Rapid data growth is the top challenge for organizations, because it strains system
performance and scalability.

• Big data is a movement to find new and better ways to manage large amounts of data and
derive business insight from it, while providing high performance and scalability at a
reasonable cost. It is characterized by the 3Vs: volume, velocity, and variety.

• Volume: the quantity of data to be stored. The storage capacities associated with Big Data are
extremely large.

• Velocity: the rate at which new data enters the system as well as the rate at which the data
must be processed.

• Variety: the variations in the structure of the data to be stored. Data can be structured,
unstructured, or semistructured.
Big Data
As the quantity of data to be stored increases, so does the need for
larger storage capacity. There are two ways to provide it:
• Scaling up means keeping the same number of systems but migrating each
system to a larger one.
• Scaling out means that when the workload exceeds the capacity of a
server, the workload is spread across a number of servers. This is also
referred to as clustering, that is, creating a cluster of low-cost servers to
share a workload.
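The idea of scaling out can be sketched roughly as follows: each data key is hashed to pick one server in a cluster, so the workload is shared. The server names, the keys, and the simple hash function are assumptions made for illustration, not part of any particular product.

```javascript
// Hypothetical cluster of low-cost servers (names are made up for illustration).
const servers = ["server-a", "server-b", "server-c"];

// Simple deterministic string hash (illustrative only, not production quality).
function hashKey(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h;
}

// Pick the server responsible for a given key.
function serverFor(key, cluster) {
  return cluster[hashKey(key) % cluster.length];
}

// Spread a workload of keys across the cluster.
function distribute(keys, cluster) {
  const placement = new Map(cluster.map((s) => [s, []]));
  for (const key of keys) {
    placement.get(serverFor(key, cluster)).push(key);
  }
  return placement;
}

const placement = distribute(["user:1", "user:2", "user:3", "user:4"], servers);
console.log(placement);
```

Real systems usually use more elaborate partitioning schemes so that adding a server moves as little data as possible; the modulo approach here is only the simplest way to show the principle.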
Big Data
• There is a need for databases that can provide:
o Scalability
o Flexibility
o Reasonable cost
o Availability
NoSQL
• Although much of the transactional data that organizations use works well in a
structured environment, most of the data in the world is semistructured or
unstructured.
• Relational databases impose a structure on the data when the data is captured
and stored.
• Big Data requires that the data be captured in whatever format it naturally
exists, without any attempt to impose a data model or structure to the data.
NoSQL
• NoSQL represents a broad array of nonrelational database technologies
that have developed to address the challenges represented by Big Data
• NoSQL DBs are built to be flexible, scalable and capable of rapidly
responding to the data management demands of Big Data applications.
• NoSQL DBs take a nonrelational approach to storing and processing
data.
• NoSQL DBs do not force data to fit predefined structures.
• NoSQL DBs provide distributed, fault-tolerant databases for processing
unstructured data.
• NoSQL DBs are not based on the relational model and SQL.
NoSQL
• NoSQL databases generally fall into one of four categories: key-value data stores,
document databases, column-oriented databases, and graph databases.
Data Consistency in Distributed Systems
• Distributed systems offer a range of benefits, including increased scalability, fault tolerance, and performance
• However, managing data consistency in distributed systems is a very complex problem
• Two consistency modes:
o Strong consistency: data must be consistently and identically available across all
server nodes. At any given time, all server nodes hold the same value for a given entity. That
means that data across nodes must be updated immediately after a write request is made to one of the
server nodes. While the update is in progress, access to the data is locked.
o Eventual consistency: temporary inconsistencies between server nodes are allowed. The data
across nodes becomes consistent eventually, because updates take time to reach the other nodes. This
makes data highly available, since access to data is not locked.
• Strong consistency provides immediate consistency but can result in higher latency and lower
availability. In contrast, eventual consistency prioritizes availability but can lead to temporary data
inconsistencies.
• When choosing between strong and eventual consistency, it’s important to consider the specific
requirements of your system.
• Nonrelational DBs typically adopt eventual consistency, while relational DBs typically adopt strong consistency.
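The difference between the two modes can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Cluster` class, its method names, and the explicit `replicate()` step are hypothetical, and real systems replicate over a network in the background.

```javascript
// Toy model: one value replicated on several nodes (all names are illustrative).
class Cluster {
  constructor(nodeCount) {
    this.nodes = Array.from({ length: nodeCount }, () => ({ value: null }));
    this.pending = []; // replication work that has not been applied yet
  }

  // Strong consistency: the write completes only after every node is updated.
  writeStrong(value) {
    for (const node of this.nodes) node.value = value;
  }

  // Eventual consistency: update one node now and queue the rest for later.
  writeEventual(value) {
    this.nodes[0].value = value;
    for (let i = 1; i < this.nodes.length; i++) {
      this.pending.push({ index: i, value });
    }
  }

  // Simulates background replication catching up.
  replicate() {
    for (const { index, value } of this.pending) this.nodes[index].value = value;
    this.pending = [];
  }

  read(index) {
    return this.nodes[index].value;
  }
}

const cluster = new Cluster(3);

cluster.writeStrong("v1");
const afterStrong = [cluster.read(0), cluster.read(2)]; // both nodes agree

cluster.writeEventual("v2");
const beforeReplication = [cluster.read(0), cluster.read(2)]; // nodes disagree for a while

cluster.replicate();
const afterReplication = [cluster.read(0), cluster.read(2)]; // nodes agree again

console.log({ afterStrong, beforeReplication, afterReplication });
```

In this model a reader of node 2 sees a stale value until replication runs, which is the temporary inconsistency that eventual consistency accepts in exchange for availability.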
Strong Consistency Mode
Performance
• NoSQL DBs provide high scalability and high performance.
• Example: the data for a blog website.
• In a document-based NoSQL, all data related to each post is collected
into a self-contained single document containing data about user, post
details and comments.
• In a relational DB, this data will be split into three tables: user, post and
comment.
What are the benefits of NoSQL databases?
• Flexible data models: NoSQL databases typically have very flexible schemas.

• Horizontal scaling: most NoSQL databases allow you to scale-out horizontally, meaning you
can add cheaper commodity servers whenever you need to.

• Fast queries: queries in NoSQL databases can be faster than in SQL databases. Data in SQL
databases is typically normalized, so queries require you to join data from multiple tables. As
your tables grow, the joins can become expensive. Data in NoSQL databases, by contrast, is
typically stored in a way that is optimized for queries. The rule of thumb is that data that is
accessed together should be stored together.
The Disadvantages of NoSQL Databases
• NoSQL databases also have their own limitations and weaknesses.

• The lack of SQL: lack of a standard query language

• The lack of ACID: ACID stands for the four key properties that define a transaction
(Atomicity, Consistency, Isolation, and Durability). Many NoSQL databases do not fully
support these properties.
SQL Databases vs. NoSQL Databases

• Data Storage Model
o SQL: Tables with fixed rows and columns
o NoSQL: Document: JSON documents, Key-value: key-value pairs, Wide-column: tables with
rows and dynamic columns, Graph: nodes and edges

• Development History
o SQL: Developed in the 1970s with a focus on reducing data duplication
o NoSQL: Developed in the late 2000s with a focus on scaling and allowing for rapid
application change driven by agile and DevOps practices

• Examples
o SQL: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL
o NoSQL: Document: MongoDB and CouchDB, Key-value: Redis and DynamoDB, Wide-column:
Cassandra and HBase, Graph: Neo4j and Amazon Neptune

• Primary Purpose
o SQL: General purpose
o NoSQL: Document: general purpose, Key-value: large amounts of data with simple lookup
queries, Wide-column: large amounts of data with predictable query patterns, Graph: analyzing
and traversing relationships between connected data

• Schemas
o SQL: Rigid
o NoSQL: Flexible

• Scaling
o SQL: Vertical (scale-up with a larger server)
o NoSQL: Horizontal (scale-out across commodity servers)

• Multi-Record ACID Transactions
o SQL: Supported
o NoSQL: Most do not support multi-record ACID transactions. However, some, like MongoDB, do.

• Joins
o SQL: Typically required
o NoSQL: Typically not required

Key-Value Databases
• Key-value (KV) databases are conceptually the simplest of the NoSQL data models
• A KV database is a NoSQL database that stores data as a collection of key-value pairs. The key acts
as an identifier for the value. The value can be anything, such as text, an XML document, or an
image.
• The database does not attempt to understand the contents of the value component or its
meaning—the database simply stores whatever value is provided for the key
• Key-value pairs are typically organized into “buckets.” A bucket can roughly be thought of as the
KV database equivalent of a table. A bucket is a logical grouping of keys. Key values must be
unique within a bucket, but they can be duplicated across buckets.
• Operations on KV databases are rather simple—get, store, and delete operations are
used. Get or fetch is used to retrieve the value component of the pair.
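The bucket and operation model described above can be sketched as a small in-memory store. The class name `KeyValueStore`, the bucket names, and the sample values are assumptions for illustration; this is not the API of any specific KV product.

```javascript
// Minimal in-memory key-value store with buckets (a sketch, not a real product's API).
class KeyValueStore {
  constructor() {
    this.buckets = new Map(); // bucket name -> Map of key -> value
  }

  // store: save a value under a key; the store never inspects the value.
  store(bucket, key, value) {
    if (!this.buckets.has(bucket)) this.buckets.set(bucket, new Map());
    this.buckets.get(bucket).set(key, value);
  }

  // get: return the value for a key, or undefined if the key is absent.
  get(bucket, key) {
    const b = this.buckets.get(bucket);
    return b ? b.get(key) : undefined;
  }

  // delete: remove the key-value pair; returns true if something was removed.
  delete(bucket, key) {
    const b = this.buckets.get(bucket);
    return b ? b.delete(key) : false;
  }
}

const kv = new KeyValueStore();
// The same key may appear in different buckets, but only once per bucket.
kv.store("customers", "1001", "<customer><name>Sample</name></customer>");
kv.store("orders", "1001", "plain text describing an order");
kv.store("customers", "1001", "replacement value"); // overwrites within the bucket

console.log(kv.get("customers", "1001"));
console.log(kv.get("orders", "1001"));
```

Because the value is opaque to the store, any searching inside a value has to be done by the application after a `get`.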
Document Databases
• Document databases are conceptually similar to key-value databases, and they can almost be considered a
subtype of KV databases.

• A document database is a NoSQL database that stores data in tagged documents in key-value pairs.

• Unlike a KV database where the value component can contain any type of data, a document database always
stores a document in the value component.

• The document can be in any encoded format, such as XML or JSON (JavaScript Object Notation)

• While KV databases do not attempt to understand the content of the value component, document databases
do

• Despite the use of tags in documents, document databases are considered schema-less, that is, they do not
impose a predefined structure on the data that is stored

• Being schema-less means that although all documents have tags, not all documents are required to have the
same tags, so each document can have its own structure

• Tags inside the document are accessible to the DBMS, which makes sophisticated querying possible.

• Document databases group documents into logical groups called collections.
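As a rough illustration of schema-less documents and tag-based querying, the sketch below models a collection as a plain array of objects. Only the first document comes from this chapter's book example; the other documents, the `findByTag` helper, and the `pages` tag are hypothetical.

```javascript
// A "collection" modeled as an array of documents that do not share one structure.
const books = [
  { _id: 101, title: "Database Systems", author: ["Coronel", "Morris"] },
  { _id: 102, title: "Untitled Draft" },            // hypothetical: no author tag
  { _id: 103, title: "Sample Notes", pages: 120 },  // hypothetical: has a pages tag
];

// Because tags are visible to the system, documents can be filtered by content.
function findByTag(collection, tag, predicate) {
  return collection.filter((doc) => tag in doc && predicate(doc[tag]));
}

const byCoronel = findByTag(books, "author", (authors) => authors.includes("Coronel"));
const longBooks = findByTag(books, "pages", (pages) => pages > 100);

console.log(byCoronel.map((doc) => doc._id));
console.log(longBooks.map((doc) => doc._id));
```

Documents that lack the requested tag are simply skipped, which is how a query can work even though not all documents have the same tags.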


Column-Oriented Databases
• A column family database is a NoSQL database that organizes data in key-value pairs with keys
mapped to a set of columns in the value component.

• Each row key in the column family can have different columns.
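A column family can be pictured, very roughly, as a map from row keys to sets of named columns. The sketch below assumes made-up row keys, column names, and helper functions purely for illustration.

```javascript
// Column family modeled as: row key -> set of columns (all names and values are made up).
const customerFamily = new Map([
  ["row1", { name: "Alice", city: "Rabat" }],
  ["row2", { name: "Bob", email: "bob@example.com", phone: "555-0100" }],
]);

// List the columns that exist for one row key.
function columnsOf(family, rowKey) {
  const row = family.get(rowKey);
  return row ? Object.keys(row) : [];
}

// Read a single column of a row; undefined if the row or column is absent.
function getColumn(family, rowKey, column) {
  const row = family.get(rowKey);
  return row ? row[column] : undefined;
}

console.log(columnsOf(customerFamily, "row1"));
console.log(columnsOf(customerFamily, "row2"));
console.log(getColumn(customerFamily, "row1", "email"));
```

The two rows belong to the same column family yet carry different columns, which a fixed relational table would not allow without null-filled columns.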
Graph Databases
• A graph database is a NoSQL database based on graph theory to store data about relationship-rich
environments.

• Modeling and storing data about relationships is the focus of graph databases.

• The primary components of graph databases are nodes, edges, and properties

• The node is a specific instance of something we want to keep data about.

• Properties are like attributes; they are the data that we need to store about the node

• An edge is a relationship between nodes.

• Edges can be in one direction, or they can be bidirectional.

• A query in a graph database is called a traversal.
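The components above can be sketched with plain data structures: nodes with properties, edges that are one-directional or bidirectional, and a traversal that follows edges from a starting node. The node names, the `FOLLOWS` and `FRIEND` edge types, and the breadth-first strategy are assumptions for illustration, not how any particular graph database works internally.

```javascript
// Nodes: specific instances we keep data about, each with properties.
const nodes = new Map([
  ["alice", { name: "Alice" }],
  ["bob", { name: "Bob" }],
  ["carol", { name: "Carol" }],
  ["dave", { name: "Dave" }],
]);

// Edges: relationships between nodes; some are bidirectional.
const edges = [
  { from: "alice", to: "bob", type: "FOLLOWS" },
  { from: "bob", to: "carol", type: "FRIEND", bidirectional: true },
  { from: "dave", to: "alice", type: "FOLLOWS" },
];

// Nodes directly reachable from a node, respecting edge direction.
function neighbors(id) {
  const result = [];
  for (const edge of edges) {
    if (edge.from === id) result.push(edge.to);
    else if (edge.bidirectional && edge.to === id) result.push(edge.from);
  }
  return result;
}

// Traversal: breadth-first walk over all nodes reachable from the start node.
function traverse(start) {
  const visited = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of neighbors(current)) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return [...visited];
}

console.log(traverse("alice").map((id) => nodes.get(id).name));
console.log(traverse("carol").map((id) => nodes.get(id).name));
```

Starting from carol, the walk reaches bob through the bidirectional edge but cannot reach alice, because the alice-to-bob edge points only one way.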


Aggregate Awareness

• Key-value, document, and column family databases are aggregate aware


• Aggregate aware means that the data is collected or aggregated around a central
topic or entity.
• For example, a blog website might organize data around individual blog posts. All data
related to each blog post is aggregated into a single denormalized collection that
might include data about the blog post (title, content, and date posted), the poster
(user name and screen name), and all comments made on the post (comment content
and commenter’s user name and screen name). In a normalized, relational database,
this same data might call for USER, BLOGPOST, and COMMENT tables.
• Determining the best central entity for forming aggregates is one of the most
important tasks in designing most NoSQL databases.
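The blog example can be made concrete with a small sketch that contrasts the aggregate with its normalized form. All field names and sample values are hypothetical, and the arrays merely stand in for the USER, BLOGPOST, and COMMENT tables.

```javascript
// Aggregate: everything about one blog post kept together in a single document.
const blogPostDocument = {
  _id: 1,
  title: "Hello NoSQL",
  content: "First post.",
  datePosted: "2024-01-15",
  poster: { userName: "jdoe", screenName: "Jane" },
  comments: [{ content: "Nice post!", userName: "asmith", screenName: "Al" }],
};

// Normalized: the same data split across three tables.
const USER = [
  { userId: 1, userName: "jdoe", screenName: "Jane" },
  { userId: 2, userName: "asmith", screenName: "Al" },
];
const BLOGPOST = [
  { postId: 1, userId: 1, title: "Hello NoSQL", content: "First post.", datePosted: "2024-01-15" },
];
const COMMENT = [{ commentId: 1, postId: 1, userId: 2, content: "Nice post!" }];

// Rebuilding the aggregate from the tables requires join-like lookups.
function assemblePost(postId) {
  const post = BLOGPOST.find((p) => p.postId === postId);
  const poster = USER.find((u) => u.userId === post.userId);
  const comments = COMMENT.filter((c) => c.postId === postId).map((c) => {
    const commenter = USER.find((u) => u.userId === c.userId);
    return {
      content: c.content,
      userName: commenter.userName,
      screenName: commenter.screenName,
    };
  });
  return {
    _id: post.postId,
    title: post.title,
    content: post.content,
    datePosted: post.datePosted,
    poster: { userName: poster.userName, screenName: poster.screenName },
    comments,
  };
}

console.log(JSON.stringify(assemblePost(1)) === JSON.stringify(blogPostDocument));
```

Reading the aggregate is a single lookup, whereas the normalized form needs several lookups per post. The trade-off is that user data such as the screen name is repeated inside every document that mentions the user.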
Working with MongoDB
• The name, MongoDB, comes from the word humongous as its developers
intended their new product to support extremely large data sets. It is designed
for:
• High availability
• High scalability
• High performance
• As a document database, MongoDB is schema-less and aggregate aware
• Schema-less means that documents are not required to conform to the same
structure, and the structure of documents does not have to be declared ahead of
time.
Working with MongoDB

• Data is stored in documents, documents of a similar type are stored in
collections, and related collections are stored in a database.
• Documents are formatted using JSON (MongoDB stores them internally in a binary
form of JSON called BSON).
• JavaScript Object Notation (JSON) is a data interchange format that represents
data as a logical object.
• Objects are enclosed in curly brackets {} that contain key-value pairs.
Working with MongoDB (JSON)
• A single JSON object can contain many key:value pairs separated by commas.
• A simple JSON document to store data on a book might look like this:
{_id: 101, title: "Database Systems"}
• This document contains two key:value pairs:
o _id is a key with 101 as the associated value
o title is a key with “Database Systems” as the associated value

• The value component may have multiple values that would be appropriate for a given key
• When there are multiple values for a single key, an array is used.
• Arrays in JSON are placed inside square brackets []. For example, the above document
could be expanded to:
{_id: 101, title: "Database Systems", author: ["Coronel", "Morris"]}
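The book document can be tried out in plain JavaScript. One caveat worth noting: the shell-style notation above leaves keys unquoted, whereas strict JSON text requires keys and strings in straight double quotes, as in this sketch. The variable names are illustrative.

```javascript
// Strict JSON text: keys and strings use straight double quotes.
const jsonText =
  '{"_id": 101, "title": "Database Systems", "author": ["Coronel", "Morris"]}';

// Parse the text into an object and read its key:value pairs.
const book = JSON.parse(jsonText);
const authorCount = book.author.length; // the array holds multiple values for one key
const secondAuthor = book.author[1];    // arrays are indexed from 0

// Serialize the object back into JSON text.
const roundTrip = JSON.stringify(book);

console.log(book.title, authorCount, secondAuthor);
console.log(roundTrip);
```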
Embedded documents
• Objects can also have other objects embedded as a value.
• Consider another simple document with data about a publisher that is
related to the book in the previous example.
Embedded documents
• In a relational environment, we would have used a BOOK table and a
PUBLISHER table with a 1:M relationship.
• Although embedding the publisher in each book document increases redundancy, NoSQL
databases often accept redundancy to improve scalability.
• With document databases, we are attempting to avoid the need for joins,
making documents independent of each other so they can be easily scaled out
to many computers in a cluster.
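A sketch of what embedding might look like for the book example follows. The publisher fields and values, and the second book, are hypothetical, since they are not given in the text.

```javascript
// Hypothetical publisher data embedded in the book document (values are made up).
const bookWithPublisher = {
  _id: 101,
  title: "Database Systems",
  author: ["Coronel", "Morris"],
  publisher: { name: "Example Press", city: "Boston" }, // embedded object as a value
};

// A second (hypothetical) book from the same publisher repeats the publisher data.
const anotherBook = {
  _id: 102,
  title: "Another Title",
  publisher: { name: "Example Press", city: "Boston" },
};

// No join is needed: the publisher is read directly from the book document.
const publisherName = bookWithPublisher.publisher.name;

// The redundancy is visible: the same publisher data is stored once per book.
const duplicated =
  JSON.stringify(bookWithPublisher.publisher) === JSON.stringify(anotherBook.publisher);

console.log(publisherName, duplicated);
```

Each book document is self-contained, so the two documents could live on different servers without either needing the other to answer a query about its publisher.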
Creating Databases and Collections in
MongoDB
• MongoDB databases comprise collections of documents.
• Each MongoDB server can host many databases.
• A database object contains collections. Collections are also objects.
Collection objects contain document objects.
• In addition to holding data content, an object can also have methods,
which are programmed functions for manipulating the object.
• MongoDB has two versions of the command-line MongoDB shell (the legacy mongo
shell and the newer mongosh) and a graphical interface called MongoDB Compass.
• A list of the databases available on the server can be retrieved with the
command:
show dbs
• The following command switches to a database named demo. MongoDB creates the
database when data is first stored in it:
use demo
• Using the createCollection() method with the db variable creates a
collection with the specified name. The following command creates a
“newproducts” collection inside the previously defined demo database:
db.createCollection("newproducts")
Inserting Documents in MongoDB

• db.<collection name>.insertOne({document})

• db.<collection name>.insertMany([{document1}, {document2}, {document3}])
• The following command displays all of the documents in the products
collection:
db.products.find()
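To show how the object hierarchy and its methods fit together without a running server, the sketch below uses a toy in-memory stand-in. `ToyDatabase` and `ToyCollection` are hypothetical classes written for this illustration; they only imitate the shape of the shell commands and are not MongoDB's driver or implementation. The sample products are made up.

```javascript
// Toy in-memory stand-in for a collection object (NOT the MongoDB driver or server).
class ToyCollection {
  constructor() {
    this.documents = [];
    this.nextId = 1; // assumption: simple counter instead of MongoDB's generated ids
  }

  // Store one document, assigning an _id when none is supplied.
  insertOne(doc) {
    const id = doc._id ?? this.nextId++;
    this.documents.push({ ...doc, _id: id });
    return id;
  }

  // Store several documents passed as an array.
  insertMany(docs) {
    return docs.map((doc) => this.insertOne(doc));
  }

  // With no arguments, return copies of all documents in the collection.
  find() {
    return this.documents.map((doc) => ({ ...doc }));
  }
}

// Toy database object: collections become properties, as in db.newproducts.
class ToyDatabase {
  constructor(name) {
    this.name = name;
  }

  createCollection(collectionName) {
    this[collectionName] = new ToyCollection();
    return this[collectionName];
  }
}

const db = new ToyDatabase("demo");
db.createCollection("newproducts");
db.newproducts.insertOne({ name: "Widget", price: 9.99 });
db.newproducts.insertMany([{ name: "Gadget", price: 19.5 }, { name: "Gizmo" }]);

const allProducts = db.newproducts.find();
console.log(allProducts);
```

The third document has no price, which the collection accepts because no structure was declared ahead of time.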
