
Transaction Management & Concurrency Control

What is a Transaction?
• Logical unit of work that must be either entirely completed or aborted
- A successful transaction changes the database from one consistent state to another
- A consistent state is one in which all data integrity constraints are satisfied
• Most real-world database transactions are formed by two or more database requests
- A database request is the equivalent of a single SQL statement in an application program or transaction
Transaction Results
• Not all transactions update the database - SQL code that only reads data still represents a transaction because the database was accessed
• Improper or incomplete transactions can have a devastating effect on database integrity
- Some DBMSs provide means by which the user can define enforceable constraints
- Other integrity rules are enforced automatically by the DBMS
Transaction Properties (ACID)
• Multiuser databases are subject to multiple concurrent transactions
Atomicity • All operations of a transaction must be completed
• All of the work in a transaction completes (commit) or none of it completes
Consistency • Permanence of the database's consistent state
• A transaction transforms the database from one consistent state to another, in terms of its integrity constraints
Isolation • Data used during transaction cannot be used by second transaction until the first is completed
• The results of any changes made during a transaction are not visible until the transaction has committed.
Durability • Once transactions are committed, they cannot be undone
• The results of a committed transaction survive failures
Serializability • Concurrent execution of several transactions yields consistent results

Transaction Management (SQL)
• ANSI has defined standards that govern SQL database transactions; transaction support is provided by two SQL statements: COMMIT and ROLLBACK
• Transaction sequence must continue until:
- COMMIT statement is reached - ROLLBACK is reached - End of program is reached - Program is abnormally terminated
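A minimal sketch of the COMMIT/ROLLBACK flow using Python's built-in sqlite3 module; the account table and the transfer amounts are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Transfer 30 from account 1 to account 2 as one logical unit of work
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()          # COMMIT: make both updates permanent
except sqlite3.Error:
    conn.rollback()        # ROLLBACK: undo the partial work, restoring the previous consistent state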
Transaction Log
• A record for the beginning of the transaction; for each transaction component:
- Type of operation being performed (update, delete, insert)
- Names of objects affected by transaction
- “Before” and “after” values for updated fields
- Pointers to previous and next transaction log entries for the same transaction.
• Ending (COMMIT) of the transaction
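As a rough illustration (the field names are invented, not a specific DBMS log format), one log entry for an UPDATE might carry the information listed above:

log_entry = {
    "trx_id": 101,            # transaction this entry belongs to
    "prev_ptr": 512,          # pointer to the previous log entry of the same transaction
    "next_ptr": None,         # pointer to the next log entry (filled in later)
    "operation": "UPDATE",    # update / delete / insert / COMMIT
    "object": "account.balance",
    "before": 100.0,          # value before the change (used to roll back)
    "after": 70.0,            # value after the change (used to redo)
}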
Transaction Concurrency Control
• Coordination of simultaneous transaction execution in a multiprocessing database
• Objective is to ensure serializability of transactions in a multiuser environment
Lost Updates problem • Two concurrent transactions update the same data element
- One of the updates is lost → overwritten by the other transaction
Uncommitted Data phenomenon • Two transactions are executed concurrently
- The first transaction is rolled back after the second has already accessed the uncommitted data

Inconsistent Retrievals • First transaction accesses data - Second transaction alters the data
- First transaction accesses the data again; the transaction might read some data before they are changed and other data after they are changed, which yields inconsistent results.
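A small Python sketch of the lost-update problem: two threads read the same balance and both write back, so one update can overwrite the other; with the lock the updates are serialized (timings make the race rare in such a tiny example, but the hazard is the same):

import threading

balance = 100                  # shared data element
lock = threading.Lock()

def deposit(amount, use_lock):
    global balance
    if use_lock:
        with lock:             # exclusive access: no lost update
            balance = balance + amount
    else:
        read = balance         # both transactions may read 100
        read = read + amount
        balance = read         # the slower writer overwrites the other update

threads = [threading.Thread(target=deposit, args=(10, False)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)                 # expected 120, but may print 110 when an update is lost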
The Scheduler
• Special DBMS program
- Establishes the order in which the operations of concurrent transactions are executed
• Interleaves execution of database operations
- Ensures serializability, yields same results as serial execution
- Ensures isolation
Concurrency Control (Locking Methods)
Lock • Guarantees exclusive use of a data item to a current transaction
• Required to prevent another transaction from reading inconsistent data
Lock manager • Responsible for assigning and policing the locks used by transactions
Lock Granularity (Indicates level of lock use)
Database-level lock • Entire database is locked
Table-level lock • Entire table is locked
Page-level lock • Entire disk page is locked
Row-level lock • Allows concurrent transactions to access different rows of same table
• Even if rows are located on same page
Field-level lock • Allows concurrent transactions to access same row
• Requires use of different fields (attributes) within the row
Lock Types
Binary lock • Two states: locked (1) or unlocked (0)
Exclusive lock • Access is specifically reserved for transaction that locked object
• Must be used when potential for conflict exists
Shared lock • Concurrent transactions are granted read access on basis of a common lock

Concurrency Control with Time Stamping Methods


• Assigns a global, unique time stamp to each transaction, producing an explicit order in which transactions are submitted to the DBMS
Uniqueness • Ensures that no equal time stamp values can exist
Monotonicity • Ensures that time stamp values always increase
Schemes
Wait/die • Older transaction waits and younger is rolled back and rescheduled

Wound/wait • Older transaction rolls back (wounds) the younger transaction and reschedules it
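Both schemes reduce to a decision on the timestamps of the requesting and holding transactions; a sketch (smaller timestamp = older transaction):

def wait_die(ts_requester, ts_holder):
    # Wait/die: the older requester waits; a younger requester dies (is rolled back and rescheduled)
    return "wait" if ts_requester < ts_holder else "rollback requester"

def wound_wait(ts_requester, ts_holder):
    # Wound/wait: the older requester wounds (rolls back) the younger holder; a younger requester waits
    return "rollback holder" if ts_requester < ts_holder else "wait"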

Database Recovery Management (Restores database to previous consistent state)


• Based on atomic transaction property
- All portions of transaction treated as single logical unit of work
- All operations applied and completed to produce consistent database
• If transaction operation cannot be completed
- Transaction aborted
- Changes to database rolled back
Transaction Recovery
Write-ahead-log protocol • Ensures transaction logs are written before data is updated
Redundant transaction logs • Ensure physical disk failure will not impair ability to recover
Buffers • Temporary storage areas in primary memory
Checkpoints • Operations in which DBMS writes all its updated buffers to disk
Deferred-write technique (Only transaction log is updated)
1. Identify last checkpoint
2. If transaction committed before checkpoint → Do nothing
3. If transaction committed after checkpoint → Use transaction log to redo the transaction
4. If transaction had ROLLBACK operation → Do nothing
Write-through technique (Database is immediately updated by transaction operations during transaction’s execution)
1. Identify last checkpoint
2. If transaction committed before checkpoint → Do nothing
3. If transaction committed after checkpoint → DBMS redoes the transaction using “after” values
4. If transaction had ROLLBACK or was left active → Use the transaction log "before" values to undo (roll back) the updates, because the database was already updated
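A rough Python sketch of the write-through recovery decisions above; the transaction, log-entry, and db objects are hypothetical placeholders, not a real DBMS API:

def recover(transactions, checkpoint_time, db):
    """Redo committed work after the last checkpoint; undo rolled-back or still-active work."""
    for trx in transactions:
        if trx.committed and trx.commit_time <= checkpoint_time:
            continue                                  # already safely on disk: do nothing
        if trx.committed:                             # committed after the checkpoint: redo
            for entry in trx.entries:
                db.write(entry.object, entry.after)   # apply the "after" values
        else:                                         # rolled back or left active: undo
            for entry in reversed(trx.entries):
                db.write(entry.object, entry.before)  # restore the "before" values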
Summary
• Transaction: a sequence of database operations that access the database; a logical unit of work
• No portion of transaction can exist by itself
• Five main properties: atomicity, consistency, isolation, durability, and serializability
• COMMIT saves changes to disk
• ROLLBACK restores previous database state
• SQL transactions are formed by several SQL statements or database requests
• Transaction log keeps track of all transactions that modify database
• Concurrency control coordinates the simultaneous (interleaved) execution of transactions
• Scheduler establishes order in which concurrent transaction operations are executed
• Lock guarantees unique access to a data item by transaction
• Two types of locks: binary locks and shared/exclusive locks
• Serializability of schedules is guaranteed through the use of two-phase locking
• Deadlock: when two or more transactions wait indefinitely for each other to release lock
• Three deadlock control techniques: prevention, detection, and avoidance
• Time stamping methods assign unique time stamp to each transaction
• Schedules execution of conflicting transactions in time stamp order
• Optimistic methods assume the majority of database transactions do not conflict
• Transactions are executed concurrently, using private copies of the data
• Database recovery restores database from given state to previous consistent state
Distributed Databases
What constitutes a distributed database?
• Connection of database nodes over a computer network, logical interrelation of the connected databases, and possible absence of homogeneity among the nodes
Centralized vs. Distributed
• Centralized: stored, located, and maintained at a single location only. Distributed: consists of multiple databases connected with each other, spread across different locations.
• Centralized: data access time for multiple users is higher. Distributed: data access time for multiple users is lower.
• Centralized: management, modification, and backup are easier, as the entire data is present at the same location. Distributed: management, modification, and backup are more difficult, as the data is spread across different physical locations.
• Centralized: provides a uniform and complete view to the user. Distributed: since it is spread across different locations, it is difficult to provide a uniform view to the user.
• Centralized: more data consistency than a distributed database. Distributed: may have some data replication, so data consistency is lower.
• Centralized: users cannot access the database if a database failure occurs. Distributed: if one database fails, users still have access to the other databases.
• Centralized: less costly. Distributed: very expensive.

Transparency : Refers to hiding implementation details from end users for a seamless experience.
• Data organization transparency → Location - Naming • Design transparency
• Fragmentation transparency → Horizontal - Vertical • Execution transparency
• Replication transparency
Availability • The probability that the distributed system is continuously available during a specified time interval.
Reliability • The probability that the distributed system is running (not down) at a certain point in time.
Horizontal scalability • Expanding the number of nodes
Vertical scalability • Expanding capacity of individual nodes
Partition tolerance • System continues operating during network partitioning
Autonomy : Determines the extent to which nodes can operate independently
Design autonomy • independence of data model and transaction management techniques
Communication autonomy • determines sharing of information between nodes
Execution autonomy • independence of user actions
Advantages of Distributed Databases
• Improved ease and flexibility of application development: Development can occur at geographically dispersed sites.
• Improved performance: Data localization reduces network traffic and latency.
• Increased availability: Faults are isolated to their site of origin, ensuring availability.
• Easier expansion via scalability: adding nodes or increasing individual node capacity is easier compared to non-distributed systems.
Data Fragments • Logical units of the database that are distributed across nodes.

Horizontal fragmentation (sharding): • Divides a relation into subsets of tuples, called shards, by specifying conditions on attributes or
other methods. It groups rows to create these subsets.
Vertical fragmentation • Dividing relation by columns, keeping only certain attributes of the relation.
Complete horizontal fragmentation • Apply UNION operation to the fragments to reconstruct relation
Complete vertical fragmentation • Apply OUTER JOIN operation to reconstruct relation
Mixed fragmentation • Combination of horizontal and vertical fragmentations
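A small Python sketch of horizontal and vertical fragmentation of one relation and its reconstruction; the EMPLOYEE rows are made-up sample data:

employees = [
    {"id": 1, "name": "Ali",  "dept": "Cairo", "salary": 900},
    {"id": 2, "name": "Mona", "dept": "Giza",  "salary": 1100},
]

# Horizontal fragmentation (sharding): subsets of tuples selected by a condition
cairo_shard = [r for r in employees if r["dept"] == "Cairo"]
giza_shard  = [r for r in employees if r["dept"] == "Giza"]
reconstructed = cairo_shard + giza_shard            # UNION of the fragments

# Vertical fragmentation: keep only certain attributes, plus the key needed to rejoin
names    = [{"id": r["id"], "name": r["name"]}     for r in employees]
salaries = [{"id": r["id"], "salary": r["salary"]} for r in employees]
rejoined = [{**n, **s} for n in names for s in salaries if n["id"] == s["id"]]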

Replication and allocation • Strategies for storing fragments and replicas across nodes for performance and availability.
Fully replicated distributed DB • Whole database is replicated at every site.
Nonredundant allocation • Each fragment is stored at exactly one site.
Partial replication • Some fragments are replicated while others are not.

Types of Distributed Database Systems


Homogeneous DDBMS • Uniform software across nodes.
Heterogeneous DDBMS • Nodes with different software and configurations.
• Multi database system and federated database system (FDBS) with shared global schema.
• Federated database management systems issues: Differences in data models, constraints, query languages, and semantic heterogeneity.
Client-Server database architecture
• It consists of clients running client software, a set of servers which provide all database functionalities and a reliable
communication infrastructure.
NoSQL Databases
• Relational databases → mainstay of business
• Hooking an RDBMS to a web-based application becomes troublesome
• Web-based applications caused spikes → Explosion of social media sites (Facebook, Twitter) with large data needs
- Rise of cloud-based solutions such as Amazon S3 (simple storage solution)

What is NoSQL ? (Not Only SQL)


• Introduced by Carlo Strozzi in 1998 to name his file-based database.
• Re-introduced by Eric Evans: "the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for."
Key features (advantages):
• Non-relational
• Don't require a schema
• Data are replicated to multiple nodes (identical & fault-tolerant) and can be partitioned: down nodes are easily replaced
• Horizontally scalable
• Cheap, easy to implement (open source)
• Massive write performance
• Fast key-value access
NoSQL Distinguishing Characteristics
• Large data volumes → Google's "big data"
• Scalable replication and distribution (potentially thousands of machines, distributed around the world)
• Mostly query, few updates
• ACID transaction properties are not needed
• CAP Theorem
• Open source development
Disadvantages:
• Don’t fully support relational features
- No join, group by, order by operations
- No referential integrity constraints across partitions
- No declarative query language (SQL) → more programming
• Relaxed ACID (see CAP theorem) → fewer guarantees
- (ACID: Atomicity, Consistency, Isolation, Durability)
- (CAP: Consistency, Availability, Partition tolerance)
• No easy integration with other applications that support SQL
Relational databases • Data stored as table rows, Relationships between related rows
• Single entity spans multiple tables
• RDBMS systems are very mature, rock solid
NoSQL databases • Data stored as documents
• Single entity (document) is a single record
• Documents do not have a fixed structure
CAP Theorem: consider three properties of a distributed system (sharing data)
Consistency • All copies have same value
Availability • Reads and writes always succeed
Partition-tolerance • System properties (consistency and/or availability) hold even when network failures
prevent some machines from communicating with others
• Brewer’s CAP Theorem:
- For any system sharing data, it is “impossible” to guarantee simultaneously all of these three properties
- You can have at most two of these three properties for any shared-data system
• Cloud computing
- ACID is hard to achieve; moreover, it is not always required (for blogs, status updates, product listings, etc.)
NoSQL Categories → "No-schema" common characteristic that provides "flexible" data types
Key-Value Databases
• Key-value databases allow applications to store data in a schema-less way
- No fixed data model → the data could be stored in a datatype of a programming language or as an object
• You can store whatever you like in the values
- It is the responsibility of the application to understand what was stored
• Designed to handle massive load
- Data model: a (global) collection of key-value pairs
What is Redis ?
- Ultra-fast in-memory key-value data store
- Open-source software
- Powerful data structure server; stores data structures: strings, lists, hash tables, sets / sorted sets
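A short usage sketch with the redis-py client (assumes a Redis server on localhost and the redis package installed; the key names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("user:1001:name", "Jaroslav")                              # plain key-value pair
r.rpush("user:1001:phones", "123-456-7890", "234-567-8963")      # list value
r.hset("user:1001:profile", mapping={"city": "Praha", "age": "42"})   # hash value

print(r.get("user:1001:name"))
print(r.lrange("user:1001:phones", 0, -1))
print(r.hgetall("user:1001:profile"))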
What is DynamoDB ?
- items having one or more attributes (name, value)
- An attribute can be single-valued or multi-valued like set.
- items are combined into a table

Pros:
• Very fast
• Very scalable (horizontally distributed to nodes based on key)
• Simple data model
• Eventual consistency
• Fault tolerance
Cons:
• Can't model more complex data structures such as objects
Document-based
• Can model more complex objects
- Data model: collection of documents, i.e. semi-structured data
• Unlike key-value stores, there are some limits:
- Definition of allowable structures and types - More flexibility when accessing data: documents are indexed, allowing quick classification
• Document format: JSON (JavaScript Object Notation, a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates, and Booleans with nesting), plus other semi-structured formats
• Examples: MongoDB, Apache Solr, Elasticsearch
What is MongoDB?
- Very powerful and mature NoSQL database
- Scalable, high-performance, open-source
- JSON-style document storage, schemaless
- Replication & high-availability support
- Auto sharding – clustering & data partitioning
- Indexing and powerful querying
- Map-Reduce – parallel data processing
- GridFS – store files of any size
What is Apache CouchDB?
- Open-source NoSQL database
- Document-based database server: stores JSON documents
- HTTP-based API
- Query, combine, and transform documents with JavaScript
- On-the-fly document transformation
- Real-time change notifications
- Highly available and partition tolerant
• Ad-hoc and schema-free with a flat address space.
- The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties.
• Document updates (add, edit, delete) are serialized, except for binary blobs which are written concurrently.
- CouchDB read operations use a Multi-Version Concurrency Control (MVCC) model where each client sees a consistent
snapshot of the database from the beginning to the end of the read operation.
• Eventually Consistent
Column-based
• Based on Google's BigTable paper
- Tables similar to an RDBMS, but handle semi-structured data
• Data model:
- Collection of Column Families: (key, value) where value = set of related columns (standard, super)
- indexed by row key, column key and timestamp
• One column family can have variable numbers of columns
- Cells within a column family are sorted “physically”
• Query on multiple tables
- RDBMS: must fetch data from several places on disk and glue together
- Column-based NOSQL: only fetch column families of those columns that are required by a query
(all columns in a column family are stored together on the disk, so multiple rows can be retrieved in one read operation → data locality)
- Ex: BigTable, Cassandra, HBase
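A rough illustration of the column-family data model as nested Python dicts, indexed by row key, column family, column, and timestamp (the row, family, and column names are invented):

# row key -> column family -> column -> {timestamp: value}
users = {
    "row:ahmed": {
        "profile":  {"name": {1700000000: "Ahmed"}, "city": {1700000000: "Cairo"}},
        "activity": {"last_login": {1700050000: "2024-01-05"}},
    }
}

# A query touching only the "profile" family never has to read the "activity" columns
profile = users["row:ahmed"]["profile"]
print(profile)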
Graph-based
• Focus on modeling the structure of data (interconnectivity)
- Scales to the complexity of data
• Data model: (Property Graph) nodes and edges
- Nodes may have properties (including ID)
- Edges may have labels or roles
- Key-value pairs on both
• Interfaces and query languages vary
- Ex: Neo4j, FlockDB, Pregel, InfoGrid
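A tiny property-graph sketch in plain Python, where nodes and edges both carry key-value properties (the names and relationships are illustrative):

nodes = {
    "n1": {"label": "Person",  "name": "Alice"},
    "n2": {"label": "Person",  "name": "Bob"},
    "n3": {"label": "Company", "name": "Acme"},
}
edges = [
    {"from": "n1", "to": "n2", "role": "FRIEND_OF", "since": 2019},
    {"from": "n1", "to": "n3", "role": "WORKS_AT",  "title": "Engineer"},
]

# Traverse Alice's outgoing relationships
for e in edges:
    if e["from"] == "n1":
        print(e["role"], "->", nodes[e["to"]]["name"])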
What is Neo4j ?
- Open source, Java-based - Stores data as nodes and relationships
- Neo4j is a NoSQL Graph Database - Full ACID transactions
- Database full of linked nodes - Schema free, bottom-up data model design
• Nodes represent entities, Edges represent relationships
- Connections between data are explored
- Faster for associative data sets
- Intuitive
• Optimal for searching social network data
- Strengths: powerful data model; fast → for connected data, can be many orders of magnitude faster than an RDBMS
- Weaknesses: sharding → though they can scale reasonably well, and for some domains you can shard too
Schemaless Databases
• Easily store whatever you need, and change your data storage as you learn more about your project
• Easily add new things as you discover them
• Easier to deal with non-uniform data: data where each record has a different set of fields (limiting sparse data storage)
Conclusion
• NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
• NoSQL is an approach to data management that is useful for very large sets of distributed data
• NoSQL should not be misleading: the approach does not prohibit Structured Query Language (SQL)
• And indeed they are commonly referred to as “NotOnlySQL”
Example: (MongoDB) document
{ Name: "Jaroslav",
  Address: "Malostranske nám. 25, 118 00 Praha 1",
  Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
  Phones: [ "123-456-7890", "234-567-8963" ]
}
Document-Based : MongoDB
• Document-oriented NOSQL systems typically store data as collections of similar documents.
- Although the documents in a collection should be similar, they can have different data elements (attributes), and new
documents can have new data elements that do not exist in any of the current documents in the collection.
- The system basically extracts the data element names from the self-describing documents in the collection.
• A popular language to specify documents in NOSQL systems is JSON (JavaScript Object Notation).
- A document database is a type of NoSQL database which stores data as JSON documents instead of columns and rows.
- JSON is a native language used to both store and query data.
- These documents can be grouped together into collections to form database systems.
Example of a document that consists of 4 key-value pairs:
{
  "ID" : "001",
  "Book" : "Java: The Complete Reference",
  "Genre" : "Reference work",
  "Author" : "Herbert Schildt"
}
• Using JSON enables developers to store and query data in the same document-model format.
• The model can be converted into other formats, such as JSON, BSON, and XML.
Benefits of Document Databases
- Everything is available in a single database, rather than having information spread across several linked databases.
- Get better performance compared to an SQL database.
- Unlike in conventional databases, where a field exists for each piece of information even if there's nothing in it, a document store is more flexible.
- Since it is more flexible compared to a relational database, any new type of information does not have to be added to all datasets.
Conventional Table Vs Document-based database

- Document database is a type of non-relational database that is designed to store and query data as JSON-like documents.
- The document model works well with use cases such as catalogs, user profiles, and content management systems where each
document is unique and evolves over time.
- Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.
Document Databases Use Cases→ Book Database
• The relational approach would represent the relationship between books and authors via tables with IDs (PK & FK).
By comparison, the document model shows relationships more naturally and simply by ensuring that each author document has
a property called Books, with an array of related book documents in the property. When you search for an author, the entire book
collection appears.
• Content Management: Developers use document databases to create video streaming platforms, blogs, and similar services. Each file is stored as a single document and the database is easier to maintain. Significant data modifications, such as data model changes, require no downtime.
• Catalogs may have thousands of attributes stored. In document databases, attributes related to a single product are stored in a single document. Modifying one product's attributes does not affect other documents.
Best Document Databases
Amazon DocumentDB • MongoDB-compatible, Fully managed - High performance with low latency querying
• Strong compliance and security, and High availability - Amazon’s entire development team
• The BBC uses it for querying and storing data from multiple data streams
• Rappi switched to Amazon DocumentDB to reduce coding time.
MongoDB • Ad hoc queries - Optimised indexing for querying, Sharding, Load-balancing
• Forbes decreased build time by 58%, gaining a 28% increase in subscriptions due to quicker
building of new features,
• Toyota found it much simpler for developers to work at high speeds by using natural JSON
documents. More time is spent on building the business value instead of data modelling.
Cosmos DB • Any scale fast reads - 99.999% availability - Serverless, cost-effectively/instantly scales
• Coca-Cola gets insights delivered in minutes, facilitating global scaling.
• Before migrating to Cosmos DB, it took hours.
• ASOS needed a distributed database that flexibly and seamlessly scales to handle over 100
million global retail customers.
MongoDB
• A database resides on a MongoDB server
• A MongoDB database consists of collections of documents
• Schema-free (documents in a collection may be heterogeneous)
• Main abstraction and data structure is a document
Example document:
{
  title : "MongoDB",
  last_editor : "172.5.123.91",
  last_modified : new Date("9/23/2010"),
  body : "MongoDB is a ...",
  categories : ["Database", "NoSQL", "Document Database"],
  reviewed : false
}
MongoDB Data Model
• Individual documents are stored in a collection, which has no schema
• Documents in a collection should be similar, but they can have different data elements (attributes)
• Self-describing data - Users can request to create indexes on some of the data elements
• Documents are stored in BSON (binary JSON) format, more efficient than JSON
Schema Free
• MongoDB does not need any pre-defined data schema; every document in a collection could have different data
• Addresses NULL data fields
JSON format
• Data is in name/value pairs; a pair consists of a field name followed by a colon, followed by a value:
- Example: “name”: “R2-D2”
• Data is separated by commas → Example: “name”: “R2-D2”, race : “Droid”
• Curly braces hold objects → Example: {“name”: “R2-D2”, race : “Droid”, affiliation: “rebels”}
• An array is stored in brackets [ ] → [{“name”: “R2-D2”, race : “Droid”, affiliation: “rebels”}, {“name”: “Yoda”, affiliation:
“rebels”} ]
MongoDB CRUD Operations
• Documents created and inserted using the insert operation:
- db.<collection_name>.insert(<document(s)>)
• The delete operation is called remove, and the format is:
- db.<collection_name>.remove(<condition>)
• The update operation, which has a condition to select certain documents, and a $set clause to specify the update.
• The read operation is called find, and the format is:
- db.<collection_name>.find(<condition>)
CRUD operations
Create db.collection.insert( <document> )
db.collection.save( <document> )
db.collection.update( <query>, <update>, { upsert: true } )
Read db.collection.find( <query>, <projection> )
db.collection.findOne( <query>, <projection> )
Update db.collection.update( <query>, <update>, <options> )

Delete db.collection.remove( <query>, <justOne> )

Query Operators
Name Description
$eq Matches values that are equal to a specified value
$gt, $gte Matches values that are greater than (or equal to) a specified value
$lt, $lte Matches values that are less than (or equal to) a specified value
$ne Matches values that are not equal to a specified value
$in Matches any of the values specified in an array
$nin Matches none of the values specified in an array
$or Joins query clauses with a logical OR
$and Joins query clauses with a logical AND
$not Inverts the effect of a query expression
$nor Joins query clauses with a logical NOR
$exists Matches documents that have a specified field
MongoDB Example
• Create a collection named mycoll with 10,000,000 bytes of preallocated disk space and no automatically generated and indexed document field
- db.createCollection("mycoll", { size: 10000000, autoIndexId: false })
• Add a document into mycoll
- db.mycoll.insert({ title: "MongoDB", last_editor: ... })
• Retrieve a document from mycoll
- db.mycoll.find({ categories: ["NoSQL", "Document Databases"] })
To populate the inventory collection
• Examples of query operations using the db.collection.find() method in the mongo shell.
• The examples use the inventory collection.
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);

Query Documents
• To select all documents in the collection: db.inventory.find({})
• Selects from the inventory collection all documents where the status equals "D": db.inventory.find( { status: "D" } )
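The same inventory queries expressed through the PyMongo driver (a sketch assuming a local mongod, the pymongo package, and the inventory collection populated as shown above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
inventory = client["test"]["inventory"]

inventory.insert_one({"item": "journal", "qty": 25, "status": "A"})   # create

print(list(inventory.find({})))                       # all documents
print(list(inventory.find({"status": "D"})))          # status equals "D"
print(list(inventory.find({"qty": {"$gte": 50}})))    # using a query operator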
MongoDB Distributed Systems Characteristics
• Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern for
specifying transactions on multiple documents.
• Replication in MongoDB. The concept of a replica set is used in MongoDB to create multiple (typically three) copies of the same data set on different nodes in the distributed system.
• All write operations must be applied to the primary copy and then propagated to the secondaries.
• For read operations, the user can choose which copy to read from.
• The default read preference processes all reads at the primary copy, so all read and write operations are performed
at the primary node.
• Sharding in MongoDB. Sharding of the documents in the collection—also known as horizontal partitioning—
divides the documents into disjoint partitions known as shards.
• This allows the system to add more nodes as needed and to store the shards of the collection on different nodes to
achieve load balancing.
• Sharding and replication are used together; sharding focuses on improving performance via load balancing and
horizontal scalability, whereas replication focuses on ensuring system availability when certain nodes fail in the
distributed system.

Create Collection
• db.createCollection("worker", { capped : true, size : 5242880, max : 2000 })
• db.createCollection("project", { capped : true, size : 1310720, max : 500 })
find
• db.collection_name.find()
• db.collection_name.findOne({query}, {projection})
• db.collection_name.find().sort({name:1/-1})
• db.collection_name.find({parts: “hammer”}).limit(5)
update
• db.collection_name.update({query}, {update}, {upsert: true})
• db.collection_name.updateMany({query},{update})
• db.collection_name.updateOne({name:"marwan"}, {$set:{fulltime:true}})
Delete
• db.collection_name.deleteOne({query})
• db.collection_name.deleteMany({query})
• db.collection_name.deleteOne({name:"marawn"})
• db.collection_name.save(<document>)
• db.collection_name.findAndModify(<query>, <sort>, <update>, <new>, <fields>, <upsert>)
• db.collection_name.remove(<query>, <justOne>)
• db.collection_name.remove({type: /^h/})
• db.collection_name.remove({})
• db.collection_name.save(<document>)
Note:
• DB → Collections → Documents → BSON
mongosh establish connection
use mydb Create db
show dbs Show all dbs
show collections Show all collections
exit exit command
cls clear command
db.dropDatabase() Delete Database
db.createCollection() Create Collection
db.createCollection("students", {capped:true, size:10000, max:100, autoIndexId:false}) Advanced Create Collection
db.coll_name.insertOne({}) insert document
db.coll_name.insertMany([{},{},{}]) Insert many document
find({criteria})
db.coll_name.find({name:"Marwan Magdy"}) Find document
db.coll_name.find({age:{$lt:20}}) Less than
db.coll_name.find({age:{$gte:20}}) Greater than or equal to
db.coll_name.find({gpa:{$gte:3, $lt:4}}) Greater than or equal to and less than
db.coll_name.find({name:{$in:["",""]}}) Equal to a value in the array
db.coll_name.find({},{_id:false, name:true})
db.coll_name.find().sort({gpa:1}).limit(5) ascending with limiting results
Logical operator
db.coll_name.find({$and:[{fullTime:true},{age:{$gte:20}}]})
db.coll_name.find({age:{$not:{$gte:20}}})
updateOne({filter},{update})
db.coll_name.updateOne({name:"Marwan Magdy"},{$set:{fullTime:true}})
db.coll_name.updateOne({name:"Marwan Magdy"},{$unset:{fullTime:""}}) Remove attribute from document
db.coll_name.updateMany({fullTime:{$exists:false}},{$set:{fullTime:false}})
db.coll_name.updateMany({},{$set:{fullTime:true}}) Update all documents attributes
deleteOne({filter})
db.coll_name.deleteOne({name:"Marwan Magdy"}) Delete one document
deleteMany({criteria})
db.coll_name.deleteMany({name:true}) Delete many documents
db.coll_name.deleteMany({name:{$exists:false}}) Delete many documents with criteria
Data Warehousing & Business Intelligence
Business Intelligence
• Used by decision makers to get a comprehensive knowledge of the business and to define and support their business strategies.
• Enable data-based decisions aimed at: gaining competitive advantage, improving operative performance, responding more
quickly to changes, increasing profitability, and, in general, creating added value for the company.
Basic BI Architectural Components
ETL Tools • Data extraction, transformation, and loading (ETL) tools collect, filter, integrate, and aggregate internal and external data to be saved into a data store optimized for decision support.
Data Store • Optimized for decision support and is generally represented by a data warehouse or a data mart.
• The data is stored in structures that are optimized for data analysis and query speed .
Query and Reporting • Performs data selection and retrieval, and it is used by the data analyst to create queries that access the
database and create the required reports.
Data Monitoring and Alerting • Allows real-time monitoring of business activities, in a single integrated view.
• Alerts can be placed on a given metric; once the value of a metric goes below or above a certain baseline, the system will perform a given action, such as emailing shop floor managers.
Data Analytics • Performs data analysis and data-mining tasks using the data in the data store.
• Advises the user how to build a reliable business data model that identifies and enhances the understanding of business situations and problems
Data Visualization • Presents data to the end user in a variety of meaningful and innovative ways.
• Helps the end user Select the most appropriate presentation format, such as summary reports, maps, pie or
bar graphs, mixed graphs, and static or interactive dashboards
Sample of Business Intelligence Tools
Dashboards and Business Activity Monitoring • Web-based technologies to present key business performance indicators or information in a single integrated view, generally using graphics that are clear, concise, and easy to understand.
- Salesforce - IBM/Cognos – iDashboards - Tableau
Data Warehouses • Foundation of a BI infrastructure. Data is captured from the production system and placed in the DW.
- IBM/Cognos – Oracle - Microsoft
OLAP • Online analytical processing (OLAP) tools provide multidimensional data analysis.
- IBM/Cognos – Oracle - Microsoft
Data-Mining • Provide advanced statistical analysis to uncover problems and opportunities hidden within business data.
- MS Analytics Services - Orange
Data Visualization • Provide advanced visual analysis and techniques to Enhance understanding and create additional insight of
business data and its true meaning.
- Dundas – Tableau – QlikView - Microsoft PowerBI
Business Intelligence Benefits
1. Improved decision making
2. Integrating Architecture → Integrating umbrella for a disparate mix of IT systems within an organization.
3. Common user interface for data reporting and analysis → Provide information using a common interface for all company users.
4. Common data repository fosters single version of company data
5. Improved organizational performance → Provide advantages in many different areas, that can be reflected in added efficiency, reduced
waste, increased sales, and reduced employee and customer turnover.
Operational Vs. Analytical Information
Operational information • Information collected and used in support of day-to-day operational needs in businesses and other organizations.
- Data makeup differences: typical time horizon is days/months; detailed; current; non-redundant
- Technical differences: small amounts used in a process (transactional); high frequency of access; can be updated
- Functional differences: used by all types of employees for tactical purposes; application oriented

Analytical information • Information collected and used in support of analytical tasks
• Based on operational (transactional) information
- Data makeup differences: typical time horizon is years; summarized (and/or detailed); values over time (snapshots); redundancy not an issue
- Technical differences: large amounts used in a process; low/modest frequency of access; read (and append) only
- Functional differences: used by a narrower set of users for decision making; subject oriented
Data Warehouse (Can store detailed and/or summarized data)
• Operational data sources include the databases and other data repositories which are used to support the organization’s day-to-day operations
• Created within an organization as a separate data store whose primary purpose is data analysis.
• Electronic system that gathers data from a wide range of sources within a company and uses the data to support management decision-making.
• Database containing analytically useful information of subject-oriented, integrated, enterprise-wide, historical, and time-variant data.
• Retrieval of analytical information.
Subject-oriented • The data items that describe a subject area (an entity) are linked together.
• Examples of subjects might be customers, products, and sales transactions.
Non-volatile • Data warehouse data is static—not updated or deleted once it is stored.
Integrated • Data that may be from disparate sources is made coherent and consistent.
Time-variant • All data stored in the warehouse is identified as belonging to a specific time period

Why Separate Data Warehouse?


High performance for • DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
both systems • Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions • Decision support requires historical data which operational DBs do not maintain
and different data • consolidation (aggregation, summarization) of data from heterogeneous sources

Requirements for a DW Architecture


Separation • Analytical and transaction processing should be kept apart as much as possible.
Scalability • HW and SW should be easy to upgrade as the data volume and the number of user requirements increase.
Extensibility • Should be possible to host new applications and technologies without redesigning the whole system.
Security • Monitoring accesses is essential because of the strategic data stored in DW.
Administerability • Administration not too difficult.

Data Warehouse Components


Source Systems • Operational databases and other operational data repositories (in other words, any sets of data used for
operational purposes) that provide analytically useful information for the data warehouse’s subjects of analysis
• Can include external data sources
• Every operational data store that is used as a source system for the data warehouse has two purposes:
1. The original operational purpose
2. As a source system for the data warehouse
(ETL) infrastructure • Facilitates the retrieval of data from operational databases into the data warehouses
Extraction-transformation-load • ETL includes the following tasks:
- Extracting analytically useful data from the operational data sources
- Transforming such data so that it conforms to the structure of the subject-oriented target data warehouse model
- Loading the transformed and quality assured data into the target data warehouse
Data Warehouse • Referred to as the target system, to indicate the fact that it is a destination for the data from the source systems
• Typical data warehouse periodically retrieves selected analytically useful data from the operational data
sources
Front-End apps • Used to provide access to the data warehouse for users who are engaging in indirect use.
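A toy Python sketch of the three ETL tasks described above, moving rows from an operational source into a warehouse fact table (both databases, the tables, and the column names are placeholders invented for illustration):

import sqlite3

source = sqlite3.connect(":memory:")           # stand-in for an operational source system
source.execute("CREATE TABLE orders (order_id INT, order_date TEXT, amount REAL)")
source.execute("INSERT INTO orders VALUES (1, '2024-01-05', 19.9), (2, '2023-12-30', 5.0)")

target = sqlite3.connect(":memory:")           # stand-in for the target data warehouse
target.execute("CREATE TABLE sales_fact (order_id INT, date_key TEXT, amount REAL)")

# Extract: pull only the analytically useful rows and fields
rows = source.execute("SELECT order_id, order_date, amount FROM orders "
                      "WHERE order_date >= '2024-01-01'").fetchall()

# Transform: conform the data to the warehouse model (here, derive a date key)
transformed = [(oid, d.replace("-", ""), amt) for oid, d, amt in rows]

# Load: write the transformed rows into the warehouse fact table
target.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)
target.commit()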

Data Marts
• Departmental small-scale “DW” that stores only limited/relevant data.
• Provides decision support to a small group of people.
• Could be created from data extracted from a larger data warehouse for the specific purpose of supporting faster data access to
a target group or function.
Data Warehousing Modelling
ER • Predominant technique for visualizing database requirements,
modeling • used extensively for Conceptual modeling of operational databases

Relational • Standard method for Logical modeling of operational databases


modeling • Both of these techniques can also be used during the development of data warehouses and data marts

Dimensional • Modeling technique tailored specifically for modeling data warehouses and data marts
modeling • Used for designing Subject-oriented analytical database.
• Logical design technique aims to present the data in a standard, intuitive form that allows for high-performance access.
• In addition to using the regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) dimensional
modeling distinguishes two types of tables:

Fact Table • One table with a composite primary key , Fields Holding Foreign Keys to dimensions’ tables
• Fields holding the Measures related to the subject of analysis are typically numeric and are intended
for mathematical computation and quantitative analysis
Dimensions • Each dimension table has a non-composite primary key that corresponds exactly to one of the
components of the composite key in the fact table.
• A set of smaller tables that contain descriptions of the business
• Concepts that managers use to analyze the features of interest (time, location, customer, product).
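A minimal star-schema sketch (SQLite via Python; the table and column names are invented): one fact table whose composite key is built from foreign keys to two dimension tables, plus a typical aggregation query over the subject of analysis:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sales_fact  (                        -- composite key of dimension keys
    time_key    INTEGER REFERENCES time_dim,
    product_key INTEGER REFERENCES product_dim,
    units_sold  INTEGER,                          -- measures: numeric, for computation
    revenue     REAL,
    PRIMARY KEY (time_key, product_key)
);
INSERT INTO time_dim    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO product_dim VALUES (10, 'Notebook', 'Stationery'), (11, 'Pen', 'Stationery');
INSERT INTO sales_fact  VALUES (1, 10, 5, 50.0), (2, 10, 3, 30.0), (2, 11, 7, 7.0);
""")

# Analyze the subject (sales) by the time dimension
for row in db.execute("""
    SELECT t.year, t.month, SUM(f.revenue)
    FROM sales_fact f JOIN time_dim t ON f.time_key = t.time_key
    GROUP BY t.year, t.month"""):
    print(row)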

Data Warehouse Schemas


• Various conceptual data models or Schemas can be adopted to design a data warehouse.
• The result of dimensional modelling is a dimensional schema containing facts and dimensions
Star schema • The chosen subject of analysis is represented by a fact table at the center of the schema
(Most Used) - Dimension tables surround the fact table; consider which dimensions to use with the fact table representing the chosen subject.
• For every dimension under consideration, two questions must be answered:
- Question 1: Can the dimension table be useful for the analysis of the chosen subject?
- Question 2: Can the dimension table be created based on the
existing data sources?
• Characteristics of dimensions and facts
- Dimension tables have orders of magnitude fewer records than fact tables
- A typical dimension contains relatively static data
- In a typical fact table, records are added continually, and the table rapidly grows in size
Snowflake • Variant of star schema where some dimensional hierarchy is normalized to smaller dimension tables.
schema - Cube dimension tables are normalized
- It essentially creates more tables and primary–foreign key
relationships,
- which may have a negative impact on report generation due to
the many joins that need to be evaluated.
• Considered if the dimension tables grow too large and more
efficient usage of storage capacity is required.
- Since the latter dimension tables are now smaller compared to the corresponding (unnormalized) star schema.
- They can be more easily stored in internal memory
Fact • Multiple fact tables share dimension tables, viewed as a collection of stars
Constellation - The two fact tables Sales Fact Table and Shipping Fact Table share
the tables Time dimension, Product dimension, and Store dimension.
(galaxy)
• Collection of star schemas, or a Galaxy schema.
- Quicker development of analytical databases for multiple subjects
of analysis,
- because dimensions are re-used instead of duplicated
- Straightforward cross-fact analysis
Creating A Data Warehouse
• Involves using the functionalities of database management software to implement the data warehouse model as a collection of
physically created and mutually connected database tables
• Most often, data warehouses are modeled as relational databases
Data Origin • Historical data covering the period of interest for the analysis - Extracted from transactional databases
• Only data related to the subject of analysis are extracted - Only the relevant tables are extracted
• Only the relevant fields are preserved in each table.
Stored Data • Aggregations computed from the detailed set of original data.

Best Data Warehouse Software


• Teradata • Amazon Redshift
• Cloudera • Microsoft’s Azure Synapse Analytics
• Snowflake • Oracle Autonomous Data Warehouse
• Google BigQuery • IBM Data Warehouse Tools
• Teradata Vantage • SAP Data Warehouse Cloud
Online analytical processing (OLAP)
• Querying and presenting data from data warehouses and/or
data marts for analytical purposes.
• Designed for analysis of dimensionally modeled data
OLAP/BI Tools (many are web-based)
• Allow users to query fact and dimension tables by using simple point-
and-click query-building applications
• The tool writes and executes the code in the language of the data
management system (SQL) that hosts the data warehouse or data mart
that is being queried
• Require dimensional organization of underlying data for performing
basic OLAP operations (slice, pivot, drill)
- Graphically visualizing the answers
- Creating and examining calculated data
- Determining comparative or relative differences
- Performing trend analysis, forecasting, and regression analysis
- Ad-hoc direct analysis of dimensionally modeled data
- Creation of front-end (BI) applications

OLAP Operations
Slice • Select and project on one dimension. Subset of a multidimensional array
• Operations on a simple three-dimensional data cube
Dice • Select and project on two or more dimensions. Slice on more than two dimensions
Slice and Dice • Adds, replaces, or eliminates specified dimension attributes
• (or particular values of the dimension attributes) from the already displayed result
Roll-up • Increases the level of aggregation to view data at a higher level of summarization (removing one or more dimensions)
• Computing all of the data relationships for one or more dimensions
Drill-down • Decreases the level of aggregation to view data in more detail (adding one or more dimensions)
• Navigating among levels of data ranging from the most summarized (up) to the most detailed (down)
• Makes the granularity of the data in the query result finer
Pivot • Re-orient the multidimensional view of data. Allows to rotate a 2D/3D cube’s axis.
• Used to change the dimensional orientation of a report or an ad hoc query-page display
• Reorganizes the values displayed in the original query result by moving values of a dimension column
from one axis to another
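A rough illustration of slice, dice, roll-up/drill-down, and pivot on a small dimensional data set using pandas (assumes pandas is installed; the data is invented):

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "product": ["Pen", "Pen", "Book", "Book"],
    "revenue": [10, 20, 30, 40],
})

slice_ = sales[sales["year"] == 2024]                                    # slice: fix one dimension
dice   = sales[(sales["year"] == 2024) & (sales["region"] == "East")]    # dice: fix two dimensions
rollup = sales.groupby("year")["revenue"].sum()                          # roll-up: drop region/product
drill  = sales.groupby(["year", "region"])["revenue"].sum()              # drill-down: finer grouping
pivot  = sales.pivot_table(values="revenue", index="region",             # pivot: rotate the axes
                           columns="year", aggfunc="sum")
print(pivot)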
Cube Storage
• When designing SQL Server OLAP Services cubes, the two most important decisions are:
- The choice of the level of aggregation
- The choice of storage mode

Aggregation
• In OLAP Services, aggregations are pre-calculated sums of fact table data at some combination of levels from each dimension.
• Used to answer queries.
• Pre-calculating all possible aggregates would greatly increase the storage requirements for a database.
• At the other extreme, calculating all the aggregates at query time would result in slow query response time.
• When choosing the amount of aggregation to include, there is a tradeoff between the storage space and query response time.
OLAP Servers : Storage Modes
Relational OLAP • Fact data and aggregations are stored in the relational database server.
• Provides OLAP functionality by using relational databases and familiar relational query tools to store and
(ROLAP)
analyze multidimensional data.
ROLAP adds the following extensions to traditional RDBMS technology:
• Multidimensional data schema support (Star schema)
• Data access language and query performance optimized for multidimensional data
• Support for very large databases (VLDBs)
• Gives the appearance of traditional OLAP’s slicing and dicing functionality.
✓ Can handle a large amount of data and can leverage all the functionalities of relational DB.
✗ Performance is slow and each ROLAP report is an SQL query with all the limitations of the genre.
✗ It is also limited by SQL functionalities.

Multidimensional OLAP • Fact data and aggregations are stored on the OLAP server in an optimized multidimensional format.
• This is the traditional mode in OLAP analysis.
(MOLAP)
• Data is stored in form of multidimensional cubes and not in relational databases.
✓ Provides excellent query performance and the cubes are built for fast data retrieval.
✓ Calculations are pre-generated when cube is created and can be easily applied while querying data.
✗ Handles only a limited amount of data, since calculations have been pre-built when the cube was created
✗ A cube cannot be derived from a large volume of data.

Hybrid OLAP • Fact data is stored in the relational database,


• aggregations are stored on the OLAP server in an optimized multidimensional format.
(HOLAP)
✓ Combine the strengths of the above two models.
✓ HOLAP leverages MOLAP model and for drilling down into details it uses the ROLAP model.
Comparing the use of MOLAP, HOLAP and ROLAP
• The type of storage medium impacts on:
1. cube processing time
2. cube storage
3. cube browsing speed.
• Cube browsing is the fastest when using MOLAP. This is so even in cases where no aggregations have been done.
• The data is stored in a compressed multidimensional format and can be accessed more quickly than in the relational DB.
• Processing time is slower in ROLAP, especially at higher levels of aggregation.
• Browsing is very slow in ROLAP and about the same as MOLAP in HOLAP.
• MOLAP storage takes up more space than HOLAP as data is copied, and at very low levels of aggregation it takes more room than ROLAP.
• ROLAP takes almost no storage space as data is not duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP aggregations.
• All data is stored in the cube in MOLAP and data can be viewed even when the original data source is not available.
• MOLAP can handle very limited data only as all data is stored in cube
• In ROLAP data cannot be viewed unless connected to the data source.
Big Data, Hadoop, & Map Reduce
Big Data
• Collection of data sets so large and complex, it’s impossible to process them with the usual databases and tools.
• Because of its size it is hard to: capture, store, delete (privacy), search, share, analyze and visualize.
Volume • Significantly larger amounts of data that are being collected due to an increased number of data sources.

Velocity • Data in motion rather than data "at rest" as found in a traditional database.
• Constantly being transmitted through computer networks, sensors, mobile devices, and satellites.
• Processing big data involves the analysis of streaming data that arrives at a fast rate.
Variety • Different forms of highly heterogeneous data: textual, video, XML, or sensor data.

Veracity • Accuracy of the data being analyzed. Data can be corrupted in transmission, or can potentially be lost.
• Before analyzing any large data set, steps must be taken to improve the trustworthiness of the data.
Value • Advantages that the successful analysis of big data can bring to businesses, organizations, and research labs.

Hadoop
• A system for the efficient storage and retrieval of large data files, based on a simple data model into which any data will fit.
Distributed file system • The Hadoop storage framework
(HDFS) - Single namespace for entire cluster
- Replicates data 3x for fault-tolerance
MapReduce • Parallel Programming Paradigm
implementation - Executes user jobs specified as “map” and “reduce” functions
- Manages work distribution & fault-tolerance

Scalable • It can reliably store and process petabytes.

Economical • It distributes the data and processing across clusters of commonly available computers (in thousands).

Efficient • By distributing the data, it can process it in parallel on the nodes where the data is located

Reliable • Automatically maintains multiple copies of data and redeploys computing tasks based on failures.

1. Hadoop Distributed File System


Advantage • Capable of representing data sets that are too large to fit within the storage capacity of a single machine by
distributing data across a network of machines.
• Provides fault tolerance since it is capable of dividing a file into subcomponents, known as blocks, and then
replicating copies of each block.
• If one block is unavailable, HDFS can access another copy of the same block on a different machine.
• Distributing the blocks of a file across multiple computers also supports parallel processing, thus leading to faster
computational capabilities.
Disadvantage • Complex to manage all of the file components and metadata that are needed to store and retrieve data from
distributed locations.
• HDFS has Master/Slave architecture
NameNode • Master server that manages the file system namespace and regulates access to files by clients.
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode failure
DataNode • Usually one per node in the cluster, which Manage storage attached to the nodes that they run on.
• Stores the actual data in HDFS - Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies the NameNode of what blocks it has; the NameNode replicates blocks 2x in the local rack and 1x elsewhere
Hadoop Cluster
• 40 nodes/rack, 1000-4000 nodes in a cluster; 1 Gbps bandwidth within a rack, 8 Gbps out of a rack
• Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
• Node is a single computer → Rack is a collection of about 30–40 nodes on the same network switch. → Cluster is a collection of racks.
• Data set in HDFS is divided into blocks → Block is the basic unit of storage. Blocks default to a size of 64 megabytes.
• HDFS will distribute and replicate the blocks → By default, Each block has 3x replicas that are spread across two racks.
• Files are BIG (100s of GB to TB) → Typical usage pattern: append-only; data are rarely updated; reads are common
• Files split into 64-128 MB blocks (called chunks)
- Blocks are replicated (usually 3 times) across several datanodes (called chunk or slave nodes); chunk nodes are compute nodes too
- Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
- A single namenode (master node) stores metadata (file names, block locations, …); may be replicated also
• Client library for file access
- Talks to the master to find chunk servers, then connects directly to chunk servers to access data, so the master node is not a bottleneck
- Computation is done at the chunk node (close to the data)
Goals of HDFS
Very Large Distributed File System • 10K nodes, 100 million files, 10 PB
Convenient Cluster Management • Load balancing - Node failures - Cluster expansion
Optimized for Batch Processing • Allow move computation to data - Maximize throughput

2. MapReduce Framework [introduced by Google]→ Code usually written in Java. Two fundamental components:
Map • Filter and/or transform the input data into a more appropriate form in preparation for the reduce step.
• Master node takes large problem and slices it into smaller sub problems; distributes these to worker nodes.
• Worker node may do this again if necessary.
• Worker processes smaller problem and hands back to master.
Reduce • Perform calculation or aggregation over the data that it receives from the map step to achieve the results.
• Master node takes the answers to the sub problems and combines them in a predefined way to get the output/answer to
original problem.
• Used to do a programming job that requires reading the entire file to perform computation.
• The computation is designed to operate in parallel on the individual blocks of the file and then merge the results.
• Programming model for expressing distributed computations at a massive scale
Per cluster node:
Single JobTracker per master • Responsible for scheduling the jobs' component tasks on the slaves and monitoring slave progress
• Runs on the master node, re-executes failed tasks, accepts job requests from clients
Single TaskTracker per slave • Executes the task as directed by the master
• Runs on slave nodes; forks a separate Java process for task instances

Ex : Counting words given a large collection of documents, output the frequency for each unique word.
1. When you put this data into HDFS, Hadoop automatically splits into blocks and replicates each block.
2. Input reader reads a block and divides into splits.
3. Each split would be sent to a map function. E.g., a line is an input of a map function.
4. The key could be some internal number, the value is the content of the textual line.
5. Mapper takes the output generated by input reader and output a list of intermediate <key, value> pairs.
6. There is shuffle/sort before reducing.
7. Reducer takes the output generated by the Mapper, aggregates the value for each key, and outputs the final result.
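A minimal in-process Python sketch of the map → shuffle/sort → reduce flow for the word-count example described above (not Hadoop code, just the same dataflow on two sample lines):

from collections import defaultdict

documents = ["big data is big", "hadoop stores big data"]

# Map: emit an intermediate (word, 1) pair for every word in every line
intermediate = []
for line in documents:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle/sort: group all values for the same key together
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce: aggregate the values for each key
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'stores': 1}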
Execution Details
Input reader • Divide input into splits, assign each split to a Map task, Input reader reads a block and divides into splits.
• Each split would be sent to a map function. E.g., a line is an input of a map function.
• The key could be some internal number, the value is the content of the textual line.
Map task • Apply the Map function to each record in the split, Each Map function returns a list of (key, value) pairs
Shuffle/Partition and • Shuffle distributes sorting & aggregation to many reducers
Sort • All records for key k are directed to the same reduce processor
• Sort groups the same keys together, & prepares for aggregation
Reduce task • Apply the Reduce function to each key, The result of the Reduce function is a list of (key, value) pairs

Properties of MapReduce Engine


• Job Tracker is the master node (runs with the namenode)
- Receives the user’s job
- Decides on how many tasks will run (number of mappers)
- Decides on where to run each mapper (concept of locality)
• Task Tracker is the slave node (runs on each datanode)
- Receives the task from Job Tracker
- Runs the task until completion (either map or reduce task)
- Always in communication with the Job Tracker reporting progress

Limitation
• Have to use M/R model
• Not Reusable
• Error prone
• For complex jobs → Multiple stage of Map/Reduce functions

Hadoop Software Products


❖ The Apache Hadoop open source software project provides free downloads of modules including Hadoop Common,
- HDFS, YARN for job scheduling,
- Hadoop MapReduce.
- The download includes Hadoop-related tools such as Hive, HBase, Pig, and the Mahout data mining library.
❖ Database vendors have also begun to provide software products based on Apache Hadoop.
❖ Many of these vendors also provide free virtual machines (VMs) for learning the basics of the Hadoop framework.
❖ IBM, for example, provides the InfoSphere BigInsights Quick Start.
❖ Other virtual machines for the Hadoop framework are provided by vendors such as Cloudera, Hortonworks, Oracle, and Microsoft.
