Large Scale Database-1
What is a Transaction?
• Logical unit of work that must be either entirely completed or aborted
- Successful transaction changes database from one consistent state to another
- One in which all data integrity constraints are satisfied
• Most real-world database transactions are formed by two or more database requests
- A database request is the equivalent of a single SQL statement in an application program or transaction
Transaction Results
• Not all transactions update the database: even read-only SQL code represents a transaction, because the database was accessed
• Improper or incomplete transactions can have devastating effect on database integrity
- Some DBMSs provide means by which user can define enforceable constraints
- Other integrity rules are enforced automatically by the DBMS
Transaction Properties (ACID)
• Multiuser databases subject to multiple concurrent transactions
Atomicity • All operations of a transaction must be completed
• All of the work in a transaction completes (commit) or none of it completes
Consistency • Permanence of the database’s consistent state
• Transforms the database from one consistent state to another consistent state, in terms of its constraints.
Isolation • Data used during transaction cannot be used by second transaction until the first is completed
• The results of any changes made during a transaction are not visible until the transaction has committed.
Durability • Once transactions are committed, they cannot be undone
• The results of a committed transaction survive failures
Serializability • Concurrent execution of several transactions yields consistent results
Transaction Management (SQL)
• ANSI has defined standards that govern SQL database transactions; transaction support is provided by two SQL statements: COMMIT and ROLLBACK
• Transaction sequence must continue until:
- COMMIT statement is reached
- ROLLBACK statement is reached
- End of program is reached
- Program is abnormally terminated
Transaction Log
• A record for the beginning of the transaction and, for each transaction component:
- Type of operation being performed (update, delete, insert)
- Names of objects affected by transaction
- “Before” and “after” values for updated fields
- Pointers to previous and next transaction log entries for the same transaction.
• Ending (COMMIT) of the transaction
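A hedged sketch of what a single log entry might contain, following the fields listed above (the field names and values are illustrative, not a real DBMS log format):
{
  txnId: 101, prevEntry: 17, nextEntry: 19,   // pointers linking entries of the same transaction
  op: "UPDATE", object: "ACCOUNTS.BALANCE",   // operation type and affected object
  before: 500, after: 400                     // "before" and "after" values enable UNDO and REDO
}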
Transaction Concurrency Control
• Coordination of simultaneous transaction execution in a multiprocessing database
• Objective is to ensure serializability of transactions in a multiuser environment
Lost Updates problem • Two concurrent transactions update the same data element
- One of the updates is lost → overwritten by the other transaction
Uncommitted Data phenomenon • Two transactions are executed concurrently
- The first transaction is rolled back after the second has already accessed the uncommitted data
Inconsistent Retrievals problem • First transaction accesses data - Second transaction alters the data
- First transaction accesses the data again; it might read some data before they are changed and other data after they are changed, which yields inconsistent results.
The Scheduler
• Special DBMS program
- Establishes the order in which the operations of concurrent transactions are executed
• Interleaves execution of database operations
- Ensures serializability, yields same results as serial execution
- Ensures isolation
Concurrency Control (Locking Methods)
Lock • Guarantees exclusive use of a data item to a current transaction
• Required to prevent another transaction from reading inconsistent data
Lock manager • Responsible for assigning and policing the locks used by transactions
Lock Granularity (Indicates level of lock use)
Database-level lock • Entire database is locked
Table-level lock • Entire table is locked
Page-level lock • Entire disk page is locked
Row-level lock • Allows concurrent transactions to access different rows of same table
• Even if rows are located on same page
Field-level lock • Allows concurrent transactions to access same row
• Requires use of different fields (attributes) within the row
Lock Types
Binary lock • Two states: locked (1) or unlocked (0)
Exclusive lock • Access is specifically reserved for transaction that locked object
• Must be used when potential for conflict exists
Shared lock • Concurrent transactions are granted read access on basis of a common lock
Wound/Wait • Older transaction wounds (preempts) the younger transaction; the younger transaction rolls back and is rescheduled
Distributed Databases
Transparency: Refers to hiding implementation details from end users for a seamless experience.
• Data organization transparency → Location - Naming • Design transparency
• Fragmentation transparency → Horizontal - Vertical • Execution transparency
• Replication transparency
Availability • The probability that the distributed system is continuously available during a specified time interval.
Reliability • The probability that the distributed system is running (not down) at a certain point in time.
Horizontal scalability • Expanding the number of nodes
Vertical scalability • Expanding capacity of individual nodes
Partition tolerance • System continues operating during network partitioning
Autonomy : Determines the extent to which nodes can operate independently
Design autonomy • independence of data model and transaction management techniques
Communication autonomy • determines sharing of information between nodes
Execution autonomy • independence of user actions
Advantages of Distributed Databases
• Improved ease and flexibility of application development: Development can occur at geographically dispersed sites.
• Improved performance: Data localization reduces network traffic and latency.
• Increased availability: Faults are isolated to their site of origin, ensuring availability.
• Easier expansion via scalability: adding nodes or increasing the capacity of individual nodes is easier than in non-distributed systems.
Data Fragments • Logical units of the database that are distributed across nodes.
Horizontal fragmentation (sharding): • Divides a relation into subsets of tuples, called shards, by specifying conditions on attributes or
other methods. It groups rows to create these subsets.
Vertical fragmentation • Dividing relation by columns, keeping only certain attributes of the relation.
Complete horizontal fragmentation • Apply UNION operation to the fragments to reconstruct relation
Complete vertical fragmentation • Apply OUTER JOIN operation to reconstruct relation
Mixed fragmentation • Combination of horizontal and vertical fragmentations
Replication and allocation • Strategies for storing fragments and replicas across nodes for performance and availability.
Fully replicated distributed DB • Whole database is replicated at every site.
Nonredundant allocation • Each fragment is stored at exactly one site.
Partial replication • Some fragments are replicated while others are not.
Key-Value Stores: Pros and Cons
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance
Cons:
• Can't model more complex data structures such as objects
Document-based
• Can model more complex objects
- Data model: collection of documents, i.e. semi-structured data
• Unlike key-value stores, documents have some structure:
- Definition of allowable structures and types
- More flexibility when accessing data: documents are indexed, allowing quick retrieval
• Document format: JSON (JavaScript Object Notation), a data model of key-value pairs that supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with nesting; other semi-structured formats are also used.
• Examples: MongoDB, Apache SOLR, Elasticsearch
What is MongoDB?
- Very powerful and mature NoSQL database - Auto sharding – clustering & data partitioning
- Scalable, high-performance, open-source - Indexing and powerful querying
- JSON-style document storage, schemaless - Map-Reduce – parallel data processing
- Replication & high-availability support - GridFS – store files of any size
What is Apache CouchDB?
- Open-source NoSQL database - On-the-fly document transformation
- Document-based database server: stored JSON documents - Real-time change notifications
- HTTP-based API - Highly available and partition tolerant
- Query, combine, and transform documents with JavaScript
• Ad-hoc and schema-free with a flat address space.
- The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties.
• Document updates (add, edit, delete) are serialized, except for binary blobs which are written concurrently.
- CouchDB read operations use a Multi-Version Concurrency Control (MVCC) model where each client sees a consistent
snapshot of the database from the beginning to the end of the read operation.
• Eventually Consistent
Column-based
• Based on Google’s BigTable paper
- Tables similar to those in an RDBMS, but they handle semi-structured data
• Data model:
- Collection of Column Families: (key, value) where value = set of related columns (standard, super)
- indexed by row key, column key and timestamp
• One column family can have variable numbers of columns
- Cells within a column family are sorted “physically”
• Query on multiple tables
- RDBMS: must fetch data from several places on disk and glue together
- Column-based NOSQL: only fetch column families of those columns that are required by a query
(all columns in a column family are stored together on the disk, so multiple rows can be retrieved in one read operation → data locality)
- Ex: BigTable, Cassandra, HBase
Graph-based
• Focus on modeling the structure of data (interconnectivity)
- Scales to the complexity of data
• Data model: (Property Graph) nodes and edges
- Nodes may have properties (including ID)
- Edges may have labels or roles
- Key-value pairs on both
• Interfaces and query languages vary
- Ex: Neo4j, FlockDB, Pregel, InfoGrid
What is Neo4j ?
- Open source - Java-based - Stores data as nodes and relationships
- Neo4j is a NoSQL Graph Database - Full ACID transactions
- Database full of linked nodes - Schema free, bottom-up data model design
• Nodes represent entities, Edges represent relationships
- Connections between data are explored
- Faster for associative data sets
- Intuitive
• Optimal for searching social network data
- Strengths: powerful data model; for connected data it can be many orders of magnitude faster than an RDBMS
- Weaknesses: sharding (though graph databases can scale reasonably well, and for some domains you can shard too)
Schemaless Databases
• Easily store whatever you need, and change your data storage as you learn more about your project
• Easily add new things as you discover them
• Easier to deal with non-uniform data: data where each record has a different set of fields (limiting sparse data storage)
Conclusion
• NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
• NoSQL is an approach to data management that is useful for very large sets of distributed data
• The name NoSQL should not be misleading: the approach does not prohibit Structured Query Language (SQL)
• Indeed, such systems are commonly referred to as “Not Only SQL”
Example: (MongoDB) document
{
  Name: "Jaroslav",
  Address: "Malostranske nám. 25, 118 00 Praha 1",
  Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
  Phones: [ "123-456-7890", "234-567-8963" ]
}
Document-Based : MongoDB
• Document-oriented NOSQL systems typically store data as collections of similar documents.
- Although the documents in a collection should be similar, they can have different data elements (attributes), and new
documents can have new data elements that do not exist in any of the current documents in the collection.
- The system basically extracts the data element names from the self-describing documents in the collection.
• A popular language to specify documents in NOSQL systems is JSON (JavaScript Object Notation).
- A document database is a type of NoSQL database which stores data as JSON documents instead of columns and rows.
- JSON is a native language used to both store and query data.
- These documents can be grouped together into collections to form database systems.
Example of a document that consists of 4 key-value pairs:
{
  "ID": "001",
  "Book": "Java: The Complete Reference",
  "Genre": "Reference work",
  "Author": "Herbert Schildt"
}
• Using JSON enables developers to store and query data in the same document-model format.
• The model can be converted into other formats, such as BSON and XML.
Benefits of Document Databases
- Everything is available in a single database, rather than having information spread across several linked databases.
- Better performance compared to an SQL database.
- Unlike in conventional databases, where a field exists for each piece of information even if there is nothing in it, a document store is more flexible.
- Since it is more flexible than a relational database, a new type of information does not have to be added to all datasets.
Conventional Table Vs. Document-based Database
- Document database is a type of non-relational database that is designed to store and query data as JSON-like documents.
- The document model works well with use cases such as catalogs, user profiles, and content management systems where each
document is unique and evolves over time.
- Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.
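As a hedged illustration of such flexible indexing (MongoDB syntax; the catalog collection and Author field are assumptions for this sketch), a secondary index is what makes ad hoc queries like this fast:
db.catalog.createIndex({ Author: 1 })            // build a single-field index on Author
db.catalog.find({ Author: "Herbert Schildt" })   // this query can now use the index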
Document Databases Use Cases→ Book Database
• The relational approach would represent the relationship between books and authors via tables with IDs (PK & FK).
By comparison, the document model shows relationships more naturally and simply by ensuring that each author document has
a property called Books, with an array of related book documents in the property. When you search for an author, the entire book
collection appears.
• Content Management: Developers use document databases to create video streaming platforms, blogs, and similar services.
Each file is stored as a single document and the database is easier to maintain. Significant data modifications, such as data
model changes, require no downtime.
• Catalogs"may have thousands of attributes stored. In document databases, attributes related to a single product are stored in a
single document. Modifying one product's attributes does not affect other documents.
Best Document Databases
Amazon DocumentDB • MongoDB-compatible, Fully managed - High performance with low latency querying
• Strong compliance and security, and High availability - Amazon’s entire development team
• The BBC uses it for querying and storing data from multiple data streams
• Rappi switched to Amazon DocumentDB to reduce coding time.
MongoDB • Ad hoc queries - Optimised indexing for querying, Sharding, Load-balancing
• Forbes decreased build time by 58%, gaining a 28% increase in subscriptions due to quicker
building of new features,
• Toyota found it much simpler for developers to work at high speeds by using natural JSON
documents. More time is spent on building the business value instead of data modelling.
Cosmos DB • Fast reads at any scale - 99.999% availability - Serverless; scales cost-effectively and instantly
• Coca-Cola gets insights delivered in minutes, facilitating global scaling.
• Before migrating to Cosmos DB, it took hours.
• ASOS needed a distributed database that flexibly and seamlessly scales to handle over 100
million global retail customers.
MongoDB
• A database resides on a MongoDB server
• A MongoDB database consists of collections of documents
• Schema-free (documents in a collection may be heterogeneous)
• Main abstraction and data structure is a document
Example document:
{
  title: "MongoDB",
  "last editor": "172.5.123.91",
  "last modified": new Date("9/23/2010"),
  body: "MongoDB is a ...",
  categories: ["Database", "NoSQL", "Document Database"],
  reviewed: false
}
MongoDB Data Model
• Individual documents are stored in a collection; the collection has no schema
• Documents in a collection should be similar, but they can have different data elements (attributes)
• Self-describing data - Users can request to create indexes on some of the data elements
• Documents are stored in BSON (binary JSON) format, more efficient than JSON
Schema Free
• MongoDB does not need any pre-defined data schema; every document in a collection could have different data
• Addresses NULL data fields
JSON format
• Data is in name/value pairs; a pair consists of a field name followed by a colon, followed by a value:
- Example: "name": "R2-D2"
• Data is separated by commas → Example: "name": "R2-D2", "race": "Droid"
• Curly braces hold objects → Example: {"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}
• An array is stored in brackets [ ] → [{"name": "R2-D2", "race": "Droid", "affiliation": "rebels"}, {"name": "Yoda", "affiliation": "rebels"}]
MongoDB CRUD Operations
• Documents created and inserted using the insert operation:
- db.<collection_name>.insert(<document(s)>)
• The delete operation is called remove, and the format is:
- db.<collection_name>.remove(<condition>)
• The update operation has a condition to select certain documents and a $set clause to specify the update:
- db.<collection_name>.update(<condition>, { $set: <update> })
• The read operation is called find, and the format is:
- db.<collection_name>.find(<condition>)
CRUD operations
Create db.collection.insert( <document> )
db.collection.save( <document> )
db.collection.update( <query>, <update>, { upsert: true } )
Read db.collection.find( <query>, <projection> )
db.collection.findOne( <query>, <projection> )
Update db.collection.update( <query>, <update>, <options> )
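A minimal round trip through these operations (a sketch; the people collection and its fields are illustrative, not from a specific example in these notes):
db.people.insert({ name: "R2-D2", race: "Droid", affiliation: "rebels" })   // create
db.people.find({ affiliation: "rebels" })                                   // read
db.people.update({ name: "R2-D2" }, { $set: { active: true } })             // update one field
db.people.remove({ name: "R2-D2" })                                         // delete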
Query Operators
Name Description
$eq Matches values that are equal to a specified value
$gt, $gte Matches values that are greater than (or equal to) a specified value
$lt, $lte Matches values that are less than (or equal to) a specified value
$ne Matches values that are not equal to a specified value
$in Matches any of the values specified in an array
$nin Matches none of the values specified in an array
$or Joins query clauses with a logical OR; returns documents matching either clause
$and Joins query clauses with a logical AND
$not Inverts the effect of a query expression
$nor Joins query clauses with a logical NOR
$exists Matches documents that have a specified field
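Hedged examples of these operators, written against the inventory collection populated later in this section:
db.inventory.find({ qty: { $gt: 50 } })                               // qty greater than 50
db.inventory.find({ status: { $in: ["A", "D"] } })                    // status is any listed value
db.inventory.find({ $or: [{ status: "A" }, { qty: { $lt: 30 } }] })   // either clause matches
db.inventory.find({ size: { $exists: true } })                        // documents having a size field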
MongoDB Example
• Create a collection named mycoll with 10,000,000 bytes of preallocated disk space and no automatically
generated and indexed document field
- db.createCollection("mycoll", { size: 10000000, autoIndexId: false })
• Add a document into mycoll
- db.mycoll.insert({ title: "MongoDB", "last editor": ... })
• Retrieve a document from mycoll
- db.mycoll.find({ categories: ["NoSQL", "Document Databases"] })
To populate the inventory collection
• Examples of query operations using the db.collection.find() method in the mongo shell.
• The examples use the inventory collection.
db.inventory.insertMany([
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
{ item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
{ item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]);
Query Documents
• To select all documents in the collection: db.inventory.find({})
• Selects from the inventory collection all documents where the status equals "D": db.inventory.find( { status: "D" } )
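Conditions can also be combined; listing several fields in one filter document is an implicit AND:
db.inventory.find({ status: "A", qty: { $lt: 30 } })   // status = "A" AND qty < 30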
MongoDB Distributed Systems Characteristics
• Most MongoDB updates are atomic if they refer to a single document, but MongoDB also provides a pattern for
specifying transactions on multiple documents.
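A minimal sketch of such a multi-document transaction in the shell (requires a replica set; the bank database, accounts collection, and balance values are assumptions for illustration):
const session = db.getMongo().startSession();
const accounts = session.getDatabase("bank").accounts;
session.startTransaction();
try {
  accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } });
  accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } });
  session.commitTransaction();   // both updates become visible together
} catch (e) {
  session.abortTransaction();    // neither update is applied
}
session.endSession();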
• Replication in MongoDB. The concept of replica set is used in MongoDB to create multiple (3) copies of the same
data set on different nodes in the distributed system.
• All write operations must be applied to the primary copy and then propagated to the secondaries.
• For read operations, the user can choose.
• The default read preference processes all reads at the primary copy, so all read and write operations are performed
at the primary node.
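For example, a read can be routed away from the primary (a sketch; the collection name is illustrative):
db.inventory.find({ status: "A" }).readPref("secondaryPreferred")   // prefer a secondary copy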
• Sharding in MongoDB. Sharding of the documents in the collection—also known as horizontal partitioning—
divides the documents into disjoint partitions known as shards.
• This allows the system to add more nodes as needed and to store the shards of the collection on different nodes to
achieve load balancing.
• Sharding and replication are used together; sharding focuses on improving performance via load balancing and
horizontal scalability, whereas replication focuses on ensuring system availability when certain nodes fail in the
distributed system.
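A hedged sketch of enabling sharding from a mongos router (the database, collection, and shard key choice are assumptions):
sh.enableSharding("mydb")                                 // allow sharding for the database
sh.shardCollection("mydb.inventory", { item: "hashed" })  // hashed shard key spreads load
sh.status()                                               // inspect how shards are distributed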
Create Collection
• db.createCollection("worker", { capped: true, size: 5242880, max: 2000 })
• db.createCollection("project", { capped: true, size: 1310720, max: 500 })
find
• db.collection_name.find()
• db.collection_name.findOne({query}, {projection})
• db.collection_name.find().sort({name: 1}) (1 = ascending, -1 = descending)
• db.collection_name.find({parts: "hammer"}).limit(5)
update
• db.collection_name.update({query}, {update}, {upsert: true})
• db.collection_name.updateMany({query},{update})
• db.collection_name.updateOne({name: "marwan"}, {$set: {fulltime: true}})
Delete
• db.collection_name.deleteOne({query})
• db.collection_name.deleteMany({query})
• db.collection_name.deleteOne({name: "marawn"})
• db.collection_name.save(<document>)
• db.collection_name.findAndModify({ query: <query>, sort: <sort>, update: <update>, new: <new>, fields: <fields>, upsert: <upsert> })
• db.collection_name.remove(<query>, <justOne>)
• db.collection_name.remove({type: /^h/})
• db.collection_name.remove({}) (removes all documents in the collection)
Note:
• Db → Collections → Documents → BSON
mongosh Establish a connection
use mydb Create/switch to a database
show dbs Show all databases
show collections Show all collections
exit Exit the shell
cls Clear the screen
db.dropDatabase() Delete the current database
db.createCollection() Create a collection
db.createCollection("students", {capped: true, size: 10000, max: 100, autoIndexId: false}) Advanced create collection
db.coll_name.insertOne({}) Insert a document
db.coll_name.insertMany([{},{},{}]) Insert many documents
find({criteria})
db.coll_name.find({name: "Marwan Magdy"}) Find documents
db.coll_name.find({age: {$lt: 20}}) Less than
db.coll_name.find({age: {$gte: 20}}) Greater than or equal
db.coll_name.find({gpa: {$gte: 3, $lt: 4}}) Greater than or equal AND less than
db.coll_name.find({name: {$in: ["", ""]}}) Equal to any value in the array
db.coll_name.find({}, {_id: false, name: true}) Projection: hide _id, show name
db.coll_name.find().sort({gpa: 1}).limit(5) Ascending sort (use -1 for descending) with limited results
Logical operators
db.coll_name.find({$and: [{fullTime: true}, {age: {$gte: 20}}]})
db.coll_name.find({age: {$not: {$gte: 20}}})
updateOne({filter},{update})
db.coll_name.updateOne({name: "Marwan Magdy"}, {$set: {fullTime: true}})
db.coll_name.updateOne({name: "Marwan Magdy"}, {$unset: {fullTime: ""}}) Remove an attribute from a document
db.coll_name.updateMany({fullTime: {$exists: false}}, {$set: {fullTime: false}})
db.coll_name.updateMany({}, {$set: {fullTime: true}}) Update an attribute in all documents
deleteOne({filter})
db.coll_name.deleteOne({name: "Marwan Magdy"}) Delete one document
deleteMany({criteria})
db.coll_name.deleteMany({name: true}) Delete many documents
db.coll_name.deleteMany({name: {$exists: false}}) Delete many documents matching a criterion
Data Warehousing & Business Intelligence
Business Intelligence
• Used by decision makers to get a comprehensive knowledge of the business and to define and support their business strategies.
• Enable data-based decisions aimed at: gaining competitive advantage, improving operative performance, responding more
quickly to changes, increasing profitability, and, in general, creating added value for the company.
Basic BI Architectural Components
ETL Tools • Data extraction, transformation, and loading (ETL) tools collect, filter, integrate, and aggregate internal and
external data to be saved into a data store optimized for decision support.
Data Store • Optimized for decision support and is generally represented by a data warehouse or a data mart.
• The data is stored in structures that are optimized for data analysis and query speed .
Query and Reporting • Performs data selection and retrieval, and it is used by the data analyst to create queries that access the
database and create the required reports.
Data Monitoring and Alerting • Allows real-time monitoring of business activities, in a single integrated view.
• Alerts can be placed on a given metric; once the value of the metric goes below or above a certain baseline, the system will perform a given action, such as emailing shop floor managers.
Data Analytics • Performs data analysis and data-mining tasks using the data in the data store.
• Advises the user on how to build a reliable business data model that identifies and enhances the understanding of business situations and problems
Data Visualization • Presents data to the end user in a variety of meaningful and innovative ways.
• Helps the end user select the most appropriate presentation format, such as summary reports, maps, pie or
bar graphs, mixed graphs, and static or interactive dashboards
Sample of Business Intelligence Tools
Dashboards and Business Activity Monitoring • Web-based technologies to present key business performance indicators or information in a single integrated view, generally using graphics that are clear, concise, and easy to understand.
- Salesforce - IBM/Cognos - iDashboards - Tableau
Data Warehouses • Foundation of a BI infrastructure. Data is captured from the production system and placed in the DW.
- IBM/Cognos – Oracle - Microsoft
OLAP • Online analytical processing (OLAP) tools provide multidimensional data analysis.
- IBM/Cognos – Oracle - Microsoft
Data-Mining • Provide advanced statistical analysis to uncover problems and opportunities hidden within business data.
- MS Analytics Services - Orange
Data Visualization • Provide advanced visual analysis and techniques to Enhance understanding and create additional insight of
business data and its true meaning.
- Dundas – Tableau – QlikView - Microsoft PowerBI
Business Intelligence Benefits
1. Improved decision making
2. Integrating Architecture → Integrating umbrella for a disparate mix of IT systems within an organization.
3. Common user interface for data reporting and analysis → Provide information using a common interface for all company users.
4. Common data repository fosters single version of company data
5. Improved organizational performance → Provide advantages in many different areas, that can be reflected in added efficiency, reduced
waste, increased sales, and reduced employee and customer turnover.
Operational Vs. Analytical Information
Operational information • Information collected and used in support of day-to-day operational needs in businesses and other organizations.
Data Makeup Differences • Typical time horizon: days/months • Detailed • Current • Non-redundant
Technical Differences • Small amounts used in a process (transactional) • High frequency of access • Can be updated
Functional Differences • Used by all types of employees • For tactical purposes • Application oriented
Different functions and different data • Decision support requires historical data, which operational DBs do not maintain
• Consolidation (aggregation, summarization) of data from heterogeneous sources
Data Marts
• Departmental small-scale “DW” that stores only limited/relevant data.
• Provides decision support to a small group of people.
• Could be created from data extracted from a larger data warehouse for the specific purpose of supporting faster data access to
a target group or function.
Data Warehousing Modelling
ER modeling • Predominant technique for visualizing database requirements
• Used extensively for conceptual modeling of operational databases
Dimensional modeling • Modeling technique tailored specifically for modeling data warehouses and data marts
• Used for designing subject-oriented analytical databases
• Logical design technique aims to present the data in a standard, intuitive form that allows for high-performance access.
• In addition to using the regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) dimensional
modeling distinguishes two types of tables:
Fact Table • One table with a composite primary key; fields holding foreign keys to the dimension tables
• Fields holding the measures related to the subject of analysis, which are typically numeric and are intended for mathematical computation and quantitative analysis
Dimensions • Each dimension table has a non-composite primary key that corresponds exactly to one of the components of the composite key in the fact table.
• Set of smaller tables that contain descriptions of the business
• Concepts that managers use to analyze the market's features of interest (time, location, customer, product).
OLAP Operations
Slice • Select and project on one dimension. Subset of a multidimensional array
• Operations on a simple three-dimensional data cube
Dice • Select and project on two or more dimensions. Slice on more than two dimensions
Slice and Dice • Adds, replaces, or eliminates specified dimension attributes
• (or particular values of the dimension attributes) from the already displayed result
Roll-up • Increase the level of aggregation to view data at a higher level of summarization (removing one or more dimensions)
• Computing all of the data relationships for one or more dimensions
Drill-down • Decrease the level of aggregation to view data in more detail (adding one or more dimensions)
• Navigating among levels of data ranging from the most summarized (up) to the most detailed (down)
• Makes the granularity of the data in the query result finer
Pivot • Re-orient the multidimensional view of data. Allows to rotate a 2D/3D cube’s axis.
• Used to change the dimensional orientation of a report or an ad hoc query-page display
• Reorganizes the values displayed in the original query result by moving values of a dimension column
from one axis to another
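OLAP servers have their own query interfaces, but as a hedged analogy in the MongoDB shell used elsewhere in these notes, a roll-up that removes the store dimension from a hypothetical sales fact collection (all names are assumptions) looks like:
db.sales.aggregate([
  { $group: { _id: "$product", total: { $sum: "$amount" } } }   // summarize away everything but product
])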
Cube Storage
• When designing SQL Server OLAP services cubes, the most two important decisions are:
- The choice of the level of aggregation
- The choice of storage mode
Aggregation
• In OLAP Services, aggregations are pre-calculated sums of fact table data at some combination of levels from each dimension.
• Used to answer queries.
• Pre-calculating all possible aggregates would greatly increase the storage requirements for a database.
• At the other extreme, calculating all the aggregates at query time would result in slow query response time.
• When choosing the amount of aggregation to include, there is a tradeoff between the storage space and query response time.
OLAP Servers : Storage Modes
Relational OLAP (ROLAP) • Fact data and aggregations are stored in the relational database server.
• Provides OLAP functionality by using relational databases and familiar relational query tools to store and analyze multidimensional data.
ROLAP adds the following extensions to traditional RDBMS technology:
• Multidimensional data schema support (Star schema)
• Data access language and query performance optimized for multidimensional data
• Support for very large databases (VLDBs)
• Gives the appearance of traditional OLAP’s slicing and dicing functionality.
✓ Can handle a large amount of data and can leverage all the functionalities of the relational database.
✗ Performance is slow: each ROLAP report is an SQL query, with all the limitations of the genre.
✗ It is also limited by SQL functionalities.
Multidimensional OLAP (MOLAP) • Fact data and aggregations are stored on the OLAP server in an optimized multidimensional format.
• This is the traditional mode in OLAP analysis.
• Data is stored in the form of multidimensional cubes and not in relational databases.
✓ Provides excellent query performance; the cubes are built for fast data retrieval.
✓ Calculations are pre-generated when the cube is created and can be easily applied while querying data.
✗ Handles only a limited amount of data: since calculations are pre-built when the cube is created, the cube cannot be derived from a large volume of data.
Big Data Characteristics
Velocity • Data in motion rather than data “at rest” as found in traditional databases.
• Constantly being transmitted through computer networks, sensors, mobile devices, and satellites.
• Processing big data involves the analysis of streaming data that arrives at a fast rate.
Variety • Different forms of data highly heterogeneous textual, video, XML, or sensor.
Veracity • Accuracy of the data being analyzed. Data can be corrupted in transmission, or can potentially be lost.
• Before analyzing any large data set, steps must be taken to improve the trustworthiness of the data.
Value • Advantages that the successful analysis of big data can bring to businesses, organizations, and research labs.
Hadoop
• System for the efficient storage and retrieval of large data files, based on a simple data model into which any data will fit.
Distributed file system (HDFS) • The Hadoop storage framework
- Single namespace for the entire cluster
- Replicates data 3x for fault-tolerance
MapReduce implementation • Parallel programming paradigm
- Executes user jobs specified as “map” and “reduce” functions
- Manages work distribution & fault-tolerance
Economical • It distributes the data and processing across clusters of commonly available computers (in thousands).
Efficient • By distributing the data, it can process it in parallel on the nodes where the data is located
Reliable • Automatically maintains multiple copies of data and redeploys computing tasks based on failures.
MapReduce Framework [introduced by Google] → Code usually written in Java. Two fundamental components:
Map • Filter and/or transform the input data into a more appropriate form in preparation for the reduce step.
• Master node takes large problem and slices it into smaller sub problems; distributes these to worker nodes.
• Worker node may do this again if necessary.
• Worker processes smaller problem and hands back to master.
Reduce • Perform calculation or aggregation over the data that it receives from the map step to achieve the results.
• Master node takes the answers to the sub problems and combines them in a predefined way to get the output/answer to
original problem.
• Used to do a programming job that requires reading the entire file to perform computation.
• The computation is designed to operate in parallel on the individual blocks of the file and then merge the results.
• Programming model for expressing distributed computations at a massive scale
Per cluster node:
Single JobTracker per master • Responsible for scheduling the jobs’ component tasks on the slaves and monitoring slave progress
• Runs on the master node; re-executes failed tasks; accepts job requests from clients
Single TaskTracker per slave • Executes the tasks as directed by the master
• Runs on slave nodes; forks a separate Java process for each task instance
Ex: Counting words: given a large collection of documents, output the frequency for each unique word.
1. When you put this data into HDFS, Hadoop automatically splits into blocks and replicates each block.
2. Input reader reads a block and divides into splits.
3. Each split would be sent to a map function. E.g., a line is an input of a map function.
4. The key could be some internal number, the value is the content of the textual line.
5. Mapper takes the output generated by input reader and output a list of intermediate <key, value> pairs.
6. There is shuffle/sort before reducing.
7. Reducer takes the output generated by the Mapper, aggregates the value for each key, and outputs the final result.
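To mirror this flow, a hedged word-count sketch using MongoDB's (now deprecated) mapReduce command, rather than Hadoop itself; the docs collection and its text field are assumptions:
var map = function () {
  this.text.split(/\s+/).forEach(function (word) {
    emit(word, 1);                // emit an intermediate <key, value> pair per word
  });
};
var reduce = function (key, values) {
  return Array.sum(values);       // aggregate the counts for one word
};
db.docs.mapReduce(map, reduce, { out: "word_counts" });
db.word_counts.find().sort({ value: -1 });   // most frequent words first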
Execution Details
Input reader • Divide input into splits, assign each split to a Map task, Input reader reads a block and divides into splits.
• Each split would be sent to a map function. E.g., a line is an input of a map function.
• The key could be some internal number, the value is the content of the textual line.
Map task • Apply the Map function to each record in the split, Each Map function returns a list of (key, value) pairs
Shuffle/Partition and Sort • Shuffle distributes sorting & aggregation to many reducers
• All records for key k are directed to the same reduce processor
• Sort groups the same keys together and prepares for aggregation
Reduce task • Apply the Reduce function to each key, The result of the Reduce function is a list of (key, value) pairs
Limitations
• Have to use the M/R model
• Not reusable
• Error prone
• For complex jobs → multiple stages of Map/Reduce functions