DBMS Unit 4 Notes
MySQL connector
There are many modules that help us connect Python to
MySQL. We are going to use MySQL Connector in our lesson.
• MySQL Connector/Python is a module or library available in Python
to communicate with MySQL.
Prerequisites
▪ You need root or administrator privileges to perform the
installation process.
▪ Python 2 or 3 should be installed on your system and available
on the system PATH.
▪ The MySQL server should be installed and running.
▪ The MySQL version should be greater than 4.1.
• If you face any issue while installing, specify the version of the
module explicitly and then try again:
pip install mysql-connector-python==8.0.11
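Once the module is installed, a minimal connection check can be run from Python. This is only a sketch: the host, user, password, and database values below are placeholders, not values from these notes.

import mysql.connector

# Placeholder credentials; replace with your own server details.
conn = mysql.connector.connect(
    host="localhost", user="root", password="your_password", database="testdb"
)
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())   # prints the MySQL server version
cursor.close()
conn.close()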
Semistructured Data
• Semistructured data is a type of data that does not conform to a
rigid, predefined structure like structured data but still
possesses some level of organization or hierarchy.
• Unlike structured data, which is typically stored in relational
databases with well-defined schemas, semistructured data
allows for flexibility in terms of data formats and attributes.
• Semistructured data can be displayed as a directed graph.
• The labels or tags on the directed edges represent the schema
names: the names of attributes, object types, and relationships.
• Internal nodes represent individual objects or composite
attributes.
• Leaf nodes represent actual atomic values.
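For instance, two records in a semistructured collection need not have identical attributes. The sketch below is illustrative only; the field names are not taken from these notes.

# Two semistructured records: the second has extra, nested attributes and
# omits "phone"; no fixed schema is imposed on the collection.
emp1 = {"name": "John Smith", "phone": "555-1234"}
emp2 = {"name": "Anita Rao",
        "skills": ["Java", "MongoDB"],
        "address": {"city": "Brooklyn", "zip": "11201"}}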
Unstructured Data
• Unstructured data refers to data that does not have a predefined
structure or format.
• It lacks a clear and organized schema, making it more
challenging to analyze and process compared to structured data.
• Ex: audio, text documents, web pages
NOSQL Systems
CAP THEOREM
• The CAP theorem states that a distributed data store can provide at
most two of the following three guarantees at the same time:
Consistency, Availability, and Partition tolerance.
COLLECTION
• In MongoDB, a collection is a grouping of documents, roughly
analogous to a table in an RDBMS.
RDBMS vs MongoDB
• RDBMS: not suitable for hierarchical data storage. MongoDB: suitable
for hierarchical data storage.
• RDBMS: vertically scalable, i.e., by increasing RAM. MongoDB:
horizontally scalable, i.e., we can add more servers.
• RDBMS: row-based. MongoDB: document-based.
Creating a collection
Example:
db.project.insert( { _id: "P1", Pname: "ProductX", Plocation: "Brooklyn" } )
For read queries, the main command is called find, and the format is:
db.<collection_name>.find(<condition>)
General Boolean conditions can be specified as <condition>, and the
documents in the collection for which the condition evaluates to true are
selected for the query result.
db.employee.findOne()
db.employee.find({"empid": 2})
db.employee.find({"age": {$lt: 30}})
Query: Return the employees born after ‘1995-08-07’ from
employee collection
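A possible answer, assuming the birth date is stored as an ISO date string in a field named dob (the field name is not given in these notes):
db.employee.find({"dob": {$gt: "1995-08-07"}})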
db.employee.find({"skill": "Java"})
db.employee.find({"skill": "Java", "salary": "75000"})
db.employee.find({$or: [{"skill": "Java"}, {"salary": "75000"}]})
db.employee.find({"skill": "Java", $or: [{"salary": "80000"}, {"salary": "95000"}]})
db.employee.find({}, {"firstname": 1})
db.employee.find({}, {"firstname": 1, "_id": 0})
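The same queries can also be issued from Python with the pymongo driver; a minimal sketch follows, in which the connection URI and database name are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection URI
db = client["test"]                                 # placeholder database name
# Same projection query as the last example above: firstname only, _id suppressed.
for doc in db.employee.find({}, {"firstname": 1, "_id": 0}):
    print(doc)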
UPDATE OPERATION
db.employee.update({"skill": "mongodb"}, {$set: {"salary": "1000000"}})
db.employee.update({"skill": "mongodb"}, {$set: {"salary": "1000000"}}, {multi: true})
The first form updates only the first matching document; passing {multi: true}
as the third argument updates all matching documents.
DELETING A COLLECTION
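An entire collection can be removed with the drop() method. For example, to delete the employee collection used above:
db.employee.drop()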
Replication in MongoDB
• The concept of a replica set is used in MongoDB to create multiple
copies of the same data set on different nodes in the distributed
system.
• It uses a variation of the master-slave approach for replication.
• The total number of participants in a replica set must be at least
three, so if only one secondary copy is needed, a participant in
the replica set known as an arbiter must run on the third node
N3.
• For read operations, the user can choose the particular read
preference for their application. The default read preference
processes all reads at the primary copy, so all read and write
operations are performed at the primary node.
• In this case, secondary copies are mainly to make sure that the
system continues operation if the primary fails, and MongoDB
can ensure that every read request gets the latest document
value.
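As an illustration of read preferences, the pymongo driver lets the default (primary) be overridden per collection. This is a sketch; the hosts, replica-set name, and database name below are placeholders.

from pymongo import MongoClient, ReadPreference

# Placeholder hosts and replica-set name.
client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")
employees = client["test"]["employee"].with_options(
    read_preference=ReadPreference.SECONDARY_PREFERRED
)
# Reads through this handle may be served by a secondary copy instead of the primary.
print(employees.count_documents({}))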
Sharding
• Sharding is a method for distributing or partitioning data across
multiple machines.
• It is useful when no single machine can handle large modern-day
workloads, by allowing you to scale horizontally.
• Horizontal scaling, also known as scale-out, refers to adding
machines to share the data set and load. Horizontal scaling
allows for near-limitless scaling to handle big data and intense
workloads.
Sharding in MongoDB
• When a collection holds a very large number of documents or
requires a large storage space, storing all the documents in one
node can lead to performance problems, particularly if there are
many user operations accessing the documents concurrently
using various CRUD operations.
• The values of the shard key are divided into chunks through
range partitioning or hash partitioning, and documents are
partitioned based on the chunks of shard key values.
Range Partitioning
• For example, if the shard key values ranged from one to ten
million, it is possible to create ten ranges—1 to 1,000,000;
1,000,001 to 2,000,000; ... ; 9,000,001 to 10,000,000—and each
chunk would contain the key values in one range.
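A minimal Python sketch of this range-partitioning rule, using the same chunk boundaries as the example above:

def range_chunk(shard_key):
    # Map a shard key in 1..10,000,000 to one of ten range chunks (0..9).
    return (shard_key - 1) // 1_000_000

# Keys 1..1,000,000 fall in chunk 0; keys 1,000,001..2,000,000 in chunk 1; and so on.
print(range_chunk(1), range_chunk(1_000_000), range_chunk(1_000_001))   # 0 0 1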
Hash Partitioning
• Hash partitioning applies a hash function h(K) to each shard key
K, and the partitioning of keys into chunks is based on the hash
values.
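A minimal sketch of hash partitioning in Python; the hash function here is purely illustrative, not MongoDB's actual hashing scheme:

import hashlib

def hash_chunk(shard_key, num_chunks=10):
    # Assign a key to a chunk based on its hash value (illustrative hash only).
    h = int(hashlib.md5(str(shard_key).encode()).hexdigest(), 16)
    return h % num_chunks

# Unlike range partitioning, nearby key values are usually spread across different chunks.
print(hash_chunk(1), hash_chunk(2), hash_chunk(3))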
Introduction
● Voldemort is an open source system available under the
Apache 2.0 open source license. It is based on
Amazon's Dynamo.
● High performance, high scalability, and high availability (i.e.,
replication, sharding, horizontal scalability) are realized
through consistent hashing, a technique for distributing the
key-value pairs among the nodes of a distributed cluster.
● Voldemort has been used by LinkedIn for data storage.
Features
1. Simple basic operations:
● A collection of key-value pairs is kept in a store, s.
● 3 operations: get, put, delete
○ s.get(k) : retrieves the value v associated with key k.
○ s.put(k, v) : inserts an item as a key-value pair with key k
and value v.
○ s.delete(k) : deletes the item whose key is k from the store.
● At the basic storage level, both keys and values are arrays of
bytes (strings).
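A minimal in-memory sketch of these three operations; this illustrates the get/put/delete semantics only and is not Voldemort's actual client API:

class Store:
    # Toy in-memory key-value store with the three basic operations.
    def __init__(self):
        self.items = {}

    def put(self, k, v):
        # Insert (or overwrite) an item as the key-value pair (k, v).
        self.items[k] = v

    def get(self, k):
        # Retrieve the value associated with key k, or None if absent.
        return self.items.get(k)

    def delete(self, k):
        # Delete the item whose key is k from the store.
        self.items.pop(k, None)

s = Store()
s.put("emp1", b"John Smith")    # at the storage level, values are byte arrays
print(s.get("emp1"))
s.delete("emp1")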
Introduction
● Another category of NOSQL systems is known as graph
databases or graph-oriented NOSQL systems.
● Data is represented as a graph: a collection of
nodes (vertices) and edges.
● Nodes and edges can be labeled to indicate the types of
entities and relationships they represent.
● Possible to store data associated with both individual
nodes and individual edges.
● Many systems can be categorized as graph databases.
● We will focus our discussion on one particular system,
Neo4j, which is used in many applications.
● Neo4j is an open source system, and it is implemented in
Java.
Neo4j Data Model
● Neo4j organizes data using the concepts of nodes and
relationships.
● Both nodes and relationships can have properties, which
store the data items associated with nodes and
relationships.
● Nodes can have zero or more labels, and nodes that have the
same label are grouped into a collection that identifies a
subset of the nodes in the database graph for querying
purposes.
● Relationships are directed; each relationship has a start
node and end node as well as a relationship type, which
serves a similar role to a node label by identifying similar
relationships that have the same relationship type.
● Properties can be specified via a map pattern, which is
made up of one or more "name : value" pairs, for example
{Lname: 'Smith', Fname: 'John', Minit: 'B'}.
● To create nodes and relationships in Neo4j we use the
Neo4j CREATE command, which is part of the high-level
declarative query language Cypher.
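A minimal sketch of issuing such a CREATE statement from Python with the neo4j driver; the connection URI, credentials, and the EMPLOYEE label are illustrative assumptions, not values from these notes:

from neo4j import GraphDatabase

# Placeholder URI and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Create one node with the label EMPLOYEE and three properties.
    session.run("CREATE (e:EMPLOYEE {Lname: 'Smith', Fname: 'John', Minit: 'B'})")
driver.close()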
Neo4j Data Model
● Nodes in Neo4j correspond to entities
● Node labels correspond to entity types and subclasses
● Relationships correspond to relationship instances
● Properties correspond to attributes
Neo4j ER Model
A node may have no label in Neo4j, whereas in the ER model every entity
must belong to an entity type.
Query Processing
Introduction
• The first action the system must take in query processing is
to translate a given query into its internal form. This translation
process is similar to the work performed by the parser of a
compiler.
1. Evaluation
With 4 KB blocks:
• The cost estimates do not include the cost of writing the final
result of an operation back to disk. These are taken into account
separately where required.
• In the worst case, we may assume that the buffer can hold only a
few blocks of data.
Selection Operation
• In query processing, the file scan is the lowest-level operator to
access data.
• File scans are search algorithms that locate and retrieve records
that fulfill a selection condition.
• The union of all the retrieved pointers yields the set of pointers to
all tuples that satisfy the disjunctive condition. We then use the
pointers to retrieve the actual records.
Sorting
Introduction
● In the first stage, a number of sorted runs are created; each run
is sorted but contains only some of the records of the relation.
● In the second stage, the runs are merged. Suppose, for now, that
the total number of runs, N, is less than M, so that we can
allocate one block to each run and have space left to hold one
block of output. The merge stage operates as follows:
● The output of the merge stage is the sorted relation. The output
file is buffered to reduce the number of disk write operations.
● Cost of seeks
○ During run generation: one seek to read each run and
one seek to write each run
○ 2⌈br∕M⌉
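A minimal Python sketch of the two-stage external sort–merge described above, assuming the relation is simply a list of records and M is the number of records that fit in memory; block-level buffering and the seek/transfer accounting are omitted:

import heapq

def external_sort(records, M):
    # Stage 1: create sorted runs of at most M records each.
    runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
    # Stage 2: merge the runs; merging all N runs in one pass assumes N < M,
    # as in the notes above.
    return list(heapq.merge(*runs))

print(external_sort([7, 3, 9, 1, 8, 2, 6], M=3))   # [1, 2, 3, 6, 7, 8, 9]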
Join Operation
● In the best case, there is enough space for both relations to fit
simultaneously in memory, so each block would have to be read
only once; hence, only br + bs block transfers would be required,
along with two seeks.
● We can use the nested-loop join to compute the join; assume that
student is the outer relation and takes is the inner relation in the
join.
● If we had used takes as the relation for the outer loop and
student for the inner loop, the worst-case cost of our final
strategy would have been 10,000 ∗ 100 + 400 = 1,000,400
block transfers, plus 10,400 disk seeks.
● The number of block transfers is significantly less, and
although the number of seeks is higher, the overall cost is
reduced, assuming tS = 4 milliseconds and tT = 0.1
milliseconds.
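A minimal sketch of the tuple-level nested-loop join on the student/takes example, with student as the outer relation; the attribute name ID and the sample data are illustrative:

# Each relation is a list of dicts; the join condition is student.ID = takes.ID.
student = [{"ID": 1, "name": "Ana"}, {"ID": 2, "name": "Raj"}]
takes = [{"ID": 1, "course": "DBMS"}, {"ID": 1, "course": "OS"}, {"ID": 3, "course": "ML"}]

result = []
for s in student:          # outer relation: scanned once
    for t in takes:        # inner relation: rescanned for every outer tuple
        if s["ID"] == t["ID"]:
            result.append({**s, **t})
print(result)              # two joined tuples, both for ID = 1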
Merge Join
● A group of tuples of one relation with the same value on the join
attributes is read into Ss.
● Then, the corresponding tuples (if any) of the other relation are
read in and are processed as they are read.
● Figure 15.8 shows two relations that are sorted on their join
attribute a1.
● It is instructive to go through the steps of the merge-join
algorithm on the relations shown in the figure.
● If there are some join attribute values for which Ss is larger than
available memory, a block nested-loop join can be performed for
such sets Ss, matching them with corresponding blocks of tuples
in r with the same values for the join attributes.
● Once the relations are in sorted order, tuples with the same
value on the join attributes are in consecutive order.
● Suppose that the relations are already sorted on the join attribute
ID. In this case, the merge join takes a total of 400+100 = 500
block transfers.
● The heuristic query optimizer will transform this initial query tree
into an equivalent final query tree that is efficient to execute.
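A minimal Python sketch of the merge-join matching described above, assuming both inputs are lists already sorted on the join attribute a1 and that each group Ss of equal-valued s-tuples fits in memory:

def merge_join(r, s, attr="a1"):
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][attr] < s[j][attr]:
            i += 1
        elif r[i][attr] > s[j][attr]:
            j += 1
        else:
            # Read the group Ss of s-tuples that share this join value ...
            v, Ss = r[i][attr], []
            while j < len(s) and s[j][attr] == v:
                Ss.append(s[j])
                j += 1
            # ... and match every r-tuple with that value against Ss.
            while i < len(r) and r[i][attr] == v:
                result.extend({**r[i], **t} for t in Ss)
                i += 1
    return result

r = [{"a1": 1, "x": "p"}, {"a1": 3, "x": "q"}]
s = [{"a1": 1, "y": "u"}, {"a1": 1, "y": "v"}, {"a1": 2, "y": "w"}]
print(merge_join(r, s))   # two joined tuples, both for a1 = 1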
2. Materialized Evaluation:
a. Intermediate Results: Materialized evaluation creates and
stores intermediate results in temporary tables. These
results can be indexed and optimized, potentially leading to
better query performance.
b. Higher Latency: Due to the need to store and retrieve
intermediate results, materialized evaluation can introduce
additional latency in query execution.
1. File Information: For each file, you need to know its size, which
includes the number of records (r), the average record size (R),
the number of file blocks (b), and the blocking factor (bfr).
2. File Organization: Understand the primary file organization,
which can be unordered, ordered by an attribute (with or without
an index), or hashed on a key attribute.
3. Index Information: Keep track of primary, secondary, or
clustering indexes and their indexing attributes. Note the number
of levels (x) in multilevel indexes and the number of first-level
index blocks (bI1) for certain cost functions.
4. Attribute Details: Collect data on the number of distinct values
(NDV) of an attribute in a relation R and the attribute selectivity
(sl), which is the fraction of records satisfying an equality
condition on the attribute. This helps estimate the selection
cardinality (s = sl * r), which is the average number of records
meeting an equality condition.
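For example (illustrative numbers, not from these notes): if a relation has r = 10,000 records and an attribute has NDV = 50 distinct values that are roughly uniformly distributed, then sl ≈ 1/50 = 0.02, and the selection cardinality is s = sl * r = 0.02 * 10,000 = 200 records on average.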