Database
=> Atomicity
- Means a TA is one unit of work and cannot be split.
- All queries in a TA must succeed for the TA to commit.
- If one query fails, all prior successful queries in the TA must roll back.
- If the DB goes down before the TA commits, all the successful queries in
the TA must roll back, so when the DB comes back up it should undo and
clear the uncommitted changes.
- Lack of atomicity leads to inconsistency.
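A minimal sketch of the rollback rule above, using Python's built-in sqlite3 module as a stand-in DB (the accounts table, the CHECK constraint and the transfer helper are made up for illustration). The second UPDATE fails, so the first one must be undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Run both UPDATEs as one TA; any failure rolls back all of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The credit to bob succeeds, the debit from alice violates the CHECK,
# so the whole TA is rolled back and bob's credit disappears too.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

After the failed transfer both balances should be exactly what they were before the TA started.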
=> Isolation
- The DB can have multiple TCP connections, and each might have its own
TA started. Can my in-flight TA see changes made by other TAs?
- Lost Update -> You wrote something in a TA but did not commit, and some
other TA overwrote the value you wrote. If you now read it back, your
value is not there, so your update is lost.
- Isolation levels for in-flight TAs (the SQL standard defines four
levels of transaction isolation; the strictest is Serializable).
- Serializable -> The standard defines it in a paragraph which says that
any concurrent execution of a set of Serializable transactions is
guaranteed to produce the same effect as running them one at a time in
some order. It is the slowest level.
(The other three levels are defined in terms of phenomena resulting from
interaction between concurrent transactions.)
- Read uncommitted -> No isolation; any change from outside is visible to
the TA, committed or not.
All read phenomena may happen.
- Read committed -> Each query in a TA sees only changes committed by other
TAs. This means that if you have a long-running TA and some other TA
commits something, your TA can read it. All read phenomena can happen
except dirty reads.
- Repeatable Read -> The TA makes sure that when a query reads a row,
that row remains unchanged for the rest of the TA (RR means I will read
the same value for the whole TA). It does not get rid of phantom reads:
they may still happen, but not in Postgres.
- Snapshot -> Each query in a TA sees only changes that were committed
before the start of the TA. It's like a snapshot version of the DB at that
moment. This gets rid of all read phenomena.
- DB implementation of Isolation
- Each DBMS implements isolation levels differently.
- Pessimistic - Row-level locks, table locks, page locks to avoid lost
updates.
- Optimistic - No locks; just track whether things changed and fail the TA
if so.
- Repeatable read can lock the rows it reads, but that gets expensive if you
read a lot of rows. Postgres implements RR as snapshot, which is why you
don't get phantom reads with Postgres in RR.
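A rough sketch of the optimistic approach, assuming each value carries a version counter that is bumped on every write. VersionedStore and write_if_unchanged are hypothetical names for a toy in-memory store, not how any real DBMS does it. The point is only that a stale writer fails instead of silently causing a lost update.

```python
class VersionedStore:
    """Toy key-value store where every value has a version counter."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        # A missing key reads as (None, 0) so the first write expects version 0.
        return self._data.get(key, (None, 0))

    def write_if_unchanged(self, key, new_value, expected_version):
        """Write only if nobody changed the key since we read it."""
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            return False  # someone else wrote first: fail this TA
        self._data[key] = (new_value, current_version + 1)
        return True


store = VersionedStore()
store.write_if_unchanged("stock", 10, 0)

# Two TAs read the same value and version, without taking any lock.
value_a, version_a = store.read("stock")
value_b, version_b = store.read("stock")

a_ok = store.write_if_unchanged("stock", value_a - 1, version_a)
b_ok = store.write_if_unchanged("stock", value_b - 3, version_b)  # stale
print(a_ok, b_ok, store.read("stock"))
```

The second writer has to re-read and retry. With pessimistic locking it would have waited on a lock instead.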
=> Where and how Postgres stores the data on disk
(https://www.udemy.com/course/sql-and-postgresql/learn/lecture/22802643#overview)
Watch the whole playlist.
=> Heap or Heap File -> File that contains all the data(rows) of our table
=> Tuple or Item -> Individual row from the table
=> Block or Page -> The heap file is divided into many blocks or
pages. Each page/block stores some number of rows, and a page is 8KB.
=> Heap / Hard-drive -> The heap is the data structure where the table is
stored, with all its pages one after another
-> This is where the actual data is stored, including everything
-> Traversing the heap is expensive, as we need to read a lot of data to
find what we want
-> That is why we need indexes, which tell us exactly which part of the
heap we need to read.
=> Index -> An index is another data structure, separate from the heap, that
has pointers to the heap.
-> It holds part of the data and is used to quickly search for something
-> You can index on one column or more
-> Once you find a value in the index, you go to the heap to fetch the rest
of the information, since everything is there.
-> The index tells you exactly which page to fetch from the heap instead of
taking the hit of scanning every page of the heap
-> The index is also stored as pages and costs IO to pull its entries.
-> The smaller the index, the more of it fits in memory and the faster the
search.
-> The popular data structure for indexes is the B-tree.
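A toy model of the heap vs index idea, assuming 4 rows per page and a plain Python dict standing in for the index (a real index would be a B-tree stored in its own pages, and reading it costs IO too, which this sketch ignores). All names here are made up for illustration.

```python
PAGE_CAPACITY = 4  # rows per page, tiny on purpose

# The "heap": all rows of the table, split into pages one after another.
rows = [(i, f"user{i}") for i in range(1, 21)]  # (id, name)
heap = [rows[i:i + PAGE_CAPACITY] for i in range(0, len(rows), PAGE_CAPACITY)]


def full_scan(heap, wanted_id):
    """Read page after page until the row is found; count pages read."""
    pages_read = 0
    for page in heap:
        pages_read += 1
        for row in page:
            if row[0] == wanted_id:
                return row, pages_read
    return None, pages_read


# The "index": id -> (page number, slot inside the page).
index = {
    row[0]: (page_no, slot)
    for page_no, page in enumerate(heap)
    for slot, row in enumerate(page)
}


def index_lookup(heap, index, wanted_id):
    """Ask the index where the row lives, then fetch only that heap page."""
    location = index.get(wanted_id)
    if location is None:
        return None, 0
    page_no, slot = location
    return heap[page_no][slot], 1


scan_row, scan_pages = full_scan(heap, 18)
idx_row, idx_pages = index_lookup(heap, index, 18)
print(scan_row, scan_pages)  # row found after reading every page up to it
print(idx_row, idx_pages)    # same row, one heap page
```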
=> INDEXES (Watch Database Indexing YouTube video from coding and concepts channel)
=> An index is a DS that you build on top of an existing table. It
looks through your table, analysing and summarising it, so that it can
create a shortcut for accessing the data in the table
=> It is used to improve the performance of DB queries, so data can be
fetched faster. Without an index the DB has to iterate over each and every
table row to find the requested data.
=> For a PK an index is created automatically, and for unique constraints an
index is also created automatically; these do not show in pgAdmin.
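A quick way to see the effect of an index on a query plan. This sketch uses Python's built-in sqlite3 (SQLite, not Postgres, so the plan text differs from what Postgres EXPLAIN prints); the users table and idx_users_email are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"user{i}") for i in range(1000)],
)


def plan(conn, sql, params):
    """Return SQLite's query plan as one string (detail is the last column)."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in plan_rows)


query = "SELECT id, name FROM users WHERE email = ?"
params = ("user42@example.com",)

plan_before = plan(conn, query, params)  # no index on email: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(conn, query, params)   # now a search through the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan should be a SCAN of the table and the second a SEARCH that names idx_users_email.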
=> The DBMS creates data pages (generally 8KB, but it varies from DB to DB).
Each page can store multiple rows
=> Page -> Depending on the storage model (row vs column store), the rows are
stored and read in logical pages
-> The DB doesn't read a single row; it reads a page or more in a
single IO, and we get a lot of rows in that IO.
-> Pages are fixed-size locations on disk
-> Each page has a size (8KB in Postgres, 16KB in MySQL)
=> A single page is 8KB, but not all 8KB is used to store table data.
Some bytes are used for the header and offsets; the rest is used for actual data.
eg. 8KB = 8192 bytes. Assume 96 bytes are assigned to the header, which stores
meta-data about the page like PageNo and how much free space is available.
36 bytes are assigned to the offset array (or footer): each entry of the array
holds a pointer to the corresponding data record in the same page.
The remaining 8060 bytes are for the actual data records. Now assume a row
size of 125 bytes: a single page can then hold floor(8060 / 125) = 64 rows.
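The arithmetic from the example above, written out. The 96-byte header, 36-byte offset array and 125-byte row are the assumed numbers from this example, not exact figures for any specific DB; the million-row table is an extra assumption added here.

```python
PAGE_SIZE = 8 * 1024   # 8KB = 8192 bytes
HEADER_BYTES = 96      # assumed header size
OFFSET_BYTES = 36      # assumed offset array (footer) size
ROW_BYTES = 125        # assumed row size

usable_bytes = PAGE_SIZE - HEADER_BYTES - OFFSET_BYTES
rows_per_page = usable_bytes // ROW_BYTES   # whole rows only
wasted_bytes = usable_bytes % ROW_BYTES     # space too small for another row

# How many pages a table with one million such rows would need (ceiling).
TOTAL_ROWS = 1_000_000
pages_needed = -(-TOTAL_ROWS // rows_per_page)

print(usable_bytes, rows_per_page, wasted_bytes, pages_needed)
```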
=> The DBMS creates and manages the data pages. For one table's data it can
create many data pages. These data pages ultimately get stored in the data
blocks on physical storage.
=> Data Block -> A data block is the minimum amount of data which can be
read/written by one I/O operation.
-> It is managed by the underlying storage system, like the disk. Data
blocks can range from 4KB to 32KB (a common size is 8KB)
-> So, depending on the data block size, a block can hold one or many data
pages.
=> The DBMS creates data pages, which get stored in data blocks, and the data
pages may end up in different data blocks. The DBMS manages the
mapping of data pages to their data blocks. Remember, the DBMS controls
data pages (like which rows go in which data page, or the sequence of pages)
but has no control over data blocks (data blocks can be scattered over the disk)
eg. DataPage1 => Data Block 1
DataPage2 => Data Block 1
DataPage3 => Data Block 2
DataPage4 => Data Block 3
=> B+ Tree
=> If a table has a million rows, a query can take up to O(N) to fetch data.
Which data structure provides better time complexity? The B+ tree: it
provides O(log N) for insertion, search and deletion.
=> B+ trees are self-balancing trees. They keep data sorted, and all leaves are
at the same level
=> An order-M B-tree means each node can have at most M children and M-1
keys
=> B-trees and B+ trees are the same except that in B+ trees the data lives
only in the leaf nodes and all leaf nodes are connected.
=> The DBMS uses B+ trees to manage its data pages and the rows within pages. (Watch
concept and coding index after 50 min)
-> The root node and intermediate nodes hold values that are used for
faster searching. A value there may already have been
deleted from the DB, but it is still used to order the tree
-> Leaf nodes actually hold the indexed column values of the table.
-> With the help of the B+ tree, the DBMS decides which rows go to which data
page, to manage/search the data efficiently.
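A small sketch of the B+ tree shape described above: internal nodes hold only separator keys for routing, leaves hold the actual entries, and leaves are linked for range scans. To keep it short the tree is bulk-loaded from already sorted, non-empty input instead of handling inserts with node splits, and it does not enforce a minimum node fill, so it is a simplification and not what a DBMS really runs. All names are made up.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # link to the right sibling leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys, one fewer than children
        self.children = children


def smallest_key(node):
    while isinstance(node, Internal):
        node = node.children[0]
    return node.keys[0]


def build(items, order=4):
    """Bulk-load from non-empty (key, value) pairs sorted by key."""
    per_leaf = order - 1  # at most M-1 keys per node
    level = []
    for i in range(0, len(items), per_leaf):
        chunk = items[i:i + per_leaf]
        level.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    for left, right in zip(level, level[1:]):
        left.next = right
    while len(level) > 1:  # stack internal levels until one root remains
        groups = [level[i:i + order] for i in range(0, len(level), order)]
        level = [Internal([smallest_key(c) for c in g[1:]], g) for g in groups]
    return level[0]


def find_leaf(root, key):
    node = root
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, low, high):
    """Find the first leaf, then follow the leaf links to the right."""
    leaf, found = find_leaf(root, low), []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return found
            if k >= low:
                found.append((k, v))
        leaf = leaf.next
    return found


tree = build([(k, f"row{k}") for k in range(1, 11)], order=4)
print(search(tree, 7), search(tree, 11))
print(range_scan(tree, 5, 8))
```

The range scan is where the linked leaves pay off: after one descent from the root, the scan never has to go back up the tree.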
=> Index Type (Watch concept and coding index after 1h 5min)
=> Clustered Index -> A clustered index is the one index per table that
uses the primary key to organize the data within the table.
The clustered index ensures that the primary key
is stored in increasing order, which is also the order in which the table is
stored on disk. In Postgres, clustering has to be requested explicitly
(the CLUSTER command), and the order is not maintained for later writes.
In DBs with clustered tables it is created when the table is created and
uses the primary key sorted in ascending order.
-> The order of rows inside the data pages matches
the order of the index.
-> The offset array manages the pointers to the data so that they follow
the sorted sequence of the index.
-> If you have not provided a clustered index manually,
the DBMS assumes the PK as the cluster key.
-> If there is no PK available, the DBMS creates an
internal hidden column which is used as the clustered index
(this column increases sequentially and is never
null)
=> Sometimes the table can be organised around a single index. This is
called a clustered index or index-organised table
- The PK is usually a clustered index unless otherwise specified
- MySQL InnoDB always has a PK; other indexes point to the PK value
- Postgres only has secondary indexes, and all indexes point directly to
the row_id (tuple id), which lives in the heap
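A toy comparison of the two lookup paths described above, with Python dicts standing in for the indexes. The table contents and names are made up, and real engines have many details that this sketch ignores.

```python
# InnoDB-style: the table itself is organised by PK (clustered index),
# and a secondary index stores the PK value of the row it points to.
clustered_by_pk = {
    1: ("alice", "alice@example.com"),
    2: ("bob", "bob@example.com"),
}
email_to_pk = {"alice@example.com": 1, "bob@example.com": 2}


def innodb_style_lookup(email):
    pk = email_to_pk[email]       # hop 1: secondary index -> PK value
    return clustered_by_pk[pk]    # hop 2: clustered index -> row


# Postgres-style: rows live in heap pages, and every index (the PK index
# included) stores a tuple id, modelled here as (page number, slot).
heap_pages = [
    [(1, "alice", "alice@example.com"), (2, "bob", "bob@example.com")],
]
pk_to_tid = {1: (0, 0), 2: (0, 1)}
email_to_tid = {"alice@example.com": (0, 0), "bob@example.com": (0, 1)}


def postgres_style_lookup(email):
    page_no, slot = email_to_tid[email]   # index -> tuple id
    return heap_pages[page_no][slot]      # tuple id -> row in the heap


print(innodb_style_lookup("bob@example.com"))
print(postgres_style_lookup("bob@example.com"))
```

One consequence of the second layout is that when a row moves to a different place in the heap, the indexes pointing at its old tuple id need attention, while in the first layout secondary indexes only care that the PK value stays the same.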