
Modern Application Development - I

Professor Nitin Chandrachoodan


Department of Electrical Engineering
Indian Institute of Technology, Madras
Data Search

Hello, everyone, and welcome to this course on Modern Application Development.

(Refer Slide Time: 00:16)

So, before getting to the different types of databases, it is worth spending a little bit of time
understanding how data search works.
(Refer Slide Time: 00:23)

And one of the things that I will be using over here is something called the big O notation,
written as O followed by parentheses. You might have come across this kind of notation.
In theoretical computer science, this is referred to as big
O because there are also a corresponding small o and a couple of other
variants; big O is the one that is most commonly used. And it roughly means
order of magnitude. So, what we are trying to convey when we use big O notation is the
order of complexity of the computation that you are trying to perform.

Now, I am not going to go into any details of big O notation or how it is used in practice; you would
need a proper course on algorithms for that, and hopefully most of you have
either already done one or will be doing one at some point. Over here I am only going to touch
upon some of the very basic concepts that we need. And
the first thing to keep in mind is that O(1) essentially means constant time, independent
of the input size.

So, in other words, if I have, let us say, records of 1 million users stored somewhere in a
database, but if I make a query about something, I do not know what, but if I make some kind of
a query and I am able to get back a response in O(1) time or constant time, that is excellent.
What it means is that the time required to respond did not depend on the number of users or the
number of entries in the database. That is sort of the best-case scenario. You cannot really hope for
anything better than that. But it is also unrealistic, because anything which returns an answer
independent of the number of entries that are there in the database cannot really be doing very
useful work, because it is clearly not even looking at the data in some way.

Now, the next best, in some ways, is what is called O(log N). And when we say O(log N), what
we are saying is that, if I have n records sitting inside my database, I am able to, let us say,
retrieve one of them in log n number of steps or at least some constant times log n number of
steps. Which means that as the database grows in size, let us say it grows by a factor of 10,
the time taken to answer my query only goes up by a small increment. It does not go up 10
times in particular. So, if the data, the number of data entries goes up by a factor of 10, the time
does not go up by a factor of 10. It goes up by something much less. This is also very good. I
mean, this is pretty much the best that you can hope for in practical scenarios.

O(N) is sort of the obvious thing. If I have a million entries, then how do I find one entry over
there? I search through each one until I hit the entry that I am looking for. So, that is sort of the
simple solution in most cases. In many instances, O(N) is probably good enough, but not for
things like database search.

After that we have the big Os that are worse than O(N). You can have O(N^k); if k is
2, you would call it O(N^2), which is quadratic; if k is 3, you would have O(N^3), cubic, and so on.
These are not particularly good. It means that the time required for performing an operation
grows rapidly compared to the increase in size of the data itself. But even that is not as bad as
O(k^N), which is what we call exponential.

And exponential is very bad, meaning that even for relatively small values of N, it is not going to
work very well. Imagine that k was equal to 2, what it means is that for every one unit increase in
N that is for every record that you have added into the whatever your data store is, the time
required for performing the operation is doubling. Just add one more entry and now you have to
do double the amount of work. So, clearly, this is very bad.

There are problems where the best algorithms known are even worse than
exponential; you can basically go to factorial and other functions. But we do not even consider
those; there is no point in really looking at those kinds of algorithms. So, clearly, we
would like to see whether we can get to things which are O(1) or O(log N), or at worst O(N).
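To get a feel for how different these growth rates are, here is a small, purely illustrative Python snippet (the values of N are arbitrary, not from the lecture) that prints the rough step counts implied by O(log N), O(N) and O(N^2):

import math

# Illustrative only: rough "step counts" implied by each complexity class
# as the number of records N grows by factors of 100.
for n in (10, 1_000, 100_000):
    print(f"N={n:>7}  O(log N)~{math.ceil(math.log2(n)):>3}  "
          f"O(N)~{n:>7}  O(N^2)~{n**2:>12}")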

(Refer Slide Time: 05:09)

So, with that in mind, let us look at how we would actually search for an element in memory.
And let us assume that somehow the data which I need has been stored in memory using some
kind of a linked list. And what I mean by a linked list is there is some data out here, A, and that
has a pointer to the next element B, that in turn points to the next element C, and so on until I hit
the end of the link, end of the chain. So, now how do I search for B over here? I start from A,
move on down the line, go to B, I hit the correct entry. I come back. If I want to search for C, I
would need to go through A, then to B, then to C and stop.
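As a rough sketch of what this walk looks like in code (the class and names here are illustrative, not from the lecture), a linked-list search in Python could be written like this:

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node

def find(head, target):
    """Walk the chain node by node until the target is found: O(N) time."""
    current = head
    while current is not None:
        if current.value == target:
            return current
        current = current.next_node
    return None

# Build the chain A -> B -> C and search for "B".
head = Node("A", Node("B", Node("C")))
print(find(head, "B").value)   # "B", reached only after stepping past "A"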

Clearly, this algorithm is order of N in terms of the runtime, because in the worst case, I would
need to go through all N elements in order to get to the final result. In the average, if I do not
know which one I am searching for to start with, I can sort of expect that it is probably going to
be somewhere in the middle and the time taken is going to be N/2, which is O(N). Remember
that O(N) is only order of magnitude, so any such by 2 and such constants are ignored. It is still
proportional to N. That is all that we care about.

So, in terms of database searching, this is not particularly good. Why? Because typically you would
have to search many times in a database. And as the size of the database grows, if each of those
searches is going to take time proportional to the number of elements, you might run into serious
problems after some time.

(Refer Slide Time: 06:51)

Now, what if this data was sitting in an array instead? So, it is an array in memory. What is the
main difference? I know that the first element is stored here, the second element is stored here,
the third element is stored here, and so on, which means that if I need to access, let us say, X[i] I
can directly jump to that location.

The problem is, I do not know what i is. I do not know which location to jump to. And so I still
have to start from here, then go to the next element, then go to the next element until I finally hit
the thing that I was searching for, which means that, because the data in the array was not sorted,
I had to start from the beginning and go step by step, and therefore the time remains O(N).
(Refer Slide Time: 07:42)

On the other hand, what if I had data in an array that I have guaranteed somehow is actually
sorted? What I mean by sorted is basically that I guarantee that A<B<C….<Z, that the entries
sitting inside this array satisfy some kind of a comparison property. And what do I do with this
comparison property? I can basically say that, now I will start by looking over here. So, this is
the first comparison. And let us say that I am searching for D or let us say I am searching for the
letter P.

So, what happens over here is that the first element I hit is midway through, which let us say is M. I
know that because M<P, P cannot be in the first half, so I can basically rule out half the
elements from the search. Now I again go and look at the second half: if I assume there are
26 elements, I am going to look at elements 14 to 26, so I check
element number 20, which would probably be the element T. And then after iterating a
few more times, I will finally end up at P, not in N steps but in some K steps.

Now, what did we actually do over here? Since I had this comparison, the less than property that
I could check, I could straightaway eliminate sort of large chunks of this memory. And basically
say, okay, after the first step, this first half is eliminated, after the second step, the second quarter
is eliminated, which means that at each stage I am going N to N/2 to N/4, at what point does it
become equal to 1.
And if you think about it, basically what this is saying is that this K is basically going to be log to
base 2 of N, the ceiling of that. I mean, it has to be an integer number. So, the smallest integer
that is greater than this log N to base 2. That is all that you really need in order to reduce the
number of elements in your array to 1.

(Refer Slide Time: 10:23)

So, what we have over here is, we can actually bring this entire thing down from O(N) search to
something which basically finishes in O(log N). This is called binary search, and it is
very commonly used in various kinds of search optimizations. So, what we can see is that, if you
are somehow able to store data in some form that is sorted and has some kind of comparison
functions and easy ways to access locations, you can actually do searching within O(log N)
steps, something to keep in mind when we start looking at actual databases.
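A minimal sketch of binary search in Python, assuming the data is already sorted (the letters A to Z here simply stand in for the sorted records of the example):

import math

def binary_search(sorted_items, target):
    """Repeatedly halve the search range of a sorted list: O(log N) steps."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1          # target must be in the right half
        else:
            hi = mid - 1          # target must be in the left half
    return -1

letters = [chr(c) for c in range(ord("A"), ord("Z") + 1)]   # sorted A..Z
print(binary_search(letters, "P"))            # index 15
print(math.ceil(math.log2(len(letters))))     # at most 5 comparisons for 26 items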
(Refer Slide Time: 10:58)

Now, arrays by themselves have certain problems. I mean, the biggest problem with them is that you
need to fix the size of the array ahead of time. Adding new entries is a problem; I need to resize
the whole array. Deleting an entry is also a problem, because then I need to push everything back
up. I need to keep the size of the array fixed. Otherwise, there is no point. I lose all the benefits
of this sorting that I was talking about.

(Refer Slide Time: 11:24)


So, because of that, there are alternatives that have been proposed. And once again, you would
have probably come across all this in a data structures and algorithms course. A binary search
tree is a good way of solving this problem. It is a nice, efficient way of maintaining a sorted order:
essentially, we have something called a root, and it has the property that everything on one side is
less than the root and everything on the other side is greater than the root. And similarly, I could have
multiple branches further down.

Once again, one side would be less than and the other greater than, but only greater than its
immediate parent, not necessarily greater than the root. Just being on the left-hand side of the
root guarantees that this element is less than the root; being
on the right-hand side of its immediate parent means it is greater than that parent. So, in this
way, we build up some kind of a tree structure. And it turns out that this binary search that we
talked about works brilliantly on this as well.
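The following is a minimal, unbalanced binary search tree sketch in Python, just to make the left-smaller / right-larger property concrete; the keys are arbitrary letters chosen for illustration:

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None    # subtree of keys smaller than this key
        self.right = None   # subtree of keys larger than this key

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Each comparison discards one whole subtree, just like binary search."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

root = None
for k in ["M", "F", "T", "B", "H", "P", "Z"]:
    root = insert(root, k)
print(search(root, "P") is not None)   # True

Note that this sketch has no balancing, so an unlucky insertion order can degrade it to O(N); that is exactly the problem the red-black trees, AVL trees and B-trees mentioned next are designed to fix.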

Now, there is a problem with binary trees. Once again, I am not getting into details, because it is
out of scope of this course. The problem is so-called balancing: trees very quickly get
unbalanced. The problem can be solved; there are so-called red-black trees, AVL trees and various
other kinds of things. In particular, there is a whole set of structures called B-trees, which not
only allow you to build efficient search trees, but are also friendly to storing data on storage
mechanisms such as disk.
All the other data structures that we talked about generally assume that you have RAM, random
access memory, and that that is where you are really storing the data. B-trees, on the other hand,
explicitly take into account the fact that you might be storing data on some other kind of
storage mechanism. And what is the important part?

The disk means that you will only be able to retrieve or write data in chunks of some size, at
least, that is when it becomes efficient. So, B trees are sort of built around that whole idea that
they are supposed to be disk friendly, while at the same time having many of the benefits of
efficient data storage and retrieval.

There is also something else called a hash table. And the hash table, once again, you will
probably come across this in the data structures course. But the bottom line is that, if you have
some kind of a magic function which can take whatever you are searching for and instantly
compute an index for it, a number, and let us say that your storage mechanism was to say that
whatever number I get out of that hash, I will use that as the index in memory where I am going
to store this particular element.

If I can do that, and it is a big if in some cases, you can actually do searching in unit time, constant
time O(1), because all you need to do is compute the hash of whatever you are searching for, go
check that memory location, either it is there or it is not. So, in the cases where you can do it at
least or you can do it efficiently, it turns out that nothing can really beat this, but it is not always
applicable. There are certain specific conditions where it can be used.
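In Python, the built-in dict is exactly such a hash table, so a small illustrative example of constant-time lookup (the keys and values here are made up) looks like this:

# The key is hashed to find the storage slot, so a lookup is (amortized)
# constant time, independent of how many entries the dictionary holds.
students = {"CS1001": "Asha", "EE2042": "Karan", "ME3107": "Divya"}

print("EE2042" in students)     # True  -- one hash computation, one slot check
print(students.get("XX0000"))   # None  -- also constant time, even with millions of keys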
Modern Application Development – I
Professor Nitin Chandrachoodan
Department of Electrical Engineering
Indian Institute of Technology Madras
Database Search

Hello, everyone, and welcome to this course on modern application development.

(Refer Slide Time: 00:16)

So, what do all of these things that we considered about searching for data in some
storage mechanism have to do with searching in a database? The most common
databases have a so-called tabular structure. There are many tables, each of which has many
columns.
So, for example, I might have something called a student table where as you can imagine, it
would have a roll number, it would have a name, it would probably have a department, and so
on. And I might have another course table, which has an id number for the course, a name, a
description, and then I could also have a sort of combined table where I have a student_id,
and a course_id, which basically tells me, which student is registered for which course.

So, as long as I can create these kinds of tables, it means that I can store a lot of useful data,
and we have already come across this earlier. This is basically the most
important part as far as model structure is concerned. We need to be able to create
relationships between different columns and between different tables, and those relationships
are what ultimately drive the usefulness of the application.
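As a concrete sketch of such tables, here is roughly what the schema could look like using SQLite from Python; the table and column names are made up for illustration, not taken from the lecture slides:

import sqlite3

# Illustrative schema only; an in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        roll_number TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        department  TEXT
    );
    CREATE TABLE course (
        course_id   TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        description TEXT
    );
    -- The relationship table: which student is registered for which course.
    CREATE TABLE registration (
        student_id  TEXT REFERENCES student(roll_number),
        course_id   TEXT REFERENCES course(course_id)
    );
""")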

Now, if I want to search through all students, I cannot guarantee that the order in which I get
the list of students is going to already be sorted. I have to basically make the entries as and
when they come and register. So, what should I do? Should I build a big
array somewhere in memory in order to store this table? After all, tables are well
suited to arrays; they have the same kind of structure. There is a particular memory location or
a row number, where if I go I will find all the data corresponding to one particular entry.

So, it looks as though maybe I can use arrays. The problem is two people can have the same
name. So, I then need to find something that is unique. I assign a roll number. That is better,
because roll numbers are guaranteed to be unique. So, what should I do? Should I just
store the data as it is, or should I create something called an index?

An index is actually pretty much just a copy of the data present in one of the columns of a
table, but now, in sorted order. So, you basically take all the roll numbers, and you create an
index out of it. And what does that index do? It basically creates another copy, but now this is
sorted and also has the appropriate links into the correct row in the original table.

So, let us say that I have an index on roll number and I want to search for the entry
corresponding to a given roll number: I just search in this sorted list. I know I can do that in
O(log N) time. And once I have the particular entry that I have found, I just go to the actual
table and pull out that row, and get all the other entries, the student’s name, the department
and all other information I need.
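A small sketch of what creating and using such an index might look like with SQLite from Python; the table, column and index names are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_number TEXT, name TEXT, department TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(f"R{i:06d}", f"name{i}", "EE") for i in range(10_000)])

# The index is, in effect, a sorted copy of the roll_number column with
# pointers back into the table, so lookups by roll number become O(log N).
conn.execute("CREATE INDEX idx_student_roll ON student(roll_number)")

row = conn.execute("SELECT * FROM student WHERE roll_number = ?",
                   ("R004242",)).fetchone()
print(row)   # ('R004242', 'name4242', 'EE')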

So, building the right kind of index is very important to the performance of your application.
You need to know what kind of index to build and how it will change the behavior:
if you do not construct the right kind of index, searching can perform pretty poorly.
If you create too many indexes, it might end up taking up too much space. So, there are a lot
of engineering trade-offs that come in over here.

(Refer Slide Time: 04:11)

Now, obviously, we do not have the time to get into all of this in detail. And in fact, it is not
even possible simply because you might not choose the exact same database that I
recommend or that I have used in some application. You might have a different reason for
finding something else. So, the point is, you should be able to look up documentation and
find out more about it by looking for it.

And a good example is the MySQL database. They, in fact, have a lot of documentation
available online, and they talk about the performance comparisons of different
kinds of indexes. In particular, they consider two things: something called a B-Tree, and the
other called a Hash.

The B-Tree is basically, like I said, a variant of the binary tree and is the most common kind
of index. So, I would strongly recommend that documentation to anyone who is at least interested in building
larger-scale applications. If you are building a very small application, probably most of this
does not matter. But the moment you want to go even to a few hundred entries, these things start
to matter. And anyway, it is good practice to know the tools that you are working
with. So, if you are using MySQL, where should you look?
(Refer Slide Time: 05:30)

And this is an example from that link that I just showed you. What we can see in
MySQL, for example, or pretty much any indexed database, is this: let us say I have a table in which
I am searching, so I want to do a SELECT * and the search condition is key_col LIKE
'Patrick%'.

Now, what exactly does this search do? It will search through the database for all
entries like, let us say, Patrick X or Patrick YZA; all those entries will match. But, let us say,
Pattrick will not match because it is a different name.

So, the point is, because of the fact that it is Patrick, I know the beginning part of the name, I
can do this kind of binary search in order to narrow down the elements that I need to look at
in more detail. Because I can say for sure that anything which does not start with a P can
straightaway be eliminated; among those that start with P, I can eliminate things that do not
have 'a' as the next letter.

How do I do this? I basically go through the tree and narrow down the section within which I
need to look. And because of the fact that it is starting from the first letter, I can actually take
this entire tree and narrow it down to a specific subsequence, which says: okay, this was the root, this
was all things starting with P, this was all things with second letter 'a', this was all things with 't',
and so on. And eventually, I will come down and say, okay, this is the subtree
that I really need to be looking at, and I can discard the rest of it.

Now, what happens if I have a query which goes something like Pat% followed by something
else, say a pattern like Pat%_ck? This is a bit more tricky, because what has happened over here is
that I can use the index up to Pat, up to this point in the tree; after that, I need to start searching. So, all
I can do is narrow it down up to this point, and from there I then need to switch over and go into some
other kind of string search, where I basically look at all possible entries for the fourth character onward,
and then check whether characters 5, 6, 7 match with the _ck part. At least this Pat part of it allowed
me to use the index.

(Refer Slide Time: 8:06)

But what if my search was something like this, where the first character itself is a percent sign?
Which way should I go? I do not know. Maybe there is a P somewhere over here,
maybe there is a P somewhere over there. I do not know
how many letters can be present before the P. Which means that I cannot use the index,
I cannot use the tree that I have constructed, at all.

Similarly, if I am trying to say that the key column needs to be LIKE some other
column, I do not know ahead of time: I will basically need to look at this column, look at
that column, and then search through all of them and find out whether
there is a match or not.

Which means that this index, which was created on the key column, is of no use for answering
a query of that sort. So, please keep this in mind. And the reason is simply because ultimately
you are constructing queries. For the most part, at least in this course, we are trying to use an
ORM like SQLAlchemy, which takes care of a lot of this for you. The problem is, it is not
going to tell you if the queries that you are constructing are poor.
So, if you do some kind of complicated query or something which does not make use of the
indexes, SQLAlchemy is not really going to tell you about it. There are ways of profiling
the database, which will finally come up with information that is useful and
tell you that, look, this is not really the way you should have done it.
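One simple way to see whether a query can use an index is the query planner output. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for the MySQL documentation example; the exact planner behaviour, especially around LIKE and collations, varies between engines and versions, so treat the comments as indicative only:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (key_col TEXT, other_col TEXT)")
conn.execute("CREATE INDEX idx_key ON users(key_col)")
# SQLite only rewrites a prefix LIKE into an index range scan under certain
# collation settings; this pragma is one way to enable that here.
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(query):
    # EXPLAIN QUERY PLAN reports whether the index is used or the table is scanned.
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print(row[-1])

plan("SELECT * FROM users WHERE key_col = 'Patrick'")      # equality: uses the index
plan("SELECT * FROM users WHERE key_col LIKE 'Patrick%'")  # known prefix: can use the index
plan("SELECT * FROM users WHERE key_col LIKE '%trick'")    # leading %: full table scan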

(Refer Slide Time: 09:38)

You can also create so-called multicolumn indexes. So, let us say that I have three columns
in a table, or rather I have many columns, and I use this one as index part 1, let us say this one as
index part 2 and this one as index part 3. The order of the index parts has nothing to do with the
order of the columns in the table.

It is a compound index. So, what does that do? It basically takes an element from here, takes
the corresponding element from here, takes the corresponding element from the i3 column
and puts them together and creates this new string and this is what the index is created on.

What do I mean by creating an index? Like we said, I mean, it basically creates a copy of all
of them and sorts them, with links back into the database. So, how this would work in
practice is that, effectively, it is first sorted on index part 1.

As an example, let us say that I want to create a compound index on date of birth, city of
birth, and the name of a person. What it would do is create this compound
index sorted first on index part 1, which is the date of birth, then on index part 2, the city of birth, and
finally on index part 3, which is the name of the person. If I wanted to query for all people born on
the same date, perfect: it just picks up the first matching entry and is able to narrow it down and tell me
exactly who all the people born on a given date are.
Born on a given date in a given city, again, good. But born on a given date and with the same
name, but not necessarily in the same city, is not particularly good, because I cannot make use of
this combined index beyond the date of birth at this point. I will only be able to
query for those born on the same date of birth; then I will have to narrow down the ones with
the same name, and I cannot use this index for that (I might have another index for it), and then finally
go and narrow down by city of birth if required.

So, how you construct indexes has to be determined by the application developer. You need
to sort of have an idea of what kind of queries you are likely to see in practice, and based on
that, try and optimize.
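A sketch of the same idea in SQLite from Python (the column names are illustrative); the comments describe the usual leftmost-prefix behaviour of a compound index:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (date_of_birth TEXT, city_of_birth TEXT, name TEXT)")

# Compound index: sorted first by date of birth, then city, then name.
conn.execute("""CREATE INDEX idx_dob_city_name
                ON person(date_of_birth, city_of_birth, name)""")

# Uses the index well: the leftmost columns are all constrained.
conn.execute("SELECT * FROM person "
             "WHERE date_of_birth = '2001-04-16' AND city_of_birth = 'Chennai'")

# Can only use the date_of_birth part of the index: city_of_birth is skipped,
# so the name condition has to be checked row by row within that range.
conn.execute("SELECT * FROM person "
             "WHERE date_of_birth = '2001-04-16' AND name = 'Karan'")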

(Refer Slide Time: 12:04)

So, once again, to reiterate this: if we have multicolumn indexes, and my query has a part which constrains
index part 1 and index part 2, plus some other part, it is not really using index part 3, but this is still good.
At least it is making use of index parts 1 and 2. Or here, if I have index=1 together with some other
condition, at least that part of it is able to use the index; the other condition it is not really going to use.

What if I had index_part1='hello' and index_part3=5, but I have not specified what index_part2
is? It will just ignore the index_part3 condition and do the indexed lookup based on index_part1. And
let us say that I had something like index=1 and index=2, or index=1 and index=3: it can use the index on
index part 1, but it cannot really make use of the part 2 and part 3 conditions. So, the search
engine inside the database will automatically figure out what is the best that it can do and try
and optimize for that.
(Refer Slide Time: 13:15)

On the other hand, what if I ignore index part 1 altogether? This is bad; it basically cannot use that index
at all. Over here, with something like index=1 OR A=10, I cannot really use the index at all, because the
OR A=10 is a completely different condition, which cannot really use an index at all.

And then I have something else, which has the same problem: it
is an OR statement, whereas a compound index is essentially an AND. It is taking index_part1=
something and index_part2=something else, and an OR breaks that. So, there are many queries
that you can easily create in this way which are unfriendly to these kinds of multicolumn indexes.

So, should you even use multicolumn indexes? Should you use only single-column indexes? All that is
essentially something for which you need to go much deeper into databases in order to
understand the details. Why is this important for an app developer to know?
Because there is no unique answer that is going to come from the database side. Only you, as
the application developer can say, in this particular case, I will need to access these entries
more and therefore, I should create an index on this.

The database at best can give you a recommendation, saying, look, you probably should build
an index on this. More importantly, there will be database engines that sort of keep track of
the queries that you have been running, and then can sort of recommend options where you
should be creating indexes to get better performance.
(Refer Slide Time: 14:52)

Now, finally, Hash indexes. This is something, which is again available in MySQL. But
remember, like I said, the use of a Hash table is only for equality comparisons. In other
words, what it is doing is, taking what you are searching for, let us say the name of a person
and computing some function of that and directly saying, look, this is where the entry for this
person is stored.

If I can create something of that sort, then fantastic. I mean, it is going to be faster than any of
the B-Tree-based searching that I could do earlier, but it cannot, for example, handle a range
of names. All people whose names begin with N: it cannot really do something of that sort.
You give it an exact name, and it can check whether it is present or not.

Similarly, if you need to search for all people whose name begins with N, and you know,
order them by age or something of that sort, it does not help. So, where it is applicable, it can
be very fast, but the question is, where is it applicable?

It is provided because there can be applications where you need to use something of this sort.
MySQL, for example, does have something of the sort, so if you find that it is the right fit for
your application you should use it.
(Refer Slide Time: 16:04)

So, various databases have their own takes on query optimization. MySQL has a whole lot of
information, like what I just showed you. SQLite has its own notes on query
optimization as well. Postgres, in fact, is sort of a research database in some
sense, and they even have things like using genetic algorithms in order to optimize query
structures.

So, these are all different possibilities. And in fact, this is, to some extent, a
topic of research even now. A large part of databases is very well studied, so there is not too
much new happening out here, but on the other hand, there are places where you might be
able to make some fundamental changes as well.

(Refer Slide Time: 16:52)


So, to summarize, setting up the queries properly can have a huge impact on the application
performance, and building proper indexes is crucial to search. Now, you might think that you
are only going to build small apps, but you do not know at what point it stops being small and
starts being significant, because even something with a few hundred entries in a database can
get really badly impacted if your indexes and your data storage mechanisms are bad.

The moment you hit a few thousand entries, this matters: log to base 2 of 1000 is about 10, which means
that either you are taking 1000 steps to find something or you are taking only 10 steps, so you have a
factor of 100 difference in the time you are likely to spend in searching for something.
That is a very rough estimate, but still you get the idea. So, do not underestimate the
value of creating the right kind of indexes.

Now, creating too many indexes, on the other hand, is a bad idea. You will end up wasting
space and it just confuses the database: it does not know how to search, or rather, which
index is the right one to use. And ultimately, this whole idea of how to structure the data and
put it in properly, the so-called normalization step that is used in databases, which again is
completely out of scope of this course, is something which you need to pay attention to,
especially if you expect your application to grow larger with time.
Modern Application Development - I
Professor Nitin Chandrachoodan
Department of Electrical Engineering
Indian Institute of Technology, Madras
Memory Hierarchy

Hello, everyone, and welcome to this course on Modern Application Development.

(Refer Slide Time: 00:16)

Hello, everyone, and welcome to this lecture as part of the online B.Sc. course. So far, we have
been looking at various aspects of the process of developing a web app and we have got up to
RESTful APIs. And we now have a pretty good idea of what the structure of the code should look
like, what are all the components that should go into the code, what are the parts that take care of
handling the data model, what are the parts that take care of presenting the user with views and
the controller which basically interacts between the two.

So, now, in this part of the course, we are going to start going a little
bit deeper into some of the aspects that have already been touched upon to some extent in the
previous lectures, but are ultimately important for an app developer to know in order to make the
best use of the resources that you have. So, in this video, we are going to look at backend
systems: what exactly do I mean by backend systems, and what are the various variants that
we need to keep in mind when looking at this?
(Refer Slide Time: 01:18)

So, to start with, I am going to have a brief digression. This is completely unrelated to apps in
general, but it is more to do with computers and how they are built and how they operate. And
take a little bit of a dive into the so-called memory hierarchy.

(Refer Slide Time: 01:34)

So, what exactly do I mean by the memory hierarchy? The first thing is what do we mean by
memory? What we are talking about is the various kinds of storage elements that are available
for storing data on a computer. The first thing that all of you would be familiar with, at least as
long as you have done a basic course on some kind of processor architecture, or even if you have
done something which involves, let us say, programming an Arduino or a microcontroller of
some sort, is that any basic processor, what we call a CPU, is built around one core chip. And one of the main
elements of that chip is a set of registers which are used for storing temporary values.

Now, if you have written a C program and you have declared something like int a, there is a
good chance that a actually just gets stored in a register, unless it is actually necessary for a to
be stored somewhere in memory and you need to have pointers to it and so on. Otherwise, what
will happen is the compiler will decide that, okay, you are only using a for some temporary
computations, let me store it in registers.

Now, what is the catch? The main thing that you need to understand about registers
is they are pretty small in number. You may be able to store typically around 10 and probably at
most around 30 or so values in the registers of any modern processor. Each of those values could
of course be either 32 bits or 64 bits depending on the type of processor, but you can imagine
that this is not really a whole lot if you are trying to write a large program.

The next thing which you may have heard of, especially when you talk about a processor, you
have probably heard the term L1 cache or L2 cache that is the C-A-C-H-E pronounced cache.
Now, this cache is implemented using something called static random-access memory. You do
not need to exactly know why it is called static RAM, although if you are interested then you can
read up more about it. The important thing that you need to understand about cache memory is
that it is limited in capacity, much better than registers, because you are not talking about tens of
bytes or even hundreds of bytes, you can go into several kilobytes, probably on the order of a
megabyte or so.

Now, for those of you who have actually tried running programs on computers, you would know
that a megabyte is not all that much memory. But those of you who have tried microcontroller
programming might actually realize that quite a lot can be done within 1 megabyte. What is the
purpose of a cache?

It basically serves as some kind of temporary store of data. And as you can see, it is fairly limited
in quantity. But it turns out that access speeds to the cache are much faster than to the main
memory, which means it is a, it performs a very nice function, sitting in between the processor
and the main memory and allowing you to run parts of programs faster than they otherwise
would.

So, you have registers, you have SRAM cache, and then you have the main DRAM, the dynamic
RAM in the computer. This is what is typically referred to when you talk about the RAM of your
PC. Let us say you have a laptop with 4 GB of RAM. This is it. It is the DRAM that we are
talking about. You would like it to be as large as possible. Typical amounts these days are between
4 to maybe 16 GB or so. There are of course servers with much more than that, but those are
really fairly high-end servers.

After the DRAM comes what is popularly called these days SSD, solid state disk. Why exactly
solid state? It is sort of historical terminology. The main thing is it is implemented using a certain
kind of flash-based technology. It is non-volatile meaning that even if the power goes off, the
data is retained on that SSD and you can have capacities up to several 100 gigabytes, so
obviously, much larger than DRAM. What is the catch? We will get to that in a moment.

And then you have magnetic disks, the actual storage, the hard disk which is used in PCs. This
is typically, these days, on the order of terabytes. So, you could even have a 10 terabyte disk,
although more common probably is 2 to 4 terabytes on most systems at the moment. Now, you
can go beyond this. There are memory storage technologies that are capable of storing even like
hundreds of terabytes, petabytes of data.

Where do you come across petabytes of data? Let us say you are running the Google Data
Center. You have to store pretty much all the data for search, all the data in every email that is
used by a Gmail user. You are literally running into several petabytes, 10 to the power of 15
order of bytes storage. So, is magnetic disk enough? Are there other technologies that are
needed? How do they actually manage this? Beyond the scope of this course, but very interesting
topics on their own.
(Refer Slide Time: 06:47)

So, now that we know that these are different ways in which you can store data, let us understand
a few parameters that are used in order to actually understand the behavior of these technologies.
The first is the so-called latency. In other words, if I want to read a value from memory, how
long does it take? And lower is better, meaning that the less the latency, the faster the turnaround
time. I asked for data, I get it back.

Registers, like we said earlier, are literally inside the processor core itself. So, the turnaround
time, when you try to read something from a register, is on the order of nanoseconds. From
SRAM, it could be tens of nanoseconds, possibly hundreds of nanoseconds, but still pretty fast.
DRAM latencies can be high, meaning that they could be on the order of several microseconds
even, so clearly much slower than SRAM. SSD would be even slower, hundreds of
microseconds. And with an HDD, even the time to get the first byte out of it is on the order of
milliseconds, at least, maybe tens of milliseconds, if not more.

Now, a millisecond might sound fast, but when you compare a millisecond with a nanosecond,
you realize that is 10 to the power of 6 order difference. And that makes a huge difference when
you are actually trying to run an application fast. So, clearly, from this, we can see that registers
are the fastest. But we already saw that capacity-wise there was a problem. The other
parameter to consider is the so-called throughput, which is the number of
bytes per second that can be read out of a system.
Now, throughput is usually not even considered for registers and SRAM, because the capacity is so
small that you cannot really talk about megabytes per second when you are reading only a few
kilobytes. So, instead, at least for DRAM, solid state drives and hard disks, you can clearly say
that there is sort of a hierarchy: DRAM is faster than solid state disks (SSDs), which are in
turn faster than magnetic HDDs.

Then comes the density. How many bits can be stored per unit area or more importantly per unit
cost, because area can be misleading. You might say that DRAM has extremely high density
because literally it is taking a fraction of a micron to store a bit, but creating large DRAMs,
which can store like hundreds of gigabytes or even terabytes of data is going to be very, very
expensive.

So, from that point of view, volume manufacturing becomes very important, and we see that the
HDDs are the most cost efficient followed by SSDs, followed by DRAM, followed by SRAM,
followed by registers, which in some ways also explains why the number of bits stored in each of
these follows this trajectory.

(Refer Slide Time: 09:50)

So, now, why is all this important to know? Because at the end of the day, you are using a CPU,
and from the point of view of computer organization, you would like the CPU to have as
many registers as possible so that your programs run fast. Those registers are usually
backed by various forms of SRAM cache memory, that in turn is backed by several gigabytes of
DRAM working memory, some of that then spills over to SSD drives, and finally everything is backed
by HDD for high capacity.

Even the HDD is not the end of the story. Ideally, what you want is to have some way by which
you can be sure that you do not lose data, which means that even what is stored on HDD finally
goes out to long-term storage, backups or some other form of data archival, where you can be
reasonably sure that you are not going to lose the data, no matter how long it is sitting over there.

(Refer Slide Time: 10:48)

There is such a thing as cold storage. And you might have heard of things like Amazon Glacier;
Google Cloud and Azure also have their own variants of archive storage classes. And the
main point over there is that we are talking about backups of data. Let us say that you have all
your emails that you have sent, various files, photos that you have clicked and so on, you do not
want to lose them. What if your phone crashes or your PC crashes?

You upload them all onto Google Drive. But how does Google make sure that it does not lose the
data? It cannot just put it on one drive over there and say your data is safe, do not worry about it.
Obviously, a hard disk can crash. So, they have to have backups, which allow them to restore
data.

This is going to be huge amounts of data. The petabytes is just the beginning of the story. I mean,
this is something which never gets deleted. So, it is only building up with time, which means that
you need archival storage, things which can store for long periods of time, at the same time huge
volumes of data.

Now, the decision that is usually taken, and this is why it is usually called cold storage, is that it
will be put onto forms of storage where there is no guarantee on how fast you can get the data
back. You might have to wait for minutes or sometimes even hours, because Amazon Glacier
sometimes says that retrieval from Glacier could take 48 hours. But still, the point is the data
is there and it is safe.

Why is it safe? Because they have put it through so many levels of redundancy and stored it
literally in safekeeping vaults, which are then periodically checked, which means that if you
need to pull out one particular piece of data, you might have to wait for quite a while before it can
actually be done. They have very high durability and very low cost for the volumes that are being
stored.

(Refer Slide Time: 12:40)

Now, why is all this important to you as an application developer? You have an application you
are building, it has a database, it stores information; let us say it is a social network or social
media kind of application. It has information about users. It has something about what users like.
It has various other connections that you have built up. And at the same time, you also want to be
able to store relationships between those entities that you have created.
Or even if it is not social media, let us say, it is even the NPTEL website, we have records out
here, or the online degree website, the NPTEL website already has records of several lakhs of
students who have taken courses. And each of those students' grades need to be stored, their
grade cards have to be available.

During which semester did they take which course? All of their information needs to be kept
safe. This starts building up after a while. Which means that as an application developer, you
need to think about how much storage you need, and what kind of impact the type of
storage that you use is going to have on the performance of your application.

So, let us say that you decide to just store everything in flat files on a disk, you will rapidly run
into performance problems, because just reading and writing from the disk will slow down the
number of requests per second that your server can handle. But let us say that you decide to put
everything in main memory, either your server becomes ridiculously expensive or you need a
large number of servers. And in both cases, what happens if a server crashes? You lose
data because it is volatile memory.

At the same time, some data stores are more efficient for certain kinds of operations. So, you
might have an application like let us say log file analysis, where you are writing a lot of data, not
necessarily reading it very often, some kinds of storage may be more efficient at storing
information like that. Some others may be efficient at reading large amounts of data. So, as a
developer you need to be aware of your choices, you need to be aware of what kind of servers
your system, your application is going to run on and what kind of database to choose for a given
application.

Now, this is of course, the first course on application development. So, we are not going to go
into detail on all of this. What I will be doing is just explaining these at a fairly high level so that
you are aware of these issues. And then we will get on to actually just sort of the different
options that are available to you.
Modern Application Development - 1
Professor Nitin Chandrachoodan
Department of Electrical Engineering
Indian Institute of Technology, Madras
Scaling
Hello, everyone, and welcome to this course on Modern Application Development. All of
that brings us to the topic of scaling. So, what do we mean by scaling in the context of a
database?

(Refer Slide Time: 00:25)

The first thing we need to think about is the notion of replication of data. So, you might want
to have multiple copies of data for different reasons. One of them is so called redundancy and
redundancy is usually used for things like backups, let us say that there is a chance my laptop
fails, I would like to have a copy of all my photos and documents somewhere else, so that
even if the laptop fails, I can retrieve it from somewhere else.

Now, I am not going to be accessing that other location all the time. So, the sort of speed at
which I access it or how efficient is it to access it is not really the issue. What is important is
if one fails, the other should survive.

On the other hand, in the context of things like databases, I might want to do something
called replication of data and replication over here is used purely in the context of
performance, it is not really for the purpose of a backup. So as an example, I might take two
servers, which are sort of identically configured and somehow get the data replicated onto
both of them.
What is good about it? Any read queries, for example, can be answered by either one of them; I
do not care, and the answer is going to be the same. But they are both located in the same data
centre. So, what if the data centre catches fire? They are both lost. So, the redundancy aspect is
not really coming into the picture here. Now, live replication of data like this requires very
careful design of the database, the data structures and various other things as well.

(Refer Slide Time: 02:02)

And once again, going back to NoSQL that we talked about earlier, NoSQL databases often
follow something called a BASE philosophy rather than ACID. BASE stands for Basically
Available, Soft state, Eventually consistent, a classic case of desperately trying to get an
acronym just so that it sounds cool.

But anyway, the important part over here is the eventual consistency and what this means is
that the replicas can take time to reach a consistent state. More important than consistency is
high availability of the data.
(Refer Slide Time: 02:44)

Now, to understand this, let us look at replication in a traditional database.
RDBMS replication is possible, of course, in PostgreSQL and MySQL; Oracle, of course,
has been doing it for decades. Usually, the way they do it is to have a cluster of servers, all in
the same data centre, connected with a high-speed network, and it is used for the purpose of
load balancing. Replicating across a geographically distributed framework is harder to do in real
time, because there are latency constraints and they affect how you can guarantee
consistency.

(Refer Slide Time: 03:27)

Now, this sort of leads us to a situation where we have two different notions of how you can
scale when your application becomes bigger. The traditional approach is to use something
called scale up, and this is what Oracle, for example, excels at: you use larger machines with
more RAM, faster networks, faster processors.

But every time you want to scale up, you want to change the infrastructure, or you are
running out of memory, you pretty much have to restart the system. Some of the servers, of course,
have been tuned even there. So, for example, they have a mechanism whereby you
can hot swap, meaning that you can add more disks while the system is still running, you
might even be able to add memory to the system while it is still running, and they even have
customised operating systems that are able to benefit from that.

A simpler approach, in some sense, is scale out. Can I just make more and more copies of the
server and bring them up all over the place? I run out of space on one server? Fine, add one
more machine and somehow let it replicate and follow along with the queries. This is
especially well suited to the so-called cloud model.

Because what Google or AWS then does is, in their data centres, they have a large number of
machines. They create VMs, virtual machines, for you, each of which has a particular
specification. You have run out of memory on one of them? Do not even bother with
restarting it; just create one more and add it to the pool of servers that you have. So, scale out
is very well suited to the modern cloud-based approach.

The problem is, it is not particularly good for ACID kinds of databases, because you have that
consistency problem to maintain, and that is sort of the key to why NoSQL is really popular in
these kinds of applications. As long as you can accept this notion of eventual consistency,
you can make use of this kind of scale out, which means just adding more servers, which can
be done at runtime pretty fast; Google App Engine or Google Compute Engine can do this
automatically for you. It means that your application can scale very neatly, even as the
number of users suddenly starts increasing.
(Refer Slide Time: 05:51)

Now, of course, like I said earlier, this is highly application specific. In particular, for financial
transactions it is simply not acceptable; they cannot afford even the slightest inconsistency,
which is why you will find that financial applications are probably the last ones to
move towards the cloud. They need all kinds of other guarantees, and even in the cloud, they
will need specific kinds of structures where they own the entire machine on which the
information is being processed.

But on the other end, for typical web applications, social networks, media, eventual consistency is
just fine; slight delays are not going to make any real difference. Now, even in ecommerce,
interestingly, let us say Amazon or Flipkart, only the financial part needs to be in an ACID-
compliant database; the list of items that are available for a given search could just as well be
in some kind of a NoSQL database, scaling out immensely.

Let us say that, unfortunately, you click on something and then it turns out that it has already
been sold out: bad luck, they just ran out of these items before you clicked on the buy
transaction. But once you have committed, you enter your credit card and press go,
there should be no way that it backs out and says we took your money, but we cannot
send you the item. So, there is always some kind of a trade-off that you need to be aware of
when you are building an app like this.
Modern Application Development 1
Professor. Nitin Chandrachoodan
Department of Electrical Engineering
Indian Institute of Technology Madras
Security of Databases
Hello, everyone, and welcome to this course on modern application development. Now, the
last part of this session on databases relates to so-called security, and this is also relevant
to the construction of an app itself.

(Refer Slide Time: 00:30)

Now, in the context of an application, you ultimately are querying something from a database,
and this could be a so-called non-MVC app, because after all, remember, MVC is only a guideline.
It is not enforced; it is not that you can only build web applications with the MVC
framework. You could just as well have a PHP script which directly queries from a database
and draws something on the screen. You could have SQL queries anywhere.

In MVC, of course, ideally, you want the SQL to come not directly from
the controller, but only from the model. So, the controller constructs the
queries and sends them to the model, and the model is the part that actually handles
communicating with the database. So, if at all there is SQL involved, it is going to be only in
the model part, not really even in the controller section. So, the question becomes: what is
dangerous about an SQL query?
(Refer Slide Time: 01:31)

So, let us take an example. This is what a typical HTML form could look like. Basically, you
have an input for name and an input for password, and it would get rendered something like what is
shown out here. Which means that, finally, you would go and type some
information for the username, type some information for the password, and either
hit enter or click on a submit button.

(Refer Slide Time: 02:01)

What happens once that is done is that you will typically have some code inside your script;
let us say you are using Python, you might take the form request values for "name" and
for "password", and you are going to construct a query out of them. Now, let us say that your
query was just being built by stringing things together: use the Python + operator and just
concatenate strings. It seems like a reasonable thing to do, but it is potentially very bad. Why?
Because you have just directly taken whatever was sent in through the form and you have put it into an
SQL query.

Now, who has access to the form? Pretty much anyone who has access to that website. Not just
that, you cannot even be sure that someone actually went to the form and typed it in; they
might just construct an HTTP request directly using curl or something like that and send
it to your server. The server has no way of knowing what the client was actually doing. That
is the whole point of HTTP being stateless.
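To make the danger concrete, here is a small hedged sketch in Python with SQLite; the table, the attacker's input and the concatenated query mirror the example discussed below, and the parameterized version at the end is the standard way a driver keeps input from rewriting the query:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pass TEXT)")
conn.execute("INSERT INTO users VALUES ('abcd', 'secret')")

name = '" or ""="'       # what an attacker might type into the form fields
password = '" or ""="'

# DANGEROUS: string concatenation lets the input become part of the SQL itself.
query = ('select * from users where name="' + name +
         '" and pass="' + password + '"')
print(query)   # the WHERE clause now matches every row in the table

# Safer: a parameterized query, where the driver treats the input purely as data.
rows = conn.execute("select * from users where name=? and pass=?",
                    (name, password)).fetchall()
print(rows)    # [] -- no user is literally named '" or ""="'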

(Refer Slide Time: 03:07)

So now, what happens if I just type in abcd and, let us say, I had put in pass over
here for the password? It would basically do select * from users where name="abcd" and
pass="pass"; it just interpolates both of those. Perfect, a nice safe query.
(Refer Slide Time: 03:24)

But what happens if I actually typed in some stuff like this: a double quote, then or, then
double quote, double quote, equals, double quote? It looks messy; why am I even typing something
like this? But look at what happened: I have now got a query which says select * from users
where name equal to blank or blank, and pass equal to blank or blank.

What does this name="" or ""="" mean? It effectively resolves into a condition that is always true,
because it is not even going to look for the name, and what it will end up doing is just selecting
* from users: the blank-or-blank part of it just turns out to be a
true statement, and that is it. So, you would end up getting all the entries from the database.
So, the result is that you have leaked information: something which was supposed to
stay in the database, which someone else was not supposed to be able to see, has come out as a
result of this query.
(Refer Slide Time: 04:34)

Now, even worse: let us say that you did not even have the quotes and so on
over here; you just straightaway do name equal to this plus the name string. What happens if I give an
input like this within my box over here: a, semicolon, drop table users, semicolon? The query that is
constructed is select * from users where name = a; drop table users. Separated by the semicolon, these
will be treated as two separate SQL queries and both will run. The result is total destruction of your
data. What is the problem?

The problem was that this name, this parameter, came in from outside and you did not
validate it; you did not check in any way whether it had these extra characters. All
these semicolons were the root of the problem; that is what allowed somebody to
construct two queries like this. And as a result, something was constructed
which should not have been possible.
(Refer Slide Time: 05:36)

Now, the parameters from the HTML were taken without validation. Validation should
go through multiple checks: first of all, are they valid text
data, with no special characters or other symbols? In particular, nothing like semicolons or
quotation marks; they should not be there, or if they are, they have to be escaped out
into something which looks sufficiently different that the SQL query will not get confused.

You do not want any punctuation or other kinds of invalid input and you could also have an
extra level of validation, which actually checks I mean let us say that it is supposed to be an
email, does it look like an email, does it have one front part, an add symbol @ symbol which
looks like a domain name after it?
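As a rough illustration of these shape checks, here is a small sketch. The regular expression is deliberately simple and only checks the overall shape of an email address, not full RFC compliance, and the date format is just one possible convention:

    import re
    from datetime import datetime

    def looks_like_email(value):
        # a front part, an @ symbol, and something that looks like a domain name
        return re.fullmatch(r"[^@\s]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", value) is not None

    def looks_like_date(value):
        # expect an ISO-style date such as 2021-12-13
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return True
        except ValueError:
            return False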

If it is supposed to be a date, does it look like a date? All of this validation must be done just before the database query. What do I mean by that? You have to be very careful not to assume that, just because your web page sits behind some secure firewall and you have told people to enter the correct information, the data you get will be correct. Somebody might still figure out a way to send a request directly to your server without going through the form that you created. And, as we will see later, you might have JavaScript that is being used for validation; but if the person has constructed the request directly, they bypass the JavaScript altogether and create a query that comes into your server and destroys data on it, and you have no way of stopping it.
(Refer Slide Time: 07:18)

So, web application security, in other words, has many different aspects; in fact, it is probably the subject of a course in itself. What I described over here is what is called SQL injection, where someone injects invalid or malicious constructs into the SQL query. The best way to get around it is to use known frameworks and good, solid validation that has already been provided and, usually, tested in the field. You might write your own validation, but there is always the chance that you have missed something out, so it is better to rely on what has been tested.

That is one of the chief reasons for using known frameworks: even though you feel that you might know what kinds of problems are out there, there might be something else that gets discovered later. It will get fixed in the frameworks first, because a lot of people are using them, whereas you might miss it and not really be able to handle it in your app.
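As a contrast with the unsafe sketch earlier, the standard defence that frameworks and database drivers provide is the parameterised query, where placeholders keep the data separate from the SQL text. Here is a minimal sketch with sqlite3, using the same hypothetical table and column names as before:

    import sqlite3

    def login_safe(name, password):
        conn = sqlite3.connect("app.db")
        # the ? placeholders are filled in by the driver, so quotes and
        # semicolons in the input are treated as plain data, not as SQL
        rows = conn.execute(
            "SELECT * FROM users WHERE name = ? AND pass = ?",
            (name, password),
        ).fetchall()
        conn.close()
        return rows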

There are other things called buffer overflows and input overflows, which are basically related to the length of the query, the URL, and so on, and which can sometimes even crash clients. So, specific types of clients might crash because of this, or specific types of servers might crash. What about the protocol implementations of servers? Are they able to accept any kind of request? What happens if I inject some unknown characters into the middle of a request which is being sent to an HTTP server?

And one thing that I have not actually mentioned here: remember the character sets and all the encodings that are used. The reason we have character sets is, of course, so that we can use different languages, such as Hindi or Tamil. But the problem is that I can also have characters from different character sets that look very similar to things I am already familiar with, but which can actually end up crashing a server.

So, all that validation has to be done right at the last step, before the data hits the database, so that you are sure that only valid data is going to get inside the database. The possible outcomes can be serious: you can lose data, you can expose sensitive data, or you can even change data without knowing it, and all of these are potentially deadly to whoever is running the app. Let us say that you are leaking credit card information or changing a person's date of birth: all of those can have very drastic effects on other things that you may not even think about at the time.

(Refer Slide Time: 10:01)

Now, a word about HTTPS. I am not really going to spend too much time on it, because at this point in time HTTPS is something you should think of as automatic: pretty much any app or any web server should use HTTPS. The reason is this. Let us say that you have a server and a client, and there is a pipe of data flowing between them: the client sends requests to the server, and the server sends back responses to the client. You can think of it as a pipe with data flowing across it.

Now, in pure HTTP, a third party can tap into this somewhere. It could literally be a tap on the wire, or it might be that the traffic is passing through some intermediate router and I am able to look at it there: the packets are there, and I just reassemble the packets at one end and see the data going back and forth. This means that a person who is tapping over here can see all the information going between server and client. What HTTPS does is provide a secure layer (historically called the secure sockets layer, SSL, now TLS) which essentially makes this tap impossible. Not literally impossible, but mathematically very, very hard to do; that is probably the better way to put it.

And, effectively, what it is saying is that once HTTPS is in place, with the right kinds of protocols, servers, and clients, it is extremely difficult for a third party to tap into this and know what communication is happening between the client and the server. There can be instances where you do not care about that, but by default it is usually safer to assume that HTTPS is the better way to go.

Because the moment you have anything as sensitive as a password, it has to be kept safe. In fact, nowadays you will notice that even browsers have extensions or variants called HTTPS Everywhere, which try to switch you over to an HTTPS version of any website, because the assumption is that, by default, you should not be using plain HTTP.

Even by accident, you do not want to be leaking information. How does HTTPS work? There is a lot of theory behind it, but the important part is something called a server certificate, which is what results in that small lock or green sign on the URL bar. That server certificate has essentially been verified by some third party who is trusted both by the server and by you, where "by you" means by whoever created your operating system or browser.

It is very difficult to spoof, based on mathematical properties which ensure a very, very low probability of accidental mismatches. But the important thing to remember over here is that HTTPS only secures the link over which data is transferred; it does nothing about the data that goes through it.

So, in particular, it does not perform validation, safety checks, or anything of that sort. All you can say is that nobody can tap into your information; they can still feed you junk. One problem with HTTPS is that, previously, when plaintext data was being sent back and forth, intermediate proxies could sit in between and say: this is what you are asking for, I already have it, let me give you back the data.

Now, the intermediate proxy, by definition, cannot see what is inside the request that you are sending, so it cannot send you back the information even if it has it. So, HTTPS has a fairly big negative impact on the cacheability of resources like static files, and it also adds some overhead on performance itself.

(Refer Slide Time: 14:00)

So, to summarise this part on security: internet and web security are complex beasts, pretty much enough for a course in themselves. Right now, as an application developer, the main thing you should be doing is, as far as possible, using known frameworks with trusted track records.

But also, be aware of where your application is going to run. Are you running it on your own server, or on something like Google App Engine? If it is something like App Engine, you are slightly better off, because they are taking care of all the problems of running the server, maintaining the security patches on it, and preventing things like buffer overflows and attacks of that sort.

So, you then have to concentrate only on your own validation, so that your database does not get messed up. As an app developer, in other words, you should be aware of all the possible problems that arise from the code that you have written, but you also need to be aware of problems at other levels of the stack. By stack, what I mean is that your application is finally running on top of some kind of server or language interpreter, which is running on an operating system, which is running on some kind of hardware, which is sitting in a data centre.

So, there are many, many different stages before your clients or users actually hit the application that you have written. At this point, the main thing to keep in mind is that there are such things as SQL injection and various other kinds of attacks that you need to be aware of, and you must make sure they cannot happen in the code that you write. But the rest of it is also something that you need to keep an eye on: you need to know where you are running your application, so that it is actually going to be safe.
Modern Application Development – I
Professor Nitin Chandrachoodan
Department of Electrical Engineering
Indian Institute of Technology Madras
SQL vs NoSQL

Hello, everyone, and welcome to this course on modern application development.

(Refer Slide Time: 00:16)

So, all the discussion that we have had so far applies mostly to so-called traditional relational database management systems, or RDBMSs, because we have been talking about table-structured data. But you are very likely to have come across the term NoSQL at some point. So, what I want to do here is go over alternatives to SQL and these kinds of databases, and hopefully also make it a bit clearer why exactly we call something NoSQL and what the implications are.
(Refer Slide Time: 00:47)

So, first of all, what is SQL? It stands for Structured Query Language. It is used to query databases that have some kind of structure in them, but the query language as such is not tied to a given database; it could even be used for querying spreadsheets, for example. If you look at Google Sheets, there is a variant of SQL that can be used to make queries inside a Google Sheet.

Now, having said that, SQL is typically closely tied to RDBMSs, relational database management systems, where data is stored in tables with the values in columns or fields of the table, and there can be some tables that hold relationships between other tables. The important point here is that every entry in a given table must compulsorily have the same set of columns. That is pretty much the definition: it has a uniform table structure.

So, why are these kinds of tabular databases used? They are very good at storing data when you have prior knowledge of the data's structure and size, and you can also do efficient indexing. You can pick one column out of the different columns in the table and say: index on this. The database engine creates an index, and thereafter any searches on that particular column are going to be fast, faster than they would have been without an index.
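For example, through Python's sqlite3 module an index on a single column can be created like this; the table name, column name, and index name are hypothetical:

    import sqlite3

    conn = sqlite3.connect("app.db")
    # after this, lookups that filter on the email column can use the index
    conn.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users (email)")
    rows = conn.execute("SELECT * FROM users WHERE email = ?", ("someone@example.com",)).fetchall()
    conn.close()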
(Refer Slide Time: 02:20)

Now, what are the problems with tabular databases? They are structured, which is usually good, but can also be bad. As an example, let us say you are trying to create a database to hold student information. At least in IIT, a student could either be someone who stays in the hostel, in which case you need to know which mess they go to, what their mess fees are, and whether they have registered; or they could be a day scholar, in which case they are probably coming in from outside using their own vehicle and need to have a gate pass for it.

Now, in a regular tabular database, the student table would need to have one column which indicates the mess and another column which indicates the gate pass. If half the students are hostellers and half are day scholars, then half the entries in the mess column are null and half the entries in the gate pass column are null.

So, is this really an efficient use of space? Obviously not. But what can you do about it? The whole idea of a tabular database means that I have to have the same set of columns for every entity. This is sort of what got people thinking about other alternatives for how to store data.
(Refer Slide Time: 03:41)

And a number of alternatives have been proposed. One of them, for example, is the so-called document database. A document database says that a document is essentially some free-form information, but with structure; it is not that it is unstructured. On the right here, I have an example of a possible document. This is basically taken from Amazon's example for a document database; it is a JSON file, or a JSON object, over here.

It has two entries: this is one entry, and this is the second entry. The year is common to both of them, the title is also common to both, and in fact the info field is common as well. Within the info we find that the plot is also common to both, and the rating also happens to be common. But what about all these other entries? One of them has them, the other does not.

Now, a document database basically says that this is perfectly fine. I can have different
entries being stored as documents inside this database, and those documents are essentially
JSON objects, which means that any search in it will ultimately return a JSON object
corresponding to a particular document.

How exactly do I index on it? That is a different story. You would probably say maybe you
can index on the document.info.rating and then it would go through and find out all the
documents that have an entry for document.info.rating and index on those, and other things
that do not even have that information would not even show up in that index.

So, that is pretty much all there is to it. You are storing data similar to what you would in the other, structured database, with the difference being that you no longer have tables of fixed shape; you are storing arbitrary documents. How do your searches become efficient? Primarily by the way that you construct indexes. Just as you could construct an index in the previous case, here too you can take one particular key in the JSON object and decide to construct an index on it if you want to make the searches fast.

So, these are actually very popular these days. MongoDB is probably one that you might have heard of at some point; it is also probably one of the most popular document databases. There are alternatives, and Amazon provides something called DocumentDB for this.

You will notice, by the way, that Amazon provides alternatives for every kind of database, because that is their job: Amazon Web Services is ultimately trying to provide solutions for whatever it is that you are looking for. So, MongoDB is an example of a document database, and you can see where it differs from a traditional tabular database.

And the advantage over here is that, going back to the previous case of students, I could have a student.hosteller.mess or a student.dayscholar.gatepass, and they would be completely independent of each other. One student would have a hosteller substructure, the other would have a day scholar substructure, and the corresponding entries would be available there.
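To see how that might look, here are two such entries sketched as JSON-like Python dictionaries; all field names and values are invented for illustration, and each document carries only the sub-structure that applies to it:

    hosteller = {
        "roll_no": "EE21B001",
        "name": "Student A",
        "hosteller": {"mess": "Himalaya", "mess_fees_paid": True},
    }

    day_scholar = {
        "roll_no": "EE21B002",
        "name": "Student B",
        "dayscholar": {"vehicle": "TN 09 AB 1234", "gatepass": "GP-4521"},
    }
    # neither document needs a null placeholder for the fields it does not use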

(Refer Slide Time: 07:19)

Now, there are other ways of storing data. One is called key-value. In Python, you have come across the notion of dictionaries. What is a dictionary? It basically means that I can have something like an associative array where I can say, for example, a["apple"] and give it the value red. What is happening is that "apple" is used as a key to find some particular location, and in that location I store the value red.

Now, one way this is done is by using something called a hash table, which we already discussed earlier. There are other ways; ultimately, all you need is to be able to search for a particular location and then put something there, and that search could also be done using other techniques, such as binary search trees.

So, for example, the ordered map in C++ (std::map) is actually implemented as a form of balanced binary search tree, not as a hash table, but the point is that it gives an efficient implementation of insertion and lookup. A key-value store maps a key to a value; such stores are very efficient at key lookup, but not very good for range-type queries. Once again, you might have come across the term Redis. Redis is usually used as a very efficient key-value store. There is also memcached, which is used for something similar, and BerkeleyDB, which is much older but has a similar kind of behaviour.

Now, one thing that is usually done over here is that most of these are implemented in so-called in-memory form, meaning that you do not really want them to be stored on external disk. They really have their value when they sit completely in the memory of the system and can respond to queries very fast.

So, for example, suppose I just want to keep a list of students and record whether they have completed a course or finished enrolling. If I keep that in Redis, the query becomes almost instantaneous: I give the student ID, and it tells me yes or no, whether they have registered. I do not need to go to the main database and look something up; it is much faster. But the limitation is that it is only a lookup based on an exact key.

So, if I have that kind of thing, they are very fast, much faster than any other kind of database
that you can think of, but they have sort of that limited use case. They are very often used
alongside other databases. So, that is what you will notice in practice.
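A small sketch of that enrolment-lookup idea, using the redis-py client and assuming a Redis server is running locally; the key naming scheme here is just an example:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # record that a student has completed enrolment
    r.set("enrolled:EE21B001", "yes")

    # later, an exact-key lookup answers the question almost instantaneously
    if r.get("enrolled:EE21B001") == b"yes":
        print("Student EE21B001 has enrolled")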
(Refer Slide Time: 09:47)

There is something called a columnar store. The idea here is that in a traditional database, if I had a table, I would take all the entries corresponding to one row and store them together. So, this entire row would form one block somewhere on disk, and the next row would form another block somewhere else, which makes it easy to retrieve all the entries corresponding to a given row.

But let us say that I have a different kind of query structure where I want to go and say, can
you give me all the people born on a given date. In which case, what I am looking for is
basically all the entries corresponding to a given column. Now, there are databases that sort
of specialize in that. They just change the way that data is stored, so that they are better at
answering these kinds of queries. Those are called column stores.

There are variants on this, also called wide column stores, where a large amount of data is stored inside a single column. If you actually need the details, you have to go inside and look up the entries within the column to figure out what actually matches your query.
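The difference in layout can be sketched very roughly in plain Python: a row store keeps each record together, while a column store keeps each field together, so a query such as "everyone born on a given date" only has to scan one column. The data below is invented:

    # row-oriented: one tuple per record
    rows = [
        ("Asha", "1999-03-12", "Chennai"),
        ("Bilal", "1999-03-12", "Mumbai"),
    ]

    # column-oriented: one list per field
    columns = {
        "name": ["Asha", "Bilal"],
        "dob":  ["1999-03-12", "1999-03-12"],
        "city": ["Chennai", "Mumbai"],
    }

    # "all people born on a given date" only needs to scan the dob column
    matches = [i for i, d in enumerate(columns["dob"]) if d == "1999-03-12"]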
(Refer Slide Time: 11:01)

And finally, we have this other notion of something called a graph. So, the idea of a graph is
easy to explain. Let us say, there is a person A, and person A knows B, and C, and D. C and
D also know each other, but none of them know B. This is an example of what Facebook is. It
is basically the friend of a friend network or the social network that we are talking about.

There are some people who know lots of others, there are some who do not know many other
people. But now I can sort of try and form a relationship. How do I get from B to C? And the
answer is, go through A. So, if B and C want to talk to each other, maybe they can ask A for
an introduction. So, those of you who have a LinkedIn account, for example, might have
noticed that this is a third-degree connection or a second-degree connection. Ask so and so to
introduce you, and so on. This is exactly what it is doing.

It builds up a graph internally which has all of these connections, and then, based on that, it is able to say: if B wants to talk to C, the best way, rather than just sending C an email and hoping that C will respond, is to ask A for an introduction, and then C is more likely to respond. So, in these kinds of graph structures, the queries you are interested in are not really "give me all people whose name starts with A"; instead, it is "what is the fastest way to get from B to C".

It could also be a map, a world map, and you want to find the shortest path from one place to another, or the cheapest set of flights that will get you to London. Things of that sort are graph-oriented queries, and there are databases that specialize in them. There is something called Neo4j, and there are other databases as well, which come up in different contexts, that specialize in this kind of data storage and retrieval.
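The friend-of-a-friend query itself can be sketched with a plain breadth-first search over an adjacency list; real graph databases use far more sophisticated storage and query languages, but the kind of question being asked is the same:

    from collections import deque

    friends = {
        "A": ["B", "C", "D"],
        "B": ["A"],
        "C": ["A", "D"],
        "D": ["A", "C"],
    }

    def shortest_intro_path(start, goal):
        # breadth-first search: returns the shortest chain of introductions
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for friend in friends[path[-1]]:
                if friend not in seen:
                    seen.add(friend)
                    queue.append(path + [friend])
        return None

    print(shortest_intro_path("B", "C"))   # ['B', 'A', 'C']: ask A for an introduction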
(Refer Slide Time: 13:06)

And one last thing: so-called Time Series Databases. These are very application specific; they are typically used to store some metric or set of values as a function of time. They are useful for things like log file analysis, performance monitoring, and so on. The kind of query that you are likely to face over here will be something like: how many hits were there on this website between, let us say, January and March of this year?

So, it should be able to do a very fast time-based query, or answer something like: what is the average number of requests per second? Time becomes something fundamental. Now, you could store every single event, put everything into one huge B-tree, and just search through it, since time, after all, is ordered.

There is a clear sense of order there, but that approach is not very efficient for this kind of data. There are better ways of storing it, and that is where the so-called TSDB, the time series database, shines. There are many examples of these. And part of the point of a TSDB is that you may not want to store data indefinitely.

Let us say that after one year, I do not really want to store every single thing that happened, every single request that came to my website; I just want an aggregate. I want to know how many queries there were in December, rather than on December 13 between 1 PM and 2 PM. Things like RRDtool, the Round Robin Database tool, were more or less the first to apply these concepts on a large scale.
But nowadays there are many tools such as InfluxDB, Prometheus, and so on, often coupled with search and visualization engines like Elasticsearch and Grafana, which you might come across at various points. Their main job is to help with analytics: they can take in log file data, construct time series databases out of it, and then allow you to run very interesting queries on your data that give good predictors of what is likely to happen, or suggest ways by which you can optimize your application better.
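A very small sketch of the roll-up idea: instead of keeping every individual hit forever, older data can be aggregated into per-month counts. The timestamps below are invented:

    from collections import Counter
    from datetime import datetime

    # raw events: one timestamp per request (normally far too many to keep forever)
    hits = [
        datetime(2021, 12, 13, 13, 5),
        datetime(2021, 12, 13, 13, 40),
        datetime(2022, 1, 2, 9, 15),
    ]

    # roll up into a per-month aggregate, which is all we need for old data
    per_month = Counter(t.strftime("%Y-%m") for t in hits)
    print(per_month)   # Counter({'2021-12': 2, '2022-01': 1})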

(Refer Slide Time: 15:20)

So, after all of this: those of you who have read more about these things would have come across the term NoSQL in the context of most of the alternative data storage mechanisms that I mentioned earlier. What exactly is NoSQL in the first place? It started out as an alternative to SQL.

Pretty much, when they came out with document databases, they said: look, we are able to store data, but we do not need that tabular structure, so this is not an SQL database, this is a NoSQL database. But SQL is just a query language; it can be adapted for any kind of query, including queries on a document database.

So, what you will find these days is that even for something like MongoDB, the query structure that is used is very similar to what is used in SQL: you have select-like operations and various other things of that sort, which are clearly inspired by SQL, if not directly derived from it. So nowadays the term is more appropriately read as "not only SQL".

What they are trying to convey is that you are not limited to the kind of SQL queries that are possible on an RDBMS, and in particular you do not need to worry about things like inner joins, outer joins, and the various other ways of constructing a query. What you should be looking at is: are there other variants of the query language that are more efficient at looking up things in the particular data store being used here?
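As a rough illustration of how close the two styles can look, here is a lookup written against a hypothetical movies collection using the pymongo client, with the roughly equivalent SQL shown as a comment:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["example"]

    # roughly the same intent as: SELECT * FROM movies WHERE year = 2013
    for doc in db.movies.find({"year": 2013}):
        print(doc["title"])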

So, NoSQL, and I want to emphasize this, is not saying "this is not SQL", or that it violates any of the principles of SQL, or merely that it is not an RDBMS. Generally speaking, yes, these systems are not relational, but that is not the point. NoSQL is just saying that a database described this way usually does something different in the way it stores data, which means that the traditional SQL style of querying may not be the best way to handle it.

(Refer Slide Time: 17:33)

And one thing which again comes up in this context is the notion called ACID, which you might have come across in a database course. It relates to one of the core concepts of databases: the transaction. A transaction could be something like "create a new user entry". What does that mean? It means: provide space in the database; put in the user's name, ID, address, phone number, and all of that information; and make sure it has actually been stored in the database.

And ACID essentially refers to four principles. An operation such as this should be atomic: either the entire user entry got created, or the database was left unchanged. It should be consistent, meaning, in the sense used here, that you cannot have two copies of the database holding different information at any given point in time. It should be isolated: one series of transactions should not interfere with another series of transactions happening at the same time, and the result of a transaction should be independent of whether something else happened in between or not. And finally, it should be durable, indicating that it actually gets saved to permanent storage at some point.

Now, of all of these, the consistency part is probably the hardest one for a lot of databases to manage, especially when you want to start scaling, that is, when you want to increase the number of entries or increase the number of servers that answer queries. What happens is that many NoSQL databases sacrifice some part of ACID; for example, they use something called eventual consistency, instead of strict consistency, in order to get better performance.

That is not to say that NoSQL means not ACID compliant. There can be ACID-compliant NoSQL databases: I can take a document store and say that I still need the ACID properties on it, and there is nothing preventing that. So, it is not that a NoSQL database necessarily violates ACID, or that anything which violates ACID is a NoSQL database; you need to be a little careful about judging what is or is not a NoSQL database.

Unfortunately, the term also suffers from a bit of buzzword hype, meaning that people want to call something a NoSQL database just so that others will look at it. What is more important is: what does it actually give you that is beneficial?

(Refer Slide Time: 20:08)


So, why do people even want to break the ACID rules? After all, they sound like a good idea. The main reason is that consistency is hard to achieve, especially when you want to scale up or scale out to multiple servers, and there is another principle called eventual consistency which is much easier to meet. To take an example, let us say that A and B are two people, one located in India and the other in the US, and they both add C as a friend on Facebook. The order in which this actually gets reflected in Facebook's database does not matter too much.

So, it might be that, for a short amount of time, someone looking at the page would see that C is a friend of A but does not show up on B's friend list, while somebody else might see that C shows up on B's friend list but not on A's. After some amount of time, C will show up on both; that is what eventual consistency is. It does not guarantee how fast; there may be guarantees of the form "within so much time", but it is definitely not going to be instantaneous.

And the whole idea is that this is, in general, not a catastrophe. Of course, nowadays you cannot say that for sure; people might get upset that they appear on one person's friend list but not the other's, but that is not the problem eventual consistency is trying to solve.

Now, on the other hand, a financial transaction where I debit 100 rupees from A's account and credit it to B's account has to be absolutely ACID-compliant. Consistency is paramount there: at no point should I be able to look into the database and see that 100 rupees was deducted from A but not added to B. The money has to be somewhere, and that has to be very clear. Especially for financial transactions, therefore, people never really go for things like NoSQL. On the other hand, a lot of web applications these days do not require this kind of strict consistency, and that is why NoSQL is becoming popular now.
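A sketch of why these guarantees matter for the debit-and-credit case mentioned above, using sqlite3 transactions with an assumed accounts table: either both updates are committed, or a rollback leaves the database untouched, so a reader never sees A debited without B being credited.

    import sqlite3

    def transfer(conn, from_acct, to_acct, amount):
        try:
            with conn:   # the with-block commits on success and rolls back on any exception
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, from_acct),
                )
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, to_acct),
                )
        except sqlite3.Error:
            # the rollback has already happened; no half-finished state was ever committed
            raise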
(Refer Slide Time: 22:12)

Now, a quick word on how data actually gets stored. As far as possible, you would like to store data completely in memory, meaning in the RAM of the system. That is obviously going to be fast, but it does not really scale across multiple machines, for the simple reason that different machines do not share the same physical memory, so I cannot replicate memory as easily as I can let them both share a disk, for example. But if I want to store data out on a disk, remember the high latency associated with disks; it means that I need to change the data structures and organize my data differently.
