57.1 - Introduction To Databases - mp4
Before we learn SQL, which is a very interesting topic, I'll give you a brief overview of what
databases are. Remember, databases is a full-fledged subject in undergraduate and graduate
computer science, so I won't be able to cover everything here, but I'll cover the basics that
we need in order to understand SQL better.
You might have heard names like Oracle Database, MySQL, or Microsoft SQL Server. These are all
very popular databases, and they are often referred to as relational databases. That obviously
raises the question: is there something called a non-relational database? Yes, there are
non-relational databases, and they have become more popular recently. Traditionally, since the
1970s and 1980s, relational databases have been the most widely used databases, and we will
limit this discussion to relational databases. But some of the concepts that we learn about
SQL are also useful if you are using a non-relational database. For example, there are
non-relational databases like MongoDB, which is a very interesting database that doesn't
follow some of the concepts of a relational database. I won't go deep into what a relational
database is, what a non-relational database is, or what the differences are, because our focus
is on learning SQL. Even though SQL was designed for relational databases, the non-relational
database community has also adopted SQL, or variations of SQL, for processing and obtaining
data.

So the big question here is: why databases?
Why use a database? Before we go into this question, let's understand what the alternative is.
Instead of storing my data in a database, why don't I just store it in a text file, like a CSV
file? These are called flat text files, or flat files, because you literally store your data
in a flat structure: in the file you save column one, column two, column three, column four,
and so on, one row after another. So why don't we store our data in a flat text file like a
CSV? Why can't we just use a flat file?

Databases, in a nutshell, are mostly software that helps us achieve a lot of things that
simple flat text files can't. A database gives us a lot of very useful tools. It makes our
life simpler and easier, and a lot of the time it makes access to data faster, more reliable,
and more secure. Suppose I have some data: a database makes access to that data far simpler
and easier through a standard interface called SQL, which we'll learn later. It also makes it
much faster. For
example, imagine I have a CSV file, and let's assume its lines are the rows: row one, row two,
row three, row four, and so on. Suppose I want to obtain all the rows of data where column two
is greater than 100. How do we do it? With a text file, we load the whole file using any
programming language, C, C++, Java, or Python, with its file handling tools. Then we go
through each of the rows to see whether the value in column two is greater than 100 or not.
That's very time consuming. A database uses some very interesting software tricks called
indexing to make this type of retrieval way, way faster.

Also, to retrieve this data from a text file, you need to be a good programmer, because you
first need to load the data in C, C++, Java, or Python, and then you need to spell out how to
actually obtain these rows. You have to write code saying: first load this data, then go row
by row; for each row, check whether the value is greater than 100; if it is, return the row,
or else don't. So you need a good amount of programming knowledge to describe how to retrieve
this data, which many people may not have. A database provides a very simple and easy-to-use
language called SQL, which we'll learn later in this chapter, with which all you have to say
is: I want all the rows where this column value is greater than 100. Literally, that is all
you have to say. The database takes care of retrieving the data for you. You don't have to
know about file handling, you don't have to know advanced programming, you don't have to worry
about whether this table is very large, and you don't have to know concepts in operating
systems. All you have to do is write a single line, literally a single line, of SQL.
Everything else is taken care of for you. So a database consists of lots of interesting
software to make your
life simpler and easier. Obtaining this data is also way faster, because databases build
advanced data structures called indexes, which make it extremely fast to retrieve this data.

It's also reliable and secure. Imagine your hard disk crashes: your whole data is gone.
Databases back up your data in multiple places to ensure that it can be accessed reliably and
securely even if there are hardware crashes, and they typically achieve that using a wide
spectrum of techniques. A very simple technique is this. Suppose you have three hard disks
(databases are usually hosted where there is a lot of data storage). If this table is stored
on one disk, a copy of it is also stored on the second, and another copy on the third. This is
called triplicate storage. Now, if one hard disk fails, that's okay: we have two other hard
disks where the data is readily available. The database software takes care of all of this
internally so that you don't have to worry about it. That's one of the biggest advantages of
using databases over simple flat files. Simply put, a database makes obtaining the data we
want way simpler, way easier, way faster, and way more reliable and secure. That's the reason
why we use databases over simple flat text files.

We have now covered the basics of why we need a database, and the fact that
relational databases are extremely popular. Remember, companies like Oracle make most of their
money through databases, and they are multi-billion-dollar companies. A lot of interesting
startups like MongoDB have become billion-dollar companies by leveraging newer advances called
non-relational databases. Whatever you do, whether you're booking a flight ticket, a train
ticket, or a movie ticket, or even accessing this website, all of that happens through a
database. We use a MySQL database ourselves at the back end of this website. Databases are
probably the most used piece of software that we don't know about. We know about Microsoft
operating systems, we know about the Android operating system, but databases are that piece of
software which most of us don't know but which runs our life, literally speaking. All of your
phones, Android phones and iPhones, actually have a database internally to store your
contacts, because that information has to be stored somewhere. That's how powerful databases
are.

So now let's
go into a simple concept called tables. In a relational database, all of our data is stored
across multiple tables. Let's take a simple example to understand why. We all understand what
a table is. Suppose I have a flat text file, like a comma-separated values (CSV) file. That is
actually a tabular structure. If I'm storing my data in a CSV file, I'm storing values
separated by commas: value one, value two, value three, value four, and so on. You can think
of this logically as a table, with each of the values as a column and each line as a row. So
tables are fundamental to all of the data storage that we do today, whether it's a flat file
or a relational database. But there is a very interesting difference here.

Let's take a very simple example, which we'll use multiple times across this chapter. Let's
assume I have some data about movies. For each movie I have a movie ID, a movie name, and the
year in which the movie was released. For each actor, I have an actor ID, the actor name, and
the gender of the actor.
Now look at this. When I want to store some data, let's say movie ID is 1 and the movie name
is Pirates of the Caribbean. I don't remember the year it was released, so let's assume the
year is 2008. Now, Pirates of the Caribbean has multiple actors. Let's say the first actor's
ID is 101. I don't remember the actors or the character names, so the actor name is just some
name, name one, and the gender is, let's say, male. Since there are many actors in this movie,
I have to repeat the same movie data: 1, Pirates of the Caribbean, 2008. Now there is a second
actor, 102, name two, and let's assume this is also a male actor. If I have ten actors in the
movie, I'm literally repeating the movie data ten times. Of course, the actor details will
change.
The actor ID will change: 103, name three, let's say female; 104, name four, also female; and
so on. While the actor details change across these ten rows, we are simply copying the same
movie details. Just imagine how much space we are wasting by duplicating this data. So instead
of storing all of this in one table, what if I store it in multiple tables? That's a key idea
of relational databases, and it is called normalization. Normalization in databases is a very
wide subject, which I won't go into in detail. I could spend 10 hours just explaining database
normalization. But the core idea is this. In this simple example, there is a lot of
duplication of data. To avoid this duplication, relational databases come up with a very
simple scheme: let's create multiple tables. Instead of storing everything in one table, let's
call this table T, let's break it up into multiple tables.
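
As a rough sketch of the duplication just described, here is the single flat table T written
out as CSV text and scanned row by row in Python, the same kind of file handling discussed
earlier. The column names and the placeholder actor names are assumptions made for
illustration, and 2008 is the assumed year from the example, not a verified release date.

```python
import csv
import io

# Hypothetical flat table T as CSV text. Column names and actor names are
# placeholders; the year is the assumed value from the example.
FLAT_CSV = """movie_id,movie_name,year,actor_id,actor_name,gender
1,Pirates of the Caribbean,2008,101,name one,male
1,Pirates of the Caribbean,2008,102,name two,male
1,Pirates of the Caribbean,2008,103,name three,female
1,Pirates of the Caribbean,2008,104,name four,female
"""


def count_movie_copies(csv_text):
    """Go row by row and count how often each movie's details are stored."""
    copies = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        movie = (row["movie_id"], row["movie_name"], row["year"])
        copies[movie] = copies.get(movie, 0) + 1
    return copies


copies = count_movie_copies(FLAT_CSV)
for movie, n in copies.items():
    print(movie, "is stored", n, "times")
```

With four actors the same movie details are stored four times; with ten actors they would be
stored ten times.
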
Let's say table T1 contains the movie ID, the movie name, and the year of release. Table T2
contains actor ID, actor name, and actor gender. Table T3 contains movie ID and actor ID. I've
broken the data up into three tables. Now let's look at table T1 and its columns. For each
movie, only one row will be present.
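
One way to picture this split is the small Python sketch below. It is only an illustration
under assumptions: the flat rows, the placeholder actor names, and the helper name
split_into_tables are made up here, and plain dictionaries and a set stand in for the three
tables rather than a real database.

```python
# Flat rows of table T:
# (movie_id, movie_name, year, actor_id, actor_name, gender).
# Actor names are placeholders; the year is the assumed value from the example.
flat_rows = [
    (1, "Pirates of the Caribbean", 2008, 101, "name one", "male"),
    (1, "Pirates of the Caribbean", 2008, 102, "name two", "male"),
    (1, "Pirates of the Caribbean", 2008, 103, "name three", "female"),
    (1, "Pirates of the Caribbean", 2008, 104, "name four", "female"),
]


def split_into_tables(rows):
    """Split flat rows into T1 (movies), T2 (actors), and T3 (movie/actor pairs)."""
    movies = {}      # T1: movie_id -> (movie_name, year)
    actors = {}      # T2: actor_id -> (actor_name, gender)
    casting = set()  # T3: (movie_id, actor_id)
    for movie_id, movie_name, year, actor_id, actor_name, gender in rows:
        movies[movie_id] = (movie_name, year)      # same movie overwrites itself
        actors[actor_id] = (actor_name, gender)
        casting.add((movie_id, actor_id))
    return movies, actors, casting


movies, actors, casting = split_into_tables(flat_rows)
print(len(movies), len(actors), len(casting))
```

With the four flat rows above, T1 ends up with one row, while T2 and T3 have four rows each,
so the movie name and year are stored only once.
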
So in this table we no longer duplicate the movie data, and each of the rows can be uniquely
identified using the movie ID. My second table is my actor table. It contains only three
fields: actor ID, actor name, and gender. Here also, for every actor there will be only one
row, and each actor can be uniquely identified using the actor ID. Two actors can have the
same name, two movies can have the same name, multiple movies can be released in the same
year, and of course there are many people with the same gender. So the way to uniquely
identify each row in the movie table is the movie ID, and in the actor table each row is a
unique actor with a unique actor ID.

Now let's come to the third table. In the third table, I store one row for each movie and
actor pair. Let's take the same example with movie ID 1. I'll store (1, 101), (1, 102),
(1, 103), and (1, 104), because actors 101, 102, 103, and 104 all acted in movie ID 1. Here,
neither the movie ID alone nor the actor ID alone uniquely identifies a row, but the
combination of both does. You will not find (1, 101) and then (1, 101) again, because that is
of no use: it gives us no additional information. So both these columns together uniquely
identify each of the rows here.

Now I'll explain what these concepts are called. There is a reason why I'm highlighting these
columns in green. Because the movie ID uniquely determines each of the rows in the movie
table, movie ID is called the primary key of that table.
Similarly, actor ID is the primary key of table two. The primary key of table three is the
combination of both its columns: movie ID and actor ID together constitute the primary key of
that table. So we have three tables. The first table is the movie table, with movie ID as the
primary key, because each row can be uniquely identified using it. The second table is the
actor table. The third table is the casting table, where we record which actor was cast in
which movie. So these are my primary keys.
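
To make the three primary keys concrete, here is a minimal sketch using Python's built-in
sqlite3 module. SQLite is used only because it needs no setup; the table and column names,
the helper try_insert, and the sample rows are assumptions for illustration, and the SQL
itself is covered later in the chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE movie (
        movie_id   INTEGER PRIMARY KEY,
        movie_name TEXT,
        year       INTEGER
    );
    CREATE TABLE actor (
        actor_id   INTEGER PRIMARY KEY,
        actor_name TEXT,
        gender     TEXT
    );
    CREATE TABLE casting (
        movie_id INTEGER,
        actor_id INTEGER,
        PRIMARY KEY (movie_id, actor_id)  -- both columns together form the key
    );
""")

conn.execute("INSERT INTO movie VALUES (1, 'Pirates of the Caribbean', 2008)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [(101, "name one", "male"), (102, "name two", "male"), (103, "name three", "female")],
)
conn.executemany("INSERT INTO casting VALUES (?, ?)", [(1, 101), (1, 102)])


def try_insert(sql, params):
    """Return 'accepted' or 'rejected' depending on the key constraints."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# A second row with movie_id 1 breaks the primary key of the movie table.
duplicate_movie = try_insert("INSERT INTO movie VALUES (?, ?, ?)", (1, "Another movie", 2010))
# The pair (1, 101) already exists, so the combined key rejects it.
duplicate_pair = try_insert("INSERT INTO casting VALUES (?, ?)", (1, 101))
# Movie 1 may repeat in casting as long as the pair is new.
new_pair = try_insert("INSERT INTO casting VALUES (?, ?)", (1, 103))

print(duplicate_movie, duplicate_pair, new_pair)
```

Movie ID 1 can appear many times in the casting table, but each (movie ID, actor ID) pair can
appear only once.
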
But there is also another concept called a foreign key. Here, the two columns of the casting
table are foreign keys. A foreign key basically means that whatever values I have in this
column should also be present in the column it refers to. If you have a movie ID in the
casting table which is not present in the movie table, that is wrong information. So movie ID
in table three is a foreign key with respect to table one, because this movie ID must be
present there; otherwise it's not useful. Similarly, actor ID here is a foreign key to the
actor ID column in my actor table. These two concepts are very important. A primary key
uniquely identifies each of the rows in a table. A foreign key tells us which values can be
present in a column. We will use both of these again.

Remember, I have covered a semester-long subject in a very short duration here, to give you an
overview of a few concepts so that we can understand SQL better. We have understood what
databases are, which databases are popular, where we use databases on a day-to-day basis, and
why we use a database in the first place instead of just storing the data in a simple text
file. Most importantly, we have seen why tables are important, why breaking up data across
multiple tables, or normalization, is important in databases, and how to break data into
multiple smaller tables so that we are not duplicating data, thereby saving space. Given this
background, we are now much better equipped to go into some concepts of SQL and databases.
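
As a closing sketch, the foreign key rule and the "single line of SQL" idea from earlier can
both be tried with Python's built-in sqlite3 module. This is an illustration under
assumptions, not a definitive setup: the table names, column names, and sample rows are made
up, and SQLite checks foreign keys only when the PRAGMA shown below is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, movie_name TEXT, year INTEGER);
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT, gender TEXT);
    CREATE TABLE casting (
        movie_id INTEGER REFERENCES movie(movie_id),  -- must exist in movie
        actor_id INTEGER REFERENCES actor(actor_id),  -- must exist in actor
        PRIMARY KEY (movie_id, actor_id)
    );
    INSERT INTO movie VALUES (1, 'Pirates of the Caribbean', 2008);
    INSERT INTO actor VALUES (101, 'name one', 'male'), (102, 'name two', 'male');
    INSERT INTO casting VALUES (1, 101), (1, 102);
""")

# Movie 2 is not in the movie table, so this casting row is wrong information.
try:
    conn.execute("INSERT INTO casting VALUES (?, ?)", (2, 101))
    orphan_row = "accepted"
except sqlite3.IntegrityError:
    orphan_row = "rejected"

# The single-line request: all rows where the column value is greater than 101.
rows = conn.execute(
    "SELECT actor_id, actor_name FROM actor WHERE actor_id > 101"
).fetchall()

print(orphan_row, rows)
```

The query states only what is wanted; how the rows are located is left to the database.
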