57.4 - IMDB Dataset - mp4

sql

Uploaded by

NAKKA PUNEETH
Copyright
© © All Rights Reserved
Now we'll look at a data set that we'll use throughout as an example. It's a movie data set: a very interesting and fairly large one that lets us show SQL queries in action. We'll also motivate why each of these queries is needed from a real-world perspective.

To give you that real-world perspective, we chose the IMDb data set. For those of you who don't know it, IMDb is a website owned by Amazon.com, at IMDb.com, and it has a lot of very interesting data about movies. We have gotten some data from IMDb, and it's beautiful data; I'll explain it in a little while. The website itself has a humongous amount of data about almost every major film that is released. You'll be surprised how much information there is: movies in almost all the major languages, including Indian movies, are documented on IMDb. For example, you have the cast and crew, when the movie was released, ratings for the movie, and things like that.
The data set that we have right now is a movie data set collected from IMDb. Just to give you a sense of it, we have data about 388,269 movies, from 1888 up to 2008: from some of the earliest movies up to fairly recent ones, 2008 being just ten years ago. We also have data for 817,718 actors and for 86,880 directors of these 388,000-plus movies. This should give you a sense of the scale: this is truly real-world data, and a nontrivially large amount of it.

So this is the data set that we'll use. Before we load it and write some SQL queries on it, let's understand how it is organized across multiple tables in a relational database. As we discussed earlier, in a relational database we split the data across tables so that we don't duplicate it. So let's look at the structure of each of these tables and understand what the primary keys and the foreign keys are.

I have a table called directors, which has three columns: id, first name, and last name. Obviously, id is my primary key. I'll circle primary keys in red and mark foreign keys in blue; we covered what primary keys and foreign keys are earlier. So the directors table has the ID, the first name, and the last name of each director. The second table is the movies table, in which each movie has a unique ID, so this becomes the primary key.
I have the name of the movie, the year in which it was released, and the rank. Similarly, let's look at actors. For each actor, I have the ID, the first name, the last name, and the gender. Again, the ID is the unique identifier, and hence it's the primary key. So, broadly speaking, we have a directors table, a movies table, and an actors table. Which actor acted in which movie, and which director directed a movie? All that information is in other tables.

We have one table here called roles. The roles table has an actor ID, a movie ID, and a role. Given an actor ID and a movie ID, an actor in a movie will typically have only one role. So the primary key here is the actor ID plus the movie ID: these two combined uniquely determine each row.

But here comes the fun part. Every actor ID in the roles table should be present in the actors table, and every movie ID should be present in the movies table. That's why we have drawn these arrows. The movie ID is a foreign key: the arrow implies that it should be present among the IDs of the movies table. Similarly, the actor ID is a foreign key to the ID of the actors table. Those two IDs are the primary keys of their own tables, while actor ID and movie ID combined are the primary key of the roles table.
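
To make the arrows concrete, here is a minimal sketch of the actors, movies, and roles tables with the keys described above. It is only an illustration under assumptions: it uses SQLite through Python's built-in sqlite3 module rather than the MySQL database used in this course, the column names and types are guesses based on the description, and the rows are made up.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; column names and types are guesses.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled

conn.executescript("""
CREATE TABLE actors (
    id         INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    gender     TEXT
);
CREATE TABLE movies (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    year   INTEGER,
    "rank" REAL
);
CREATE TABLE roles (
    actor_id INTEGER,
    movie_id INTEGER,
    role     TEXT,
    PRIMARY KEY (actor_id, movie_id),               -- the two IDs combined
    FOREIGN KEY (actor_id) REFERENCES actors (id),  -- arrow to actors.id
    FOREIGN KEY (movie_id) REFERENCES movies (id)   -- arrow to movies.id
);
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO actors VALUES (1, 'Test', 'Actor', 'F')")
conn.execute("INSERT INTO movies VALUES (10, 'Test Movie', 2000, 7.5)")
conn.execute("INSERT INTO roles VALUES (1, 10, 'Lead')")


def try_insert(row):
    """Return True if roles accepts the row, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO roles VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


unknown_actor_ok = try_insert((999, 10, "Extra"))  # actor 999 is not in actors
duplicate_pair_ok = try_insert((1, 10, "Cameo"))   # (1, 10) is already a row
print(unknown_actor_ok, duplicate_pair_ok)
```

Both inserts are rejected: the first because of the foreign key, the second because of the combined primary key.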
The roles table tells you which actor acted in which movie in what role.

Similarly, for each movie, given a movie ID, I have the genre of the movie: whether it's comedy, horror, thriller, action, or adventure. (Different people pronounce "genre" differently; I'm not sure of the exact pronunciation.) The movie ID here refers to a movie in the movies table. A movie could have multiple genres, so the movie ID by itself need not identify a row; we'll see which columns form the primary key when we look at the database in MySQL. Similarly, which director directed a movie? Let's not forget that multiple directors could direct a movie. So this table has the director ID and the movie ID. This one is movies_directors, and the previous one is movies_genres.

And we have directors_genres, which has three columns: a director ID, a genre, and a probability. Given a director, let's say Steven Spielberg, and a genre, let's say war cinema, what is the probability that this director directs movies of this genre? Whether it's 0.2, 0.3, or whatever, the probability can be easily computed by seeing what percentage of the director's movies fall in the given genre.
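
As a rough sketch of that computation, the query below derives a per-director, per-genre fraction from movies_directors and movies_genres. Whether the probability column in the actual data set was produced exactly this way is not something we know; the column names are assumed, the rows are made up, and SQLite (through Python's sqlite3) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies_directors (director_id INTEGER, movie_id INTEGER);
CREATE TABLE movies_genres    (movie_id INTEGER, genre TEXT);

-- Made-up rows: director 1 directed four movies; movie 13 has two genres.
INSERT INTO movies_directors VALUES (1, 10), (1, 11), (1, 12), (1, 13);
INSERT INTO movies_genres VALUES
    (10, 'War'), (11, 'War'), (12, 'Drama'), (13, 'Drama'), (13, 'War');
""")

# Fraction = (director's movies in the genre) / (all of the director's movies).
query = """
SELECT md.director_id,
       mg.genre,
       COUNT(DISTINCT md.movie_id) * 1.0 / t.total AS prob
FROM movies_directors AS md
JOIN movies_genres AS mg ON mg.movie_id = md.movie_id
JOIN (SELECT director_id, COUNT(DISTINCT movie_id) AS total
      FROM movies_directors
      GROUP BY director_id) AS t ON t.director_id = md.director_id
GROUP BY md.director_id, mg.genre
ORDER BY mg.genre
"""
probs = {(director, genre): p for director, genre, p in conn.execute(query)}
print(probs)
```

Note that the values for one director need not add up to 1, because a single movie can belong to several genres.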
So these are the seven tables we have. We'll see exactly which columns are primary keys when we go into the database a little later. The foreign keys are simple, and the arrow marks represent them. The director ID in movies_directors and the director ID in directors_genres are both foreign keys to the directors table: the director IDs you have here need to be present in the directors table under the id column. Likewise, the movie ID in movies_directors and the movie ID in movies_genres are foreign keys to the movies table.

In a nutshell, we have seven tables: directors, movies, and actors; roles, which says which actor acted in which movie; which movie belongs to which genre; which movie is directed by which director or directors; and, for each director, the genres that the director typically directs in.

This whole thing is called the schema of the database. In English, schema means the structure or organization of something. In databases, the schema refers to all the tables and all the relationships and interconnections between them: which tables we have, which columns are primary keys, which are foreign keys, and so on. So this is the IMDb schema for our database, which holds IMDb data about movies, directors, and actors.

Given this insight about our tables, this is what we know, and this is quite a lot of data. This is no laughing matter. We will use this database to learn SQL and to connect whatever we are learning to real-world problems, either in website design or in data science.
Some of the queries that we will run through SQL are very important, and I'll try to give you real-world context for them. That's the reason we took the effort to use a real-world data set to explain the concepts in SQL, so that you better understand what is happening. And movies are something most of us watch; if not all movies, all of us certainly enjoy a small set of them. So let's go through this data, learn SQL, and see how simple it is to process this large amount of data that we have split across seven tables. We'll learn that in the rest of this chapter.
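
Because the data is split across tables, answering even a simple question such as "which movie is directed by which director?" means combining tables through their keys. The sketch below gives a feel for what such a query might look like. It is not the course's own code: it assumes SQLite through Python's sqlite3 in place of MySQL, guessed column names, and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE directors (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
CREATE TABLE movies_directors (director_id INTEGER, movie_id INTEGER);

-- Made-up rows for illustration only; movie 11 has two directors.
INSERT INTO directors VALUES (1, 'Ann', 'Example'), (2, 'Bob', 'Sample');
INSERT INTO movies VALUES (10, 'First Film', 1990), (11, 'Second Film', 2001);
INSERT INTO movies_directors VALUES (1, 10), (1, 11), (2, 11);
""")

# Follow the foreign keys: movies -> movies_directors -> directors.
rows = conn.execute("""
SELECT m.name, d.first_name, d.last_name
FROM movies AS m
JOIN movies_directors AS md ON md.movie_id = m.id
JOIN directors AS d ON d.id = md.director_id
ORDER BY m.name, d.last_name
""").fetchall()

for name, first, last in rows:
    print(name, "directed by", first, last)
```

The movie with two directors shows up as two rows, one per director, which is why the director information lives in its own table instead of in a column of movies.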
