Webscraping IMDB Tutorial and Questions Sum24-2
R Saidi
Scrape the IMDB website to create a data frame of information about the top
100 movies of 2019
Use the following URL for IMDB movies of 2019:
https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&count=100
#Specifying the url for desired website to be scraped
url <- 'https://www.imdb.com/search/title/?title_type=feature&release_date=2019-01-01,2019-12-31&count=100'
{xml_nodeset (6)}
[1] <h1 class="ipc-title__text">Advanced search</h1>
[2] <h3 class="ipc-title__text">1. Midsommar</h3>
[3] <h3 class="ipc-title__text">2. Once Upon a Time... in Hollywood</h3>
[4] <h3 class="ipc-title__text">3. The Gentlemen</h3>
[5] <h3 class="ipc-title__text">4. Avengers: Endgame</h3>
[6] <h3 class="ipc-title__text">5. Parasite</h3>
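The node set above is the kind of result rvest returns when a page is queried with a CSS selector. The sketch below assumes the rvest package and the `.ipc-title__text` class visible in the output. It parses a small inline HTML string so that it runs without network access; the names `page` and `title_nodes` are illustrative and are not taken from the original code.

```r
library(rvest)

# Offline stand-in for the downloaded page (three of the nodes shown above).
# With network access, the live page would be read with: page <- read_html(url)
page <- read_html('
  <html><body>
    <h1 class="ipc-title__text">Advanced search</h1>
    <h3 class="ipc-title__text">1. Midsommar</h3>
    <h3 class="ipc-title__text">2. Once Upon a Time... in Hollywood</h3>
  </body></html>
')

# Select every element carrying the title class, then keep only its text
title_nodes <- html_nodes(page, ".ipc-title__text")
rank_title <- html_text(title_nodes)
rank_title
```

The first element is the page heading rather than a movie, which is why non-title entries have to be dropped afterwards.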
#Remove the first and last rows - they are not movie titles
rank_title_data <- rank_title[-c(1,27)]
[1] 25
#should be 25
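Each remaining entry combines the rank and the title, as in "1. Midsommar". One possible way to separate them with base R is sketched below. The name `rank_data` is an assumption, and the pattern only strips the leading number and its period, so titles that contain periods or are themselves numbers are left intact.

```r
# Three entries in the scraped "rank. title" format
rank_title_sample <- c("1. Midsommar",
                       "2. Once Upon a Time... in Hollywood",
                       "6. 1917")

# Rank: the digits before the first period
rank_data <- as.numeric(sub("^(\\d+)\\..*$", "\\1", rank_title_sample))

# Title: everything after the leading "number. " prefix
title_data <- sub("^\\d+\\.\\s+", "", rank_title_sample)

rank_data
title_data
```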
[1] "A couple travels to Northern Europe to visit a rural hometown's fabled
Swedish mid-summer festival. What begins as an idyllic retreat quickly
devolves into an increasingly violent and bizarre competition at the hands of
a pagan cult."
[2] "As Hollywood's Golden Age is winding down during the summer of 1969,
television actor Rick Dalton and his stunt double Cliff Booth endeavor to
achieve lasting success in Hollywood while meeting several colorful
characters along the way."
[3] "An American expat tries to sell off his highly profitable marijuana
empire in London, triggering plots, schemes, bribery and blackmail in an
attempt to steal his domain out from under him."
[4] "After the devastating events of Avengers: Infinity War (2018), the
universe is in ruins. With the help of remaining allies, the Avengers
assemble once more in order to reverse Thanos' actions and restore balance to
the universe."
[5] "Greed and class discrimination threaten the newly-formed symbiotic
relationship between the wealthy Park family and the destitute Kim clan."
[6] "April 6th, 1917. As an infantry battalion assembles to wage war deep in
enemy territory, two soldiers are assigned to race against time and deliver a
message that will stop 1,600 men from walking straight into a deadly trap."
[1] 25
#It should be 25
[1] "2h 28m" "2h 41m" "1h 53m" "3h 1m" "2h 12m" "1h 59m"
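The runtimes arrive as text such as "2h 28m" and need to be converted to minutes before they can be summarized. The following is a minimal base R sketch of one way to do that; the helper names `extract_number` and `runtime_to_minutes` are hypothetical and may differ from the code that produced `converted_runtimes`.

```r
# Return the number that precedes a unit letter ("h" or "m"), or 0 if absent
extract_number <- function(x, unit) {
  hits <- regmatches(x, regexec(paste0("(\\d+)\\s*", unit), x))
  vapply(hits,
         function(hit) if (length(hit) == 2) as.numeric(hit[2]) else 0,
         numeric(1))
}

# Total minutes = 60 * hours + minutes
runtime_to_minutes <- function(x) {
  60 * extract_number(x, "h") + extract_number(x, "m")
}

runtime_data <- c("2h 28m", "2h 41m", "1h 53m", "3h 1m", "2h 12m", "1h 59m")
converted_runtimes <- runtime_to_minutes(runtime_data)
converted_runtimes
```

Treating a missing unit as 0 lets values such as "2h" or "45m" convert without special cases.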
length(converted_runtimes)
[1] 25
summary(converted_runtimes)
Use a temporary data frame to check that each movie is matched with its runtime
# Pair each title with its runtime so that mismatches or missing runtimes
# are easy to spot
df_1 <- data.frame(Title = title_data, Runtime = converted_runtimes)
head(df_1)
                             Title Runtime
1                        Midsommar     148
2 Once Upon a Time... in Hollywood     161
3                    The Gentlemen     113
4                Avengers: Endgame     181
5                         Parasite     132
6                             1917     119
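To list only the movies whose runtime is missing, the temporary data frame can be subset on `is.na()`. The rows below are made up purely for illustration (the six movies shown above all have runtimes), and `df_check` is a hypothetical name.

```r
# Made-up rows: one movie has no runtime
df_check <- data.frame(Title = c("Movie A", "Movie B", "Movie C"),
                       Runtime = c(148, NA, 132))

# Keep only the rows where Runtime is NA
missing_runtimes <- df_check[is.na(df_check$Runtime), ]
missing_runtimes
```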
[1] " (411K)" " (859K)" " (409K)" " (1.3M)" " (975K)" " (685K)"
[1] " 411K" " 859K" " 409K" " 1.3M" " 975K" " 685K"
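The second vector above is the first with the parentheses removed. A minimal sketch of that step with `gsub()` follows; the names `votes_raw` and `votes_data` are assumptions.

```r
votes_raw <- c(" (411K)", " (859K)", " (409K)", " (1.3M)", " (975K)", " (685K)")

# Drop both opening and closing parentheses; the leading space is kept
votes_data <- gsub("[()]", "", votes_raw)
votes_data
```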
rank title
1 1 Midsommar
2 2 Once Upon a Time... in Hollywood
3 3 The Gentlemen
4 4 Avengers: Endgame
5 5 Parasite
6 6 1917
description
1 A couple travels to Northern Europe to visit a rural hometown's fabled
Swedish mid-summer festival. What begins as an idyllic retreat quickly
devolves into an increasingly violent and bizarre competition at the hands of
a pagan cult.
2 As Hollywood's Golden Age is winding down during the summer of 1969,
television actor Rick Dalton and his stunt double Cliff Booth endeavor to
achieve lasting success in Hollywood while meeting several colorful
characters along the way.
3 An American expat tries to
sell off his highly profitable marijuana empire in London, triggering plots,
schemes, bribery and blackmail in an attempt to steal his domain out from
under him.
4 After the devastating events of Avengers: Infinity War (2018), the
universe is in ruins. With the help of remaining allies, the Avengers
assemble once more in order to reverse Thanos' actions and restore balance to
the universe.
5
Greed and class discrimination threaten the newly-formed symbiotic
relationship between the wealthy Park family and the destitute Kim clan.
6 April 6th, 1917. As an infantry battalion assembles to wage war
deep in enemy territory, two soldiers are assigned to race against time and
deliver a message that will stop 1,600 men from walking straight into a
deadly trap.
  runtime votes
1     148  411K
2     161  859K
3     113  409K
4     181  1.3M
5     132  975K
6     119  685K
rank title
1 1 Midsommar
2 2 Once Upon a Time... in Hollywood
3 3 The Gentlemen
4 4 Avengers: Endgame
5 5 Parasite
6 6 1917
description
1 A couple travels to Northern Europe to visit a rural hometown's fabled
Swedish mid-summer festival. What begins as an idyllic retreat quickly
devolves into an increasingly violent and bizarre competition at the hands of
a pagan cult.
2 As Hollywood's Golden Age is winding down during the summer of 1969,
television actor Rick Dalton and his stunt double Cliff Booth endeavor to
achieve lasting success in Hollywood while meeting several colorful
characters along the way.
3 An American expat tries to
sell off his highly profitable marijuana empire in London, triggering plots,
schemes, bribery and blackmail in an attempt to steal his domain out from
under him.
4 After the devastating events of Avengers: Infinity War (2018), the
universe is in ruins. With the help of remaining allies, the Avengers
assemble once more in order to reverse Thanos' actions and restore balance to
the universe.
5
Greed and class discrimination threaten the newly-formed symbiotic
relationship between the wealthy Park family and the destitute Kim clan.
6 April 6th, 1917. As an infantry battalion assembles to wage war
deep in enemy territory, two soldiers are assigned to race against time and
deliver a message that will stop 1,600 men from walking straight into a
deadly trap.
  runtime votes votes_thous
1     148  411K         411
2     161  859K         859
3     113  409K         409
4     181  1.3M       13000
5     132  975K         975
6     119  685K         685
rank title
1 1 Midsommar
2 2 Once Upon a Time... in Hollywood
3 3 The Gentlemen
4 4 Avengers: Endgame
5 5 Parasite
6 6 1917
description
1 A couple travels to Northern Europe to visit a rural hometown's fabled
Swedish mid-summer festival. What begins as an idyllic retreat quickly
devolves into an increasingly violent and bizarre competition at the hands of
a pagan cult.
2 As Hollywood's Golden Age is winding down during the summer of 1969,
television actor Rick Dalton and his stunt double Cliff Booth endeavor to
achieve lasting success in Hollywood while meeting several colorful
characters along the way.
3 An American expat tries to
sell off his highly profitable marijuana empire in London, triggering plots,
schemes, bribery and blackmail in an attempt to steal his domain out from
under him.
4 After the devastating events of Avengers: Infinity War (2018), the
universe is in ruins. With the help of remaining allies, the Avengers
assemble once more in order to reverse Thanos' actions and restore balance to
the universe.
5
Greed and class discrimination threaten the newly-formed symbiotic
relationship between the wealthy Park family and the destitute Kim clan.
6 April 6th, 1917. As an infantry battalion assembles to wage war
deep in enemy territory, two soldiers are assigned to race against time and
deliver a message that will stop 1,600 men from walking straight into a
deadly trap.
  runtime votes votes_thous votes_in_thous
1     148  411K         411            411
2     161  859K         859            859
3     113  409K         409            409
4     181  1.3M       13000          13000
5     132  975K         975            975
6     119  685K         685            685
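Note that the printed value for Avengers: Endgame is 13000, whereas 1.3M votes corresponds to 1300 thousand, so the decimal point appears to have been lost in the conversion. The sketch below shows one way to convert the vote strings to thousands while keeping the decimal; `votes_to_thousands` is a hypothetical helper, not the code used to produce the output above.

```r
votes_to_thousands <- function(x) {
  x <- trimws(x)
  # Numeric part: the string without its trailing unit letter
  value <- as.numeric(sub("[KM]$", "", x))
  # Factor that expresses the value in thousands:
  # M = 1000 thousand, K = 1 thousand, no suffix = plain count
  multiplier <- ifelse(grepl("M$", x), 1000,
                       ifelse(grepl("K$", x), 1, 1 / 1000))
  value * multiplier
}

votes <- c(" 411K", " 859K", " 409K", " 1.3M", " 975K", " 685K")
votes_in_thous <- votes_to_thousands(votes)
votes_in_thous
```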
Problem 1: Based on the scraped 2019 IMDB movie data frame, create a
histogram that shows runtime on the x-axis. Be sure to provide a title, axis label,
and caption for the data source.
Alternatively, you may create a scatterplot of runtime versus number of votes.
## ggplot
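As a starting point, a histogram with a title, axis labels, and a source caption might look like the sketch below. It assumes the ggplot2 package and uses only the six rows printed earlier, so the data frame name `movies`, the bin width, and the colors are illustrative choices rather than required ones.

```r
library(ggplot2)

# The six movies shown in the output above; the full data frame has 25 rows
movies <- data.frame(
  title = c("Midsommar", "Once Upon a Time... in Hollywood", "The Gentlemen",
            "Avengers: Endgame", "Parasite", "1917"),
  runtime = c(148, 161, 113, 181, 132, 119)
)

runtime_hist <- ggplot(movies, aes(x = runtime)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  labs(title = "Runtimes of 2019 IMDB movies",
       x = "Runtime (minutes)",
       y = "Number of movies",
       caption = "Source: IMDB")

print(runtime_hist)
```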
Problem 3: Among movies with a runtime of 116-135 minutes, which are among the
5 lowest-ranked of the 25?
Again, you must use the filter function to find the exact movies that answer this question.
Be sure to state the rank and runtime for each movie.
# use filter code here
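The general shape of such a filter is sketched below, assuming the dplyr package. It uses only the six rows printed earlier, so the result is not the answer for the full 25-movie data frame; `movies`, `lowest_n`, and `cutoff` are hypothetical names.

```r
library(dplyr)

# The six movies shown in the output above; the full data frame has 25 rows
movies <- data.frame(
  rank = 1:6,
  title = c("Midsommar", "Once Upon a Time... in Hollywood", "The Gentlemen",
            "Avengers: Endgame", "Parasite", "1917"),
  runtime = c(148, 161, 113, 181, 132, 119)
)

# Ranks above this cutoff are the lowest-ranked `lowest_n` movies
# (with 25 movies the cutoff is 20, which keeps ranks 21 to 25)
lowest_n <- 5
cutoff <- nrow(movies) - lowest_n

answer <- movies %>%
  filter(runtime >= 116, runtime <= 135, rank > cutoff) %>%
  select(rank, title, runtime)
answer
```

Separating conditions with commas inside `filter()` combines them with a logical AND, so a row is kept only if it meets all three.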