]"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies = soup.findAll('h3', class_='lister-item-header') \n",
"movies[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The findAll method creates a list where each entry contains the HTML that’s captured within the h3 tag and list-item-header class. \n",
"By taking a deeper look at the first movies HTML and see that the movie title can be found under the first a tag.\n",
"\n",
"To capture this attribute we can loop through all movies and either call findAll and grab the first element of the list, or we can use the find method which automatically grabs the first tag it finds. Thus, we can construct a list of all movie titles (with a little help from list comprehensions for efficiency) through:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Jai Bhim',\n",
" 'The Shawshank Redemption',\n",
" 'The Godfather',\n",
" 'Soorarai Pottru',\n",
" 'The Dark Knight',\n",
" 'The Godfather: Part II',\n",
" '12 Angry Men',\n",
" 'The Lord of the Rings: The Return of the King',\n",
" 'Pulp Fiction',\n",
" \"Schindler's List\"]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titles = [movie.find('a').text for movie in movies]\n",
"titles[0:10]"
]
},
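{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where these titles come from, here is a small illustrative check (left unexecuted here) of the first entry's raw HTML, together with the findAll-then-index alternative mentioned earlier. It assumes movies has been populated as in the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# peek at the raw HTML of the first list entry\n",
"print(movies[0].prettify())\n",
"\n",
"# find() returns the first match; findAll()[0] is the equivalent long form\n",
"assert movies[0].find('a').text == movies[0].findAll('a')[0].text\n",
"print(movies[0].find('a').text)"
]
},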
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Release years can be found under the tag span and class lister-item-year text-muted unbold. To grab these, we can follow a similar approach as before:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['(2021)',\n",
" '(1994)',\n",
" '(1972)',\n",
" '(2020)',\n",
" '(2008)',\n",
" '(1974)',\n",
" '(1957)',\n",
" '(2003)',\n",
" '(1994)',\n",
" '(1993)']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"release = [movie.find('span', class_='lister-item-year text-muted unbold').text for movie in movies]\n",
"release[0:10]"
]
},
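{
"cell_type": "markdown",
"metadata": {},
"source": [
"The release strings still carry parentheses, and some entries add extra markers such as a Roman numeral for disambiguation. If plain integer years are needed later, one way to clean them is to pull the first four-digit number out of each string. This is only a sketch, assuming every entry contains such a year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"years = []\n",
"for r in release:\n",
"    match = re.search(r'\\d{4}', r)  # first four-digit run, e.g. '1994'\n",
"    years.append(int(match.group()) if match else None)\n",
"years[0:10]"
]
},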
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extracting Numerical Values:\n",
"\n",
"In the case of IMDB ratings, number of votes and box office earnings we can see that while these may be available as string values, can also grab the actual numerical values from the data-value attribute within each respective tag.\n",
"\n",
"Here’s the IMDB rating of The Godfather: \n",
"\\
"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'9.3'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.find('div', class_='inline-block ratings-imdb-rating')['data-value']"
]
},
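{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same data-value lookup works per movie. Below is a minimal sketch, assuming each movie's rating div sits inside its lister-item-content block (the per-movie container also used further down) and that every block carries a rating."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one rating per movie block, read from data-value and cast to float\n",
"movie_blocks = soup.findAll('div', class_='lister-item-content')\n",
"ratings = [float(block.find('div', class_='inline-block ratings-imdb-rating')['data-value'])\n",
"           for block in movie_blocks]\n",
"ratings[0:10]"
]
},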
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the case of number of votes and earnings we don’t have a class attribute to filter for. Here are the number of votes and estimated box office earnings:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['166955', '2524608', '28,341,469', '1738105', '134,966,411', '105969', '2474861', '534,858,444', '1206022', '57,300,000', '745533', '4,360,000', '1742556', '377,845,905', '1945178', '107,928,762', '1290110', '96,898,818', '2219102', '292,576,195', '381254', '34422', '1985641', '37,030,102', '1763818', '315,544,750', '1948453', '330,252,182', '729757', '6,100,000', '80169', '111744', '1574783', '342,551,365', '1819851', '171,479,930', '1092313', '46,836,394', '1226372', '290,475,067', '968073', '112,000,000', '704737', '53,367,844', '33253', '1670884', '188,020,017', '731227', '7,563,397', '711808', '10,055,859', '1318236', '216,540,909', '1228259', '136,801,374', '664080', '57,598,247', '1549651', '100,125,643', '1357805', '130,742,922', '1297940', '322,740,140', '51376', '333835', '269,061', '437425', '27144', '784073', '13,092,000', '813495', '13,182,281', '1268744', '53,089,891', '1262628', '132,384,315', '787132', '32,572,577', '1426161', '187,705,427', '1081233', '6,719,864', '1044282', '23,341,568', '1103636', '19,501,238', '1002551', '422,783,777', '1050193', '204,843,350', '249906', '11,990,401', '259173', '1136081', '210,609,762', '318833', '5,321,508', '642033', '32,000,000', '473008', '36,764,313', '550923', '1,024,560', '232696', '163,245', '178698', '19,181', '80529', '80141', '1,661,096', '1126129', '335,451,311', '37320', '37822', '236671', '5,017,246', '176504', '12,391,761', '460477', '190,241,310', '993360', '858,373,000', '966948', '678,815,482', '453653', '209,726,015', '1459854', '162,805,434', '1607516', '448,139,099', '375858', '6,532,908', '184301', '1,223,869', '1061740', '223,808,164', '377926', '11,286,112', '551409', '707,481', '1187303', '25,544,867', '372970', '2,375,308', '930103', '248,159,971', '964675', '44,017,374', '640271', '83,471,511', '838483', '78,900,000', '41116', '473614', '275,902', '119338', '8,175,000', '190087', '30035', '536,364', '214280', '216460', '288,475', '46927', '898,575', '522938', '159,227,644', '54677', '4,186,168', '39268', '52258', '35364', '39058', '311464', '687,185', '235572', '7,098,492', '167495', '6,857,096']\n"
]
}
],
"source": [
"votes_earnings = soup.findAll('span', {'name':'nv'})\n",
"print([ve['data-value'] for ve in votes_earnings])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['166955', '28,341,469', '134,966,411', '2474861', '1206022']\n",
"['2524608', '1738105', '105969', '534,858,444', '57,300,000']\n"
]
}
],
"source": [
"votes = []\n",
"earnings = []\n",
"idx = 0\n",
"while idx < len(votes_earnings)-1:\n",
" votes.append(votes_earnings[idx]['data-value'])\n",
" earnings.append(votes_earnings[idx+1]['data-value'])\n",
" idx+=2\n",
"print(votes[0:5])\n",
"print(earnings[0:5])"
]
},
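{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this pairing assumes every movie lists both a vote count and a gross; when the gross is missing, the alternation slips, which appears to be why some comma-formatted gross figures show up in the votes sample above. A more defensive sketch reads the nv spans per movie block instead, assuming the vote count appears before the gross within each block."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read votes and gross per movie so a missing gross cannot shift the pairing\n",
"votes_clean, earnings_clean = [], []\n",
"for block in soup.findAll('div', class_='lister-item-content'):\n",
"    nv = block.findAll('span', {'name': 'nv'})\n",
"    votes_clean.append(nv[0]['data-value'] if len(nv) > 0 else None)\n",
"    earnings_clean.append(nv[1]['data-value'] if len(nv) > 1 else None)\n",
"print(votes_clean[0:5])\n",
"print(earnings_clean[0:5])"
]
},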
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nested Values\n",
"\n",
"In the case where the data we need is located within multiple levels of generic tags, we’ll need to dig into this nested structure to extract what we need.\n",
"\n",
"In the case of the movie directors and actors we’ll need to do just that. From inspecting the HTML we see that the director information is located within an initial p tag and thereafter an a tag — both without class attributes making it necessary to unnest the data. We’ll do this by calling find and findAll repeatedly.\n",
"\n",
"Since the director is the 1st a tag, we can extract this information through:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---\n",
"IMDb Top 250 Movies chart\n",
"IMDb Top 250 Movies chart\n",
"---\n",
"T.J. Gnanavel\n",
"T.J. Gnanavel\n",
"Suriya\n",
"Suriya\n",
"Lijo Mol Jose\n",
"Lijo Mol Jose\n",
"Manikandan\n",
"Manikandan\n",
"Rajisha Vijayan\n",
"Rajisha Vijayan\n",
"---\n",
"Frank Darabont\n",
"Frank Darabont\n",
"Tim Robbins\n",
"Tim Robbins\n",
"Morgan Freeman\n",
"Morgan Freeman\n",
"Bob Gunton\n",
"Bob Gunton\n",
"William Sadler\n",
"William Sadler\n"
]
}
],
"source": [
"for l1 in soup.findAll('p')[0:10]:\n",
" if l1.find('a'):\n",
" print('---')\n",
" for l2 in l1.findAll('a'):\n",
" print(l2)\n",
" print(l2.text)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"actors = [actor.text for actor in soup.findAll('p')[2].findAll('a')]\n",
"actors"
]
},
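{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indexing the page-wide list of p tags is fragile (the lookup above comes back empty), so a hedged alternative is to search each movie block for the credits paragraph, the one whose text mentions the director, and split its links into director and stars. This sketch assumes that paragraph contains the word 'Director' and lists the director link first, as seen in the output above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"directors, stars = [], []\n",
"for block in soup.findAll('div', class_='lister-item-content'):\n",
"    # pick the <p> that holds the credit links (its text mentions 'Director')\n",
"    credits = next((p for p in block.findAll('p') if 'Director' in p.text), None)\n",
"    links = credits.findAll('a') if credits else []\n",
"    directors.append(links[0].text if links else None)  # first link = director\n",
"    stars.append([a.text for a in links[1:]])  # remaining links = stars\n",
"print(directors[0:3])\n",
"print(stars[0:3])"
]
},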
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Creating functions to automate information scraping on IMDB:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"def numeric_value(movie, tag, class_=None, order=None):\n",
" if order:\n",
" if len(movie.findAll(tag, class_)) > 1:\n",
" to_extract = movie.findAll(tag, class_)[order]['data-value']\n",
" else:\n",
" to_extract = None\n",
" else:\n",
" to_extract = movie.find(tag, class_)['data-value']\n",
"\n",
" return to_extract\n",
"\n",
"\n",
"def text_value(movie, tag, class_=None):\n",
" if movie.find(tag, class_):\n",
" return movie.find(tag, class_).text\n",
" else:\n",
" return\n",
"\n",
"\n",
"def nested_text_value(movie, tag_1, class_1, tag_2, class_2, order=None):\n",
" if not order:\n",
" return movie.find(tag_1, class_1).find(tag_2, class_2).text\n",
" else:\n",
" return [val.text for val in movie.find(tag_1, class_1).findAll(tag_2, class_2)[order]]\n",
"\n",
"\n",
"def extract_attribute(soup, tag_1, class_1='', tag_2='', class_2='',\n",
" text_attribute=True, order=None, nested=False):\n",
" movies = soup.findAll('div', class_='lister-item-content')\n",
" data_list = []\n",
" for movie in movies:\n",
" if text_attribute:\n",
" if nested:\n",
" data_list.append(nested_text_value(movie, tag_1, class_1, tag_2, class_2, order))\n",
" else:\n",
" data_list.append(text_value(movie, tag_1, class_1))\n",
" else:\n",
" data_list.append(numeric_value(movie, tag_1, class_1, order))\n",
"\n",
" return data_list"
]
},
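{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few illustrative calls to these helpers (the exact calls used to build the DataFrame below may differ). One quirk worth noting: numeric_value only indexes when order is truthy, so leaving order unset returns the first match, which is how the vote count is taken as the first nv span while the gross needs order=1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plain text attribute, filtered by class\n",
"years_raw = extract_attribute(soup, 'span', 'lister-item-year text-muted unbold')\n",
"\n",
"# numeric data-value attributes; the nv spans are filtered by their name attribute\n",
"ratings_raw = extract_attribute(soup, 'div', 'inline-block ratings-imdb-rating', text_attribute=False)\n",
"votes_raw = extract_attribute(soup, 'span', {'name': 'nv'}, text_attribute=False)\n",
"gross_raw = extract_attribute(soup, 'span', {'name': 'nv'}, text_attribute=False, order=1)\n",
"\n",
"print(years_raw[0:3])\n",
"print(ratings_raw[0:3])\n",
"print(votes_raw[0:3])\n",
"print(gross_raw[0:3])"
]
},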
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Creating a Dataframe with the information"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"