Bollywood and Heart Data Analysis

The document presents a data analysis of Bollywood movies and heart disease data using Python libraries such as pandas and seaborn. It includes various analyses such as movie counts by genre and month, return on investment (ROI) calculations, and correlations between different variables like budget and box office collections. Additionally, it explores heart disease data, providing insights into the structure and content of the datasets.

Uploaded by

karmaarules

9/20/22, 8:59 PM Bollywood and Heart Data Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)

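import warnings
# Silence all warnings for the rest of the notebook
warnings.filterwarnings('ignore')
# Sanity check: these two warnings should now be suppressed
warnings.warn('DelftStack')
warnings.warn('Do not show this message')
print("No Warning Shown")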

No Warning Shown

In [2]:
BW = pd.read_csv('bollywood.csv')
BW.head()

Out[2]:
   SlNo Release Date           MovieName ReleaseTime     Genre  Budget  BoxOfficeCollection  YoutubeViews  You
0     1    18-Apr-14            2 States          LW   Romance      36               104.00       8576361
1     2     4-Jan-13        Table No. 21           N  Thriller      10                12.00       1087320
2     3    18-Jul-14  Amit Sahni Ki List           N    Comedy      10                 4.00        572336
3     4     4-Jan-13    Rajdhani Express           N     Drama       7                 0.35         42626
4     5     4-Jul-14        Bobby Jasoos           N    Comedy      18                10.80       3113427

In [3]:
print(BW.shape)
BW.info()

(149, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SlNo 149 non-null int64
1 Release Date 149 non-null object
2 MovieName 149 non-null object
3 ReleaseTime 149 non-null object
4 Genre 149 non-null object
5 Budget 149 non-null int64
6 BoxOfficeCollection 149 non-null float64
7 YoutubeViews 149 non-null int64
8 YoutubeLikes 149 non-null int64
9 YoutubeDislikes 149 non-null int64

dtypes: float64(1), int64(5), object(4)
memory usage: 11.8+ KB

In [4]:
Movies_by_genre = BW.groupby('Genre')['MovieName'].count().reset_index(name="MovieName_count")
print(Movies_by_genre.sort_values('MovieName_count',ascending = False))
sns.set_context("paper", font_scale= 1.5)
plt.title("MovieName_count vs Genre")
sns.barplot(x='Genre',y ='MovieName_count', data = Movies_by_genre)
plt.xticks(rotation= 80)
plt.show()
Movies_by_genre['MovieName_count'].max()

Genre MovieName_count
3 Comedy 36
0 Drama 35
5 Thriller 26
4 Romance 25
1 Action 21
2 Action 3
6 Thriller 3

Out[4]: 36
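
The output above lists Action and Thriller twice, and cell 11 filters on ' Drama ' with surrounding spaces, which suggests that some Genre labels carry stray whitespace. The following is a minimal sketch of one way to merge such variants, assuming whitespace is the only difference between the duplicate labels. The `sample` frame is hypothetical and is not the real bollywood.csv.

```python
import pandas as pd

# Hypothetical sample that mimics labels differing only by whitespace
sample = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D", "E"],
    "Genre": ["Action", "Action ", " Drama ", "Thriller", "Thriller "],
})

# Strip leading/trailing whitespace so the variants collapse into one label
sample["Genre"] = sample["Genre"].str.strip()

genre_counts = (
    sample.groupby("Genre")["MovieName"]
    .count()
    .reset_index(name="MovieName_count")
    .sort_values("MovieName_count", ascending=False)
)
print(genre_counts)
```

If the labels are cleaned this way, later filters would compare against 'Drama' rather than ' Drama '.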

In [5]:
cross_tab = pd.crosstab(BW.Genre, BW.ReleaseTime)
cross_tab

Out[5]:
ReleaseTime  FS  HS  LW   N
Genre
Drama         4   6   1  24
Action        3   3   3  12
Action        0   0   0   3
Comedy        3   5   5  23
Romance       3   3   4  15
Thriller      4   1   1  20
Thriller      0   0   1   2

In [6]:
BW['Month'] = pd.DatetimeIndex(BW['Release Date']).month
BW.head(2)
Movies_by_month = BW.groupby('Month')['MovieName'].count().reset_index(name="Movie_count")
print(Movies_by_month.sort_values('Movie_count',ascending = False))
sns.set_context("paper", font_scale= 1.5)
plt.title("Movie Count vs Month")
sns.barplot(x='Month',y ='Movie_count', data = Movies_by_month)
plt.show()
Movies_by_month['Movie_count'].max()

Month Movie_count
0 1 20
2 3 19
4 5 18
1 2 16
6 7 16
3 4 11
5 6 10
8 9 10
10 11 10
9 10 9
7 8 8
11 12 2

Out[6]: 20
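
The month extraction in cell 6 relies on pandas inferring the date format. As a hedged alternative, the sketch below parses the same strings with an explicit format, assuming every value follows the day-month-year pattern seen in Out[2] (for example 18-Apr-14). Only the five dates visible in Out[2] are used here.

```python
import pandas as pd

# Release dates copied from the first five rows shown in Out[2]
dates = pd.Series(["18-Apr-14", "4-Jan-13", "18-Jul-14", "4-Jan-13", "4-Jul-14"])

# Explicit format: day, abbreviated month name, two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")
months = parsed.dt.month
print(months.tolist())
```

An explicit format makes the parse fail loudly on an unexpected value instead of guessing.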

In [7]:
High_budget = BW[(BW['Budget'] > 25)]
HighBudgetMovies_by_month = High_budget.groupby('Month')['MovieName'].count().reset_index(name="Movie_count")
print(HighBudgetMovies_by_month.sort_values('Movie_count',ascending = False))
sns.set_context("paper", font_scale= 1.5)
plt.title("Movie Count vs Month")
sns.barplot(x='Month',y ='Movie_count', data = HighBudgetMovies_by_month)

plt.show()
HighBudgetMovies_by_month['Movie_count'].max()

Month Movie_count
1 2 9
7 8 7
0 1 6
2 3 6
6 7 6
10 11 6
5 6 5
3 4 4
8 9 4
9 10 4
4 5 3
11 12 2

Out[7]: 9

In [8]:
BW['ROI'] = (BW['BoxOfficeCollection']-BW['Budget'])/BW['Budget']
Top10_ROI = BW.sort_values('ROI',ascending = False)
Top10 = Top10_ROI[['MovieName','ROI','ReleaseTime']].head(10)
Top10

Out[8]:
                     MovieName       ROI ReleaseTime
64                  Aashiqui 2  8.166667           N
89                          PK  7.647059          HS
132                Grand Masti  7.514286          LW
135               The Lunchbox  7.500000           N
87                      Fukrey  6.240000           N
58                    Mary Kom  5.933333           N
128                     Shahid  5.666667          FS
37   Humpty Sharma Ki Dulhania  5.500000           N
101         Bhaag Milkha Bhaag  4.466667           N
115            Chennai Express  4.266667          FS
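
As a worked check of the ROI formula from cell 8, the sketch below applies it to the five movies visible in Out[2]. The small `movies` frame is rebuilt by hand from that output and is only an illustration, not the full dataset.

```python
import pandas as pd

# Budget and BoxOfficeCollection for the first five rows shown in Out[2]
movies = pd.DataFrame({
    "MovieName": ["2 States", "Table No. 21", "Amit Sahni Ki List",
                  "Rajdhani Express", "Bobby Jasoos"],
    "Budget": [36, 10, 10, 7, 18],
    "BoxOfficeCollection": [104.00, 12.00, 4.00, 0.35, 10.80],
})

# ROI = (collection - budget) / budget; a negative value means the film lost money
movies["ROI"] = (movies["BoxOfficeCollection"] - movies["Budget"]) / movies["Budget"]
print(movies[["MovieName", "ROI"]].round(3))
```

For 2 States this gives (104 - 36) / 36, roughly 1.89, well below the top-10 values listed above.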

In [9]:
cross_tab_ROI = pd.crosstab(Top10.ROI,Top10.ReleaseTime)
print(cross_tab_ROI)
Avg_ROI = Top10.groupby('ReleaseTime')['ROI'].mean()
Avg_ROI

ReleaseTime FS HS LW N
ROI
4.266667 1 0 0 0
4.466667 0 0 0 1
5.500000 0 0 0 1
5.666667 1 0 0 0
5.933333 0 0 0 1
6.240000 0 0 0 1
7.500000 0 0 0 1
7.514286 0 0 1 0
7.647059 0 1 0 0
8.166667 0 0 0 1
Out[9]: ReleaseTime
FS 4.966667
HS 7.647059
LW 7.514286
N 6.301111
Name: ROI, dtype: float64
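
Because every ROI value in the top 10 is distinct, the crosstab above has one row per movie and mostly restates Out[8]. A possible compact alternative is to report the count and the mean per ReleaseTime together. This is a sketch using the ten rows copied from Out[8]; the name `summary` is introduced here for illustration.

```python
import pandas as pd

# Top-10 ROI values and release times copied from Out[8]
top10 = pd.DataFrame({
    "ROI": [8.166667, 7.647059, 7.514286, 7.500000, 6.240000,
            5.933333, 5.666667, 5.500000, 4.466667, 4.266667],
    "ReleaseTime": ["N", "HS", "LW", "N", "N", "N", "FS", "N", "N", "FS"],
})

# Number of top-10 movies and their mean ROI per release window
summary = top10.groupby("ReleaseTime")["ROI"].agg(["count", "mean"])
print(summary)
```

Showing the counts next to the means makes it visible that the HS and LW averages each rest on a single film.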

In [27]:
sns.set_context("paper", font_scale= 1.5)
plt.title("Histogram+Density Plot(Budget)")
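# Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the closest replacement
sns.distplot(BW['Budget'], hist = True, color ='r')

### Most of the movies have budgets in the range of 2-50 crores
### There are few movies above 50 crores
### The distribution looks roughly normal with a slight right skew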

Out[27]: <AxesSubplot:title={'center':'Histogram+Density Plot(Budget)'}, xlabel='Budget', ylabel='Density'>
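
The remark about right skewness in cell 27 is read off the plot. One way to back it with a number is pandas' `Series.skew()`. The sketch below uses hypothetical budget values (not taken from bollywood.csv); in the notebook the same calls would be applied to BW['Budget'].

```python
import pandas as pd

# Hypothetical budgets (in crores); replace with BW['Budget'] in the notebook
budget = pd.Series([7, 10, 10, 18, 22, 30, 36, 45, 60, 120])

# Sample skewness: values above 0 indicate a longer right tail
skewness = budget.skew()
print(f"skewness = {skewness:.2f}, mean = {budget.mean():.1f}, median = {budget.median():.1f}")
```

A mean above the median points the same way as a positive skewness value.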

In [11]:
Comedy_ROI = BW[(BW['Genre'] == 'Comedy')]
Drama_ROI = BW[(BW['Genre'] == ' Drama ')]
Drama_ROI.head(2)

plt.figure(figsize=(12,10))
sns.distplot(Drama_ROI['ROI'], hist = True, color = 'r', label = 'Drama')
sns.distplot(Comedy_ROI['ROI'], hist = True, color = 'b', label = 'Comedy')
plt.title('Drama vs Comedy', fontsize = 16)
plt.xlabel('ROI', fontsize = 14)
plt.ylabel('Density', fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()

In [12]:
sns.set_context("paper", font_scale= 1.5)
sns.lmplot(y="YoutubeLikes", x="BoxOfficeCollection", data=BW)
### Yes, there is a positive correlation between BoxOfficeCollection and YoutubeLikes (r = 0.68 in the Out[14] correlation matrix)

Out[12]: <seaborn.axisgrid.FacetGrid at 0x1ef2dfdb970>


In [13]:
### Box Plots ###
plt.figure(figsize=(10,8))
sns.set_context("paper", font_scale= 1.5)
sns.boxplot(x="Genre", y="YoutubeLikes", data= BW, palette="Set3")
plt.xticks(rotation= 80)
plt.show()

### Action movies have more YouTube likes


In [14]:
plt.figure(figsize=(10,8))
Numerical_Variables = BW[['Budget','BoxOfficeCollection','YoutubeViews','YoutubeLikes','YoutubeDislikes','ROI']]
sns.set_context("paper", font_scale= 1.5)
sns.heatmap(Numerical_Variables.corr(), cmap= 'YlGnBu', annot=True)
plt.show()
Numerical_Variables.corr().T

### Yes, there is a high positive correlation among Budget, BoxOfficeCollection, Youtube


Out[14]:
                       Budget  BoxOfficeCollection  YoutubeViews  YoutubeLikes  YoutubeDislikes       ROI
Budget               1.000000             0.650401      0.589038      0.608916         0.665343  0.072050
BoxOfficeCollection  0.650401             1.000000      0.588632      0.682517         0.623941  0.585042
YoutubeViews         0.589038             0.588632      1.000000      0.884055         0.846739  0.252847
YoutubeLikes         0.608916             0.682517      0.884055      1.000000         0.859730  0.291302
YoutubeDislikes      0.665343             0.623941      0.846739      0.859730         1.000000  0.201533
ROI                  0.072050             0.585042      0.252847      0.291302         0.201533  1.000000
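
A correlation matrix repeats every pair twice, so the strongest relationships can be easier to read as a ranked list. The sketch below shows one way to do that, using only three columns from the five rows visible in Out[2]. Five rows are far too few to reproduce the values above, so the numbers it prints are illustrative only; in the notebook, `Numerical_Variables.corr()` would take the place of `corr`.

```python
import numpy as np
import pandas as pd

# Three numeric columns for the first five rows shown in Out[2]
sample = pd.DataFrame({
    "Budget": [36, 10, 10, 7, 18],
    "BoxOfficeCollection": [104.00, 12.00, 4.00, 0.35, 10.80],
    "YoutubeViews": [8576361, 1087320, 572336, 42626, 3113427],
})

corr = sample.corr()

# Keep only the strict upper triangle so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```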

In [15]:
Heart = pd.read_csv('SAheart.csv')
Heart.head()

Out[15]:
   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age chd
0  160    12.00  5.73      23.11  Present     49    25.30    97.20   52  Si
1  144     0.01  4.41      28.61   Absent     55    28.87     2.06   63  Si
2  118     0.08  3.48      32.28  Present     52    29.14     3.81   46  No
3  170     7.50  6.41      38.03  Present     51    31.99    24.26   58  Si
4  134    13.60  3.50      27.78  Present     60    25.99    57.34   49  Si

In [16]:
print(Heart.shape)
Heart.info()

(462, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sbp 462 non-null int64
1 tobacco 462 non-null float64
2 ldl 462 non-null float64
3 adiposity 462 non-null float64
4 famhist 462 non-null object
5 typea 462 non-null int64
6 obesity 462 non-null float64
7 alcohol 462 non-null float64
8 age 462 non-null int64
9 chd 462 non-null object
dtypes: float64(5), int64(3), object(2)
memory usage: 36.2+ KB

In [17]:
Group_data = Heart.groupby('chd')['famhist'].count().reset_index(name="famhist_count")
sns.set_context("paper", font_scale= 1.5)
plt.title('famhist_count vs chd')
sns.barplot(x = 'chd',y='famhist_count',data = Group_data)
plt.show()
Group_data.head()

Out[17]:
  chd  famhist_count
0  No            302
1  Si            160
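
The counts above can also be expressed as shares of the 462 rows. This is a small sketch that rebuilds the chd column from the two counts in Out[17]; in the notebook, `Heart['chd'].value_counts(normalize=True)` would give the same result directly.

```python
import pandas as pd

# Rebuild the chd column from the counts in Out[17]: 302 'No' and 160 'Si'
chd = pd.Series(["No"] * 302 + ["Si"] * 160, name="chd")

# normalize=True returns proportions instead of raw counts
share = chd.value_counts(normalize=True)
print(share.round(3))
```

Roughly a third of the rows carry the 'Si' label, so the two classes are not balanced.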

In [18]:
sns.set_context("paper", font_scale= 1.5)
sns.lmplot(y="age", x="sbp", data= Heart)
# Yes, there is a positive correlation between age and sbp (r = 0.39 in the Out[20] correlation matrix)

Out[18]: <seaborn.axisgrid.FacetGrid at 0x1ef2e298c10>

In [19]:
yes_chd = Heart[(Heart['chd'] == 'Si')]
No_chd = Heart[(Heart['chd'] == 'No')]
No_chd.head(2)

plt.figure(figsize=(12,10))
sns.distplot(yes_chd['tobacco'], hist = True, color = 'r', label = 'yes_chd')
sns.distplot(No_chd['tobacco'], hist = True, color = 'b', label = 'No_chd')
plt.title('yes_chd vs No_chd', fontsize = 16)
plt.xlabel('tobacco', fontsize = 14)
plt.ylabel('Density', fontsize = 14)
plt.legend(loc = 'upper left', fontsize = 13)
plt.show()

### The distributions show that those who consume tobacco have higher chances of gettin


In [20]:
plt.figure(figsize=(10,8))
Numerical_Variables1 = Heart[['sbp','obesity','age','ldl']]
sns.set_context("paper", font_scale= 1.5)
sns.heatmap(Numerical_Variables1.corr(), cmap= 'YlGnBu', annot=True)
plt.show()
Numerical_Variables1.corr().T


Out[20]:
              sbp   obesity       age       ldl
sbp      1.000000  0.238067  0.388771  0.158296
obesity  0.238067  1.000000  0.291777  0.330506
age      0.388771  0.291777  1.000000  0.311799
ldl      0.158296  0.330506  0.311799  1.000000

In [21]:
# here we define the bin edges (thresholds) for our age groups
age_groups = [0,15,35,55,64]

# and for convenience we give each of them a handy label
age_group_names = ['Young','adults','mid','old']

Heart['Age_group'] = pd.cut(Heart['age'], bins = age_groups, labels = age_group_names)


Heart.head(5)

Out[21]:
   sbp  tobacco   ldl  adiposity  famhist  typea  obesity  alcohol  age chd Age_group
0  160    12.00  5.73      23.11  Present     49    25.30    97.20   52  Si       mid
1  144     0.01  4.41      28.61   Absent     55    28.87     2.06   63  Si       old
2  118     0.08  3.48      32.28  Present     52    29.14     3.81   46  No       mid
3  170     7.50  6.41      38.03  Present     51    31.99    24.26   58  Si       old
4  134    13.60  3.50      27.78  Present     60    25.99    57.34   49  Si       mid
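
To make the binning in cell 21 concrete: with its default `right=True`, `pd.cut` treats each bin as open on the left and closed on the right, so an age equal to an edge falls into the lower group and anything above the last edge becomes NaN. The sketch below reuses the edges and labels from cell 21 on a handful of hypothetical ages placed on and around the edges.

```python
import pandas as pd

age_groups = [0, 15, 35, 55, 64]
age_group_names = ['Young', 'adults', 'mid', 'old']

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([15, 16, 35, 36, 55, 56, 64, 65])

# Default right=True: bins are (0, 15], (15, 35], (35, 55], (55, 64]
groups = pd.cut(ages, bins=age_groups, labels=age_group_names)
print(pd.DataFrame({"age": ages, "Age_group": groups}))
```

This matches the rows above, where ages 46, 49 and 52 are labelled mid and ages 58 and 63 are labelled old.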


In [22]:
chd_cases = Heart[(Heart['chd'] == 'Si')]
Group_data1 = chd_cases.groupby('Age_group')['chd'].count().reset_index(name="chd_count")
sns.set_context("paper", font_scale= 1.5)
plt.title('chd_count vs Age_group')
sns.barplot(x = 'Age_group',y='chd_count',data = Group_data1)
plt.show()
Group_data1.head(4)

Out[22]:
  Age_group  chd_count
0     Young          0
1    adults         18
2       mid         81
3       old         61

In [23]:
sns.set_context("paper", font_scale= 1.5)
plt.figure(figsize=(10,8))
sns.boxplot(x="Age_group", y="ldl", data= Heart, palette="Set3")
plt.xticks(rotation= 80)
plt.show()
