Assignment EDA Casestudy11


PYTHON DATASET

M.Sc. (DATA ANALYST) Dissertation


Submitted to Christ University

Masters of Science
By

A.LAHARI
Registration Number: 2139465

submitted to
SIVAKUMAR.R

Department of Computer Science

CHRIST UNIVERSITY
Bangalore, Karnataka-560029
(DEEMED TO BE UNIVERSITY)
August, 2021

OLYMPICS-DATASET-ANALYSIS

Introduction:
The chosen dataset consists of 271116 rows and 15 columns and describes the athletes who
took part in the Olympic Games. It records particulars such as the athlete's name; Sex - M or F;
Age - Integer; Height - In centimeters; Weight - In kilograms; Team - Team name; and several
other columns. This is a historical dataset on the modern Olympic Games, covering all the
Games from Athens 1896 to Rio 2016. Note that the Winter and Summer Games
were held in the same year up until 1992. After that, they were staggered so that the Winter
Games occur on a four-year cycle starting with 1994, then Summer in 1996, then Winter in
1998, and so on. I have done my analysis primarily on the Summer Olympics.

Content
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an
individual athlete competing in an individual Olympic event. The list below shows the columns
of the selected dataset and what each one holds.

1. ID - Unique number for each athlete;


2. Name - Athlete's name;
3. Sex - M or F;
4. Age - Integer;
5. Height - In centimeters;
6. Weight - In kilograms;
7. Team - Team name;
8. NOC - National Olympic Committee 3-letter code;
9. Games - Year and season;
10. Year - Integer;
11. Season - Summer or Winter;
12. City - Host city
13. Sport - Sport;
14. Event - Event;
15. Medal - Gold, Silver, Bronze, or NA.
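
A quick way to confirm that a loaded file matches the column list above is sketched below. The helper name `missing_columns` and the tiny sample frame are hypothetical and only for illustration; the values are not taken from athlete_events.csv.

```python
import pandas as pd

# Column names exactly as listed in the Content section.
EXPECTED_COLUMNS = [
    "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC",
    "Games", "Year", "Season", "City", "Sport", "Event", "Medal",
]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame`."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Hand-made frame with only three of the fifteen columns (illustrative values).
sample = pd.DataFrame({"ID": [1], "Name": ["Example Athlete"], "Sex": ["F"]})
print(missing_columns(sample))
```

An empty list from `missing_columns(df)` would indicate that all fifteen columns are present.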

Importing Dataset
[ ]: import pandas as pd

df = pd.read_csv('/content/120-years-of-olympic-history-athletes-and-results/athlete_events.csv')

df.head()

[ ]: ID Name … Event Medal


0 1 A Dijiang … Basketball Men's Basketball NaN
1 2 A Lamusi … Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby … Football Men's Football NaN
3 4 Edgar Lindenau Aabye … Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink … Speed Skating Women's 500 metres NaN

[5 rows x 15 columns]

[ ]: # checking the size of the dataset

df.shape

[ ]: (271116, 15)

Data Preparation & Cleaning


[ ]: # identify the columns containing null values

nan_values = df.isna()
nan_columns = nan_values.any()

columns_with_nan = df.columns[nan_columns].tolist()
print(columns_with_nan)

['Age', 'Height', 'Weight', 'Medal']


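Age, Height and Weight are numerical columns, so I replace their missing values with zero. For
Medal, I replace the NaN values with the string 'None'. I also convert the Age field to integer.
Zero is only a placeholder for a missing value here, so the scatter-plot analyses later filter out
rows where these columns are zero.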

[ ]: df[['Age','Height','Weight']] = df[['Age','Height','Weight']].fillna(0)

df.Medal = df.Medal.fillna('None')

df.Age = df.Age.astype(int)
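
Zero-filling is the approach used in this notebook. As a possible alternative (not what the rest of the notebook assumes), missing ages could be kept as missing by using pandas' nullable integer dtype `Int64`. The sketch below uses a small hand-made frame with made-up values to show how the two choices affect a simple mean.

```python
import numpy as np
import pandas as pd

# Hand-made sample (illustrative values only): one missing age, two missing medals.
sample = pd.DataFrame({
    "Age": [24.0, np.nan, 30.0],
    "Medal": ["Gold", np.nan, np.nan],
})

# Zero-filling treats the missing age as 0 and pulls the mean down.
mean_zero_filled = sample["Age"].fillna(0).mean()

# Alternative: the nullable Int64 dtype stores whole numbers and keeps the gap as <NA>.
sample["Age"] = sample["Age"].astype("Int64")
mean_ignoring_missing = sample["Age"].mean()

# Medal is handled the same way as in the notebook.
sample["Medal"] = sample["Medal"].fillna("None")

print(mean_zero_filled, mean_ignoring_missing)
```

If this alternative were adopted, the later filters of the form `df.Age != 0` would need to become `df.Age.notna()`.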

[ ]: # Let us look at the updated dataset


df.head()

[ ]: ID Name … Event Medal
0 1 A Dijiang … Basketball Men's Basketball None
1 2 A Lamusi … Judo Men's Extra-Lightweight None
2 3 Gunnar Nielsen Aaby … Football Men's Football None
3 4 Edgar Lindenau Aabye … Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink … Speed Skating Women's 500 metres None

[5 rows x 15 columns]

Exploratory Analysis and Visualization


Before we proceed with questions on the Olympic dataset, it helps to understand the
participants' demographics, i.e., country, age, gender, etc. It is essential to explore these variables
to understand how representative the participants are of the worldwide sports community.

Top countries participating in the Olympics


[ ]: top_countries = df.Team.value_counts().sort_values(ascending=False).head(10)
top_countries

[ ]: United States 17847


France 11988
Great Britain 11404
Italy 10260
Germany 9326
Canada 9279
Japan 8289
Sweden 8052
Australia 7513
Hungary 6547
Name: Team, dtype: int64

[ ]: import seaborn as sns


import matplotlib
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title('Overall Participation Countrywise')
sns.barplot(x=top_countries.index, y=top_countries);

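As the USA has historically won the largest number of medals, it makes sense that participation is
highest from the US. Surprisingly, the Soviet Union is not present in the list of top 10 countries.
Note that each row is one athlete in one event, so these counts are entries rather than distinct athletes.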
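
Because each row of the dataset is one athlete in one event, `value_counts()` on Team counts entries. A minimal sketch of counting distinct athletes instead, assuming the ID column identifies an athlete uniquely, is shown below on a small hand-made frame (made-up values).

```python
import pandas as pd

# Hand-made sample: athlete 1 enters three events, athletes 2 and 3 enter one each.
sample = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3],
    "Team": ["A", "A", "A", "A", "B"],
    "Event": ["100m", "200m", "Relay", "100m", "100m"],
})

# Entries per team: one count per athlete-event row (what value_counts gives).
entries_per_team = sample["Team"].value_counts()

# Distinct athletes per team: count unique IDs within each team.
athletes_per_team = (
    sample.groupby("Team")["ID"].nunique().sort_values(ascending=False)
)

print(entries_per_team)
print(athletes_per_team)
```

On the full data the same `groupby('Team')['ID'].nunique()` call could be applied to `df`; the ranking may differ from the entry-based ranking.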

Age Distribution
[ ]: import numpy as np
plt.figure(figsize=(12, 6))
# plt.title(df.Age)
plt.xlabel('Age')
plt.ylabel('Number of Participant')

plt.hist(df.Age, bins=np.arange(10,80,2), color='purple');

From the above distribution we observe that most participants are between 22 and 26 years old,
which makes sense, as younger athletes are likely to perform better in active sport.
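
The busiest age band can also be read off numerically rather than from the plot. The sketch below bins a handful of made-up ages with `pd.cut`, using the same 2-year bin edges as the histogram; it is an illustration, not output from the Olympic data.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([19, 22, 23, 24, 24, 25, 25, 26, 31, 40])

# Same edges as np.arange(10, 80, 2): 10, 12, ..., 78.
bin_edges = list(range(10, 80, 2))

# right=False makes each band include its left edge, e.g. [24, 26).
bands = pd.cut(ages, bins=bin_edges, right=False)
counts = bands.value_counts()

# The interval with the most observations.
busiest_band = counts.idxmax()
print(busiest_band, counts.max())
```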

Gender Distribution
[ ]: gender_counts = df.Sex.value_counts()
gender_counts

[ ]: M 196594
F 74522
Name: Sex, dtype: int64

[ ]: plt.figure(figsize=(12,6))
plt.title('Gender Distribution')
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180);

Male athletes dominate in terms of participation. Let us check the number of female participants
in the Summer Games from 1900 to 2016.

[ ]: female_participants = df[(df.Sex=='F') & (df.Season=='Summer')][['Sex','Year']]


female_participants = female_participants.groupby('Year').count().reset_index()
female_participants.head()

[ ]: Year Sex
0 1900 33
1 1904 16
2 1906 11
3 1908 47
4 1912 87
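
Raw counts grow partly because the Games themselves grew, so the share of female entries per year can be a useful companion figure. A minimal sketch on a hand-made frame (made-up values) follows; the name `female_share` is hypothetical.

```python
import pandas as pd

# Hand-made sample: 1 of 4 entries is female in 1900, 2 of 4 in 2016.
sample = pd.DataFrame({
    "Year": [1900, 1900, 1900, 1900, 2016, 2016, 2016, 2016],
    "Sex": ["M", "M", "M", "F", "M", "M", "F", "F"],
    "Season": ["Summer"] * 8,
})

summer = sample[sample["Season"] == "Summer"]

# Mean of a True/False column is the fraction of True values; times 100 gives percent.
female_share = summer["Sex"].eq("F").groupby(summer["Year"]).mean().mul(100)
print(female_share)
```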

Participants across season
[ ]: Diff_seasons = df.Season.value_counts()
Diff_seasons

[ ]: Summer 222552
Winter 48564
Name: Season, dtype: int64

Now, let's explore the sports and events across the Winter and Summer Olympics

[ ]: winter_olympic = df[df.Season=='Winter']
winter_sports = len(winter_olympic[['Sport']].drop_duplicates())
winter_events = len(winter_olympic[['Event']].drop_duplicates())
print(f'Sports Played: {winter_sports}, Events held: {winter_events}')

[ ]: Sports Played: 17, Events held: 119

[ ]: summer_olympic = df[df.Season=='Summer']
summer_sports = len(summer_olympic[['Sport']].drop_duplicates())
summer_events = len(summer_olympic[['Event']].drop_duplicates())
print(f'Sports Played: {summer_sports}, Events held: {summer_events}')

[ ]: Sports Played: 52, Events held: 651


As per the above data, the Summer Olympics have 52 sports and 651 events, whereas the
Winter Olympics have 17 sports and 119 events. This is consistent with the higher number of
participants in the Summer Olympics.
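
The two per-season counts can also be produced in one step by grouping on Season and counting distinct values. The sketch below shows the idea on a small hand-made frame (made-up values); it is an alternative formulation, not the code used above.

```python
import pandas as pd

# Hand-made sample with two seasons.
sample = pd.DataFrame({
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
    "Sport": ["Athletics", "Athletics", "Swimming", "Curling", "Curling"],
    "Event": ["100m", "200m", "100m Freestyle", "Men's Curling", "Women's Curling"],
})

# One row per season, one column per counted attribute.
summary = sample.groupby("Season")[["Sport", "Event"]].nunique()
print(summary)
```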

Asking and Answering Questions
We’ve already gained several insights about the participants involved in Olympics. Let’s ask some
specific questions and try to answer them using data frame operations and visualizations.

Q: Which countries won the most Gold medals in the most recently held Olympic
Games?
[ ]: max_year = df.Year.max()

team_list = df[(df.Year == max_year) & (df.Medal=='Gold')].Team

team_list.value_counts().head(10)
[ ]: United States 137
Great Britain 64
Russia 50
Germany 47
China 44
Brazil 34
Australia 23
Argentina 21
France 20
Japan 17
Name: Team, dtype: int64

[ ]: top_gold_teams = team_list.value_counts().head(20)
sns.barplot(x=top_gold_teams, y=top_gold_teams.index)

plt.ylabel(None);
plt.xlabel('Countrywise Gold Medals for the year 2016');

US seems to lead the Gold medal charts for the last held Olympics in the year 2016.

[ ]: US_Gold = df[(df.Year == max_year) & (df.Medal=='Gold') & (df.Team == 'United States')]

US_Gold = US_Gold[['Sport','Medal']].groupby('Sport').count()
US_Gold.reset_index(inplace=True)
Top_sports = US_Gold.sort_values('Medal', ascending=False)
Top_sports.head()

[ ]: Sport Medal
8 Swimming 48
0 Athletics 27
1 Basketball 24
10 Water Polo 13
6 Rowing 9

It seems that Swimming brought the US the largest number of Gold medals. Below is the visual
representation of the same.
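
These counts have one row per medal-winning athlete, so a team event contributes one row for every team member. A rough sketch of counting each medal once per event is given below on a hand-made frame (made-up values). It assumes that Year, Team, Event and Medal together identify a single medal, which may not hold for ties or for team names carrying a numeric suffix.

```python
import pandas as pd

# Hand-made sample: team X wins a three-person relay and one individual event.
sample = pd.DataFrame({
    "Year": [2016] * 5,
    "Team": ["X", "X", "X", "X", "Y"],
    "Event": ["Relay", "Relay", "Relay", "100m", "200m"],
    "Medal": ["Gold"] * 5,
})

# Athlete-level count: the relay is counted three times.
rows_per_team = sample["Team"].value_counts()

# Event-level count: keep one row per (Year, Team, Event, Medal).
events_won = sample.drop_duplicates(subset=["Year", "Team", "Event", "Medal"])
golds_per_team = events_won["Team"].value_counts()

print(rows_per_team)
print(golds_per_team)
```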

Q: Top 10 individuals winning the maximum number of Olympic medals for their
country?
[ ]:

df_medal_holders.head()

[ ]: ID Name … Medal Count_Of_Medals


36 4 Edgar Lindenau Aabye … Gold 1
37 15 Arvo Ossian Aaltonen … Bronze 1
38 15 Arvo Ossian Aaltonen … Bronze 1
39 16 Juhamatti Tapio Aaltonen … Bronze 1
40 17 Paavo Johannes Aaltonen … Bronze 1
[5 rows x 16 columns]

Q: Countries winning maximum medals per year?

Season = 'Summer'
group by Year, Medal, Team, Name
order by Year desc, Highest_Number_Of_Medals_per_Year)
group by Team, Year
order by Year, Highest_Number_Of_Medals_per_Year desc)
group by Year
order by Year desc
''')

[ ]: output

[ ]: Team Year Highest_Number_Of_Medals_per_Year


0 United States 2016 256
1 United States 2012 238
2 United States 2008 309
3 United States 2004 259
4 United States 2000 240
5 United States 1996 255
6 United States 1992 222
7 Soviet Union 1988 300
8 United States 1984 352
9 Soviet Union 1980 442
10 Soviet Union 1976 286
11 Soviet Union 1972 214
12 Soviet Union 1968 192
13 Soviet Union 1964 174
14 Soviet Union 1960 167
15 Soviet Union 1956 169
16 United States 1952 122
17 United States 1948 143
18 Germany 1936 215
19 United States 1932 170
20 United States 1928 88
21 United States 1924 174
22 United States 1920 194
23 Sweden 1912 153
24 Great Britain 1908 167
25 France 1906 45
26 United States 1904 199
27 France 1900 75
28 Greece 1896 44
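
The query that produced this table is only partly preserved above. A pandas sketch of the same idea, picking the team with the most medal rows in each year, is given below. It is not the original query: the column name `Medals` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hand-made sample: team A leads in 2012, team B leads in 2016.
sample = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2016, 2016, 2016, 2016],
    "Team": ["A", "A", "B", "A", "B", "B", "B"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver", "Bronze"],
})

# Keep medal rows only ('None' is the placeholder used for no medal).
medals = sample[sample["Medal"] != "None"]

# Medal rows per team and year.
per_team_year = (
    medals.groupby(["Year", "Team"]).size().reset_index(name="Medals")
)

# For every year, pick the row with the highest medal count.
top_rows = per_team_year.loc[per_team_year.groupby("Year")["Medals"].idxmax()]
top_rows = top_rows.sort_values("Year", ascending=False).reset_index(drop=True)
print(top_rows)
```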

Q: Athletes (and their countries) with the highest medal counts?

[ ]: df_highest_medals = (
    df_medal_holders[['Name', 'Year', 'Team', 'Count_Of_Medals']]
    .groupby(['Name', 'Year', 'Team']).sum()
    .sort_values('Count_Of_Medals', ascending=False)
)

df_highest_medals = (
    df_highest_medals.groupby(['Name', 'Team']).sum()
    .sort_values('Count_Of_Medals', ascending=False)
    .head(10)
)
df_highest_medals.reset_index(inplace=True)

[ ]: df_highest_medals

[ ]: Name Team Count_Of_Medals


0 Michael Fred Phelps, II United States 28
1 Larysa Semenivna Latynina (Diriy-) Soviet Union 18
2 Nikolay Yefimovich Andrianov Soviet Union 15
3 Ole Einar Bjrndalen Norway 13
4 Borys Anfiyanovych Shakhlin Soviet Union 13
5 Edoardo Mangiarotti Italy 13
6 Takashi Ono Japan 13
7 Paavo Johannes Nurmi Finland 12
8 Sawao Kato Japan 12
9 Dara Grace Torres (-Hoffman, -Minas) United States 12
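
Grouping by Name alone would merge two different athletes who happen to share a name. A hedged variant that also groups by the ID column is sketched below on a hand-made frame (made-up names and values); it counts medal rows directly instead of summing a helper column.

```python
import pandas as pd

# Hand-made sample: athlete 1 has three medals, athlete 3 has two, athlete 2 has one.
sample = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 3, 3],
    "Name": ["Athlete One"] * 4 + ["Athlete Two"] + ["Athlete Three"] * 2,
    "Team": ["A", "A", "A", "A", "B", "A", "A"],
    "Medal": ["Gold", "Gold", "Silver", "None", "Silver", "Bronze", "Gold"],
})

# Drop rows without a medal ('None' is the placeholder string).
medalists = sample[sample["Medal"] != "None"]

# One row per athlete with the number of medal rows, highest first.
top_athletes = (
    medalists.groupby(["ID", "Name", "Team"])
    .size()
    .reset_index(name="Count_Of_Medals")
    .sort_values("Count_Of_Medals", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
print(top_athletes)
```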

[ ]: from matplotlib import pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
# The pie slices are labelled with athlete names, not team names
athlete_names = df_highest_medals.Name
medal_counts = df_highest_medals.Count_Of_Medals
ax.pie(medal_counts, labels=athlete_names, autopct='%1.2f%%')
plt.show()

Q: Spread of Medal based on Age, Height and Weight

Age and Height
[ ]: df1 = df[(df.Age != 0) & (df.Height != 0.0) & (df.Medal!='None') & (df.Season =='Summer')]

sns.scatterplot(x=df1.Age, y=df1.Height, hue='Sex', data=df1)


plt.xlabel("Age")
plt.ylabel("Height");

An interesting observation is that athletes with a height of 140 cm or less and an age of 20 or less
are also winning medals at the Olympics. Let us see in which sports they won these medals.

[ ]: df1[(df1.Height<=140.0) & (df1.Age <=20)][['Name','Height','Age','Sport']].sort_values(['Sport'])

[ ]: Name Height Age Sport


256836 Wang Xin (Ruoxue-) 137.0 15 Diving
256837 Wang Xin (Ruoxue-) 137.0 15 Diving
13741 Oana Mihaela Ban 139.0 18 Gymnastics
23763 Loredana Boboc 139.0 16 Gymnastics
31837 Diana Laura Bulimar 140.0 16 Gymnastics
69216 Mariya Yevgenyevna Filatova (-Kurbatova) 136.0 14 Gymnastics
69222 Mariya Yevgenyevna Filatova (-Kurbatova) 136.0 19 Gymnastics
69225 Mariya Yevgenyevna Filatova (-Kurbatova) 136.0 19 Gymnastics
108408 Jiang Yuyuan 140.0 16 Gymnastics
143279 Lu Li 136.0 15 Gymnastics
143280 Lu Li 136.0 15 Gymnastics
160840 Mo Huilan 140.0 17 Gymnastics
160920 Dominique Helena Moceanu (-Canales) 139.0 14 Gymnastics
256864 Wang Yan 140.0 16 Gymnastics
270182 Kimberley Lyn "Kim" Zmeskal (-Burdette) 139.0 16 Gymnastics

It is clearly visible that shorter and younger athletes have done well in gymnastics.

Age and Weight


[ ]: df2 = df[(df.Age != 0) & (df.Weight != 0.0) & (df.Medal!='None') & (df.Season =='Summer')]

sns.scatterplot(x=df2.Age, y=df2.Weight, hue='Sex', data=df2)


plt.xlabel("Age")
plt.ylabel("Weight");

Let us look at the medalists with very high weight and those with very low weight

[ ]: df2[(df2['Weight']>=170) | (df2['Weight']<=30)][['Name','Weight','Sport']]

[ ]: Name Weight Sport


39181 Andrey Ivanovich Chemerkin 170.0 Weightlifting
39182 Andrey Ivanovich Chemerkin 170.0 Weightlifting
69216 Mariya Yevgenyevna Filatova (-Kurbatova) 30.0 Gymnastics
69222 Mariya Yevgenyevna Filatova (-Kurbatova) 30.0 Gymnastics
69225 Mariya Yevgenyevna Filatova (-Kurbatova) 30.0 Gymnastics
143279 Lu Li 30.0 Gymnastics
143280 Lu Li 30.0 Gymnastics
173166 Dmitry Yuryevich Nosov 175.0 Judo
237040 Christopher J. "Chris" Taylor 182.0 Wrestling
256836 Wang Xin (Ruoxue-) 28.0 Diving
256837 Wang Xin (Ruoxue-) 28.0 Diving

Height and Weight


[ ]: df3 = df[(df.Age != 0) & (df.Weight != 0.0) & (df.Height != 0.0) & (df.Medal!='None') & (df.Season =='Summer')]

sns.scatterplot(x=df3.Height, y=df3.Weight, hue='Sex', data=df3)

plt.xlabel("Height")
plt.ylabel("Weight");

Let us look at the athletes who have high weight

[ ]: df3[(df3.Weight>160)]

[ ]: ID … Medal
39181 20144 … Gold
39182 20144 … Bronze
124420 62843 … Silver
173166 87041 … Bronze
237040 118869 … Bronze
268659 134407 … Gold
268660 134407 … Gold

[7 rows x 15 columns]

As I observe, most of these sports are Wrestling, Weightlifting and Judo

Q: Women's participation at the Olympics

[ ]: Women_In_Olympics = df[(df.Sex == 'F') & (df.Medal != 'None') & (df.Season =='Summer')]

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=Women_In_Olympics)
plt.title('Women medals per edition of the Games');

As the trend shows, the number of medals won by women, and with it women's participation, has on average been increasing over the years

Q: Medals won by individuals aged 50 or more?

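[ ]: # .copy() avoids pandas' SettingWithCopyWarning when adding the new column
df_medal_holders = df[(df.Medal !='None') & (df.Season == 'Summer')].copy()

df_medal_holders['Count_Of_Medals'] = 1

# Build the mask from the filtered frame so that its index matches
df_medal_holders_above50 = df_medal_holders[df_medal_holders.Age >= 50]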

df_medal_holders_above50.head(10)

[ ]: ID Name … Medal Count_Of_Medals
3680 2112 Abdullah Al-Rashidi … Bronze 1
5077 2894 Derek Swithin Allhusen … Silver 1
5078 2894 Derek Swithin Allhusen … Gold 1
7961 4404 Johan August Anker … Gold 1
13393 7272 Nikolaus "Klaus" Balkenhol … Bronze 1
13394 7272 Nikolaus "Klaus" Balkenhol … Gold 1
13396 7272 Nikolaus "Klaus" Balkenhol … Gold 1
14364 7744 Ernest Barberolle … Silver 1
17552 9349 Ludger Beerbaum … Bronze 1
21999 11599 Rudolf Georg Binding … Silver 1

[10 rows x 16 columns]


Q: Display the count of medals for each sport (athletes aged 50 or more)?

[ ]: # Sum only the medal counter, so non-numeric columns never enter the aggregation
df_medal_holders_above50_list = df_medal_holders_above50.groupby('Sport')[['Count_Of_Medals']].sum().sort_values('Count_Of_Medals', ascending=False)

df_medal_holders_above50_reset_index = df_medal_holders_above50_list.reset_index()

df_medal_holders_above50_reset_index.head()

[ ]: Sport Count_Of_Medals
0 Equestrianism 53
1 Shooting 50
2 Sailing 46
3 Art Competitions 37
4 Archery 34

Individuals aged 50 or more have been winning medals in Equestrianism, Shooting, Sailing, Art
Competitions and Archery. These sports seem to require more mental strength than physical strength.

 INFERENCES AND CONCLUSIONS
We’ve drawn many inferences from the dataset. Here’s a summary of a few of them:
• The US dominates in terms of Gold medals won at the most recent Games as well as overall
participation in the Games.
• We observe athletes from the age of 12 till the age of 58 years winning medals.
• The Summer Olympics have a higher number of events and sports than the Winter Olympics.
• In the 120-year history of the Olympics, Michael Fred Phelps, II has won the maximum number
of medals for his country, i.e. 28 medals.
• We see that the number of women participants has been on an upward trend over the years.
• Participants with high weight (e.g. more than 150 kg) seem to have done well in Wrestling,
Weightlifting and Judo.

 REFERENCES AND FUTURE WORK


Check out the following resources to learn more about the dataset and tools used in this notebook:
• 120 Years of Olympic History: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
• Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
• Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
• Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html
• Numpy user guide & tutorial: https://numpy.pydata.org/tutorial.html
