Assignment EDA Casestudy11
Masters of Science
By
A.LAHARI
Registration Number: 2139465
submitted to
SIVAKUMAR.R
CHRIST UNIVERSITY
Bangalore, Karnataka-560029
(DEEMED TO BE UNIVERSITY)
August, 2021
OLYMPICS-DATASET-ANALYSIS
Introduction:
The chosen dataset consists of 271116 rows and 15 columns. It records the athletes who took
part in the Olympic Games, with particulars such as the athlete's name; Sex (M or F); Age
(integer); Height (in centimeters); Weight (in kilograms); Team (team name); and several
other columns. This is a historical dataset on the modern Olympic Games, covering all the
Games from Athens 1896 to Rio 2016. Note that the Winter and Summer Games were held in the
same year up until 1992. After that, they were staggered so that the Winter Games occur on a
four-year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on.
I have done my analysis primarily on the Summer Olympics.
Content
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an
individual athlete competing in an individual Olympic event. The table below shows the variables
considered and their respective attributes in the selected dataset.
Importing Dataset
[ ]: import pandas as pd
df = pd.read_csv('/content/120-years-of-olympic-history-athletes-and-results/athlete_events.csv')
df.head()
[5 rows x 15 columns]
df.shape
[ ]: (271116, 15)
nan_values = df.isna()
nan_columns = nan_values.any()
columns_with_nan = df.columns[nan_columns].tolist()
print(columns_with_nan)
[ ]: df[['Age','Height','Weight']] = df[['Age','Height','Weight']].fillna(0)
df.Medal = df.Medal.fillna('None')
df.Age = df.Age.astype(int)
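The missing-value check and the placeholder fill above can be tried on a small frame first. The sketch below uses a made-up three-row `toy` DataFrame (an assumption, not the real data) and shows that rows filled with 0 have to be filtered out again before computing statistics such as the mean age.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for athlete_events.csv
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 30.0],
    "Height": [180.0, 170.0, np.nan],
    "Medal": ["Gold", np.nan, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same fill strategy as in the cell above
toy[["Age", "Height"]] = toy[["Age", "Height"]].fillna(0)
toy["Medal"] = toy["Medal"].fillna("None")
toy["Age"] = toy["Age"].astype(int)

# The 0 placeholders must be excluded before averaging
mean_age_known = toy.loc[toy["Age"] != 0, "Age"].mean()
print(mean_age_known)
```

This is why the later cells filter on `df.Age != 0` and `df.Height != 0.0` before plotting.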
[ ]: ID Name … Event Medal
0 1 A Dijiang … Basketball Men's Basketball None
1 2 A Lamusi … Judo Men's Extra-Lightweight None
2 3 Gunnar Nielsen Aaby … Football Men's Football None
3 4 Edgar Lindenau Aabye … Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink … Speed Skating Women's 500 metres None
[5 rows x 15 columns]
# Plotting libraries used from here on
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title('Overall Participation Countrywise')
sns.barplot(x=top_countries.index, y=top_countries);
As the USA has historically won the most medals, it makes sense that participation is highest
from the US. Surprisingly, the Soviet Union is not in the list of top 10 countries.
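The cell that builds `top_countries` is not shown in this report. A minimal sketch of one way such a series could be produced, assuming it counts rows per `Team` (the toy data and the choice of the `Team` column are assumptions):

```python
import pandas as pd

# Hypothetical toy rows; the real frame has one row per athlete-event
toy = pd.DataFrame({
    "Team": ["United States", "France", "United States",
             "Italy", "France", "United States"],
})

# Count rows per team and keep the most frequent ones
# (head(10) would give a top-10 list on the real data)
top_countries = toy["Team"].value_counts().head(2)
print(top_countries)
```

The resulting series has team names as its index and row counts as its values, which matches how it is passed to `sns.barplot` above.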
Age Distribution
[ ]: import numpy as np
plt.figure(figsize=(12, 6))
# plt.title(df.Age)
plt.xlabel('Age')
plt.ylabel('Number of Participants')
From the above distribution we observe that most participants are between 22 and 26 years old,
which makes sense, as younger athletes are likely to perform better in active sports.
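The age-band observation can also be checked numerically. The sketch below bins a small, made-up list of ages with `pd.cut`; the ages and bin edges are assumptions chosen for illustration, and 0 stands for an unknown age as in the fill step earlier.

```python
import pandas as pd

# Hypothetical ages; 0 marks an unknown age
ages = pd.Series([19, 22, 23, 24, 25, 26, 26, 31, 0, 0])

# Drop the placeholder values before binning
known = ages[ages != 0]

# Intervals are right-inclusive by default: (17, 21], (21, 26], (26, 31]
bins = [17, 21, 26, 31]
counts = pd.cut(known, bins=bins).value_counts().sort_index()
print(counts)
```

On the real data, a call such as `plt.hist(known, bins=...)` would draw the same counts as a histogram.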
Gender Distribution
[ ]: gender_counts = df.Sex.value_counts()
gender_counts
[ ]: M 196594
F 74522
Name: Sex, dtype: int64
[ ]: plt.figure(figsize=(12,6))
plt.title('Gender Distribution')
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180);
Male athletes dominate in terms of participation. Let us check the number of female participants
for the years 1900 to 2016.
[ ]: Year Sex
0 1900 33
1 1904 16
2 1906 11
3 1908 47
4 1912 87
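The cell that produced the table above is not shown. A minimal sketch of a group-by that yields the same `Year` / `Sex` layout, using made-up rows (the toy data is an assumption):

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Year": [1900, 1900, 1900, 1904, 1904],
    "Sex": ["F", "M", "F", "F", "M"],
})

# Keep female rows, then count them per Games year
female_per_year = (
    toy[toy["Sex"] == "F"]
    .groupby("Year")["Sex"]
    .count()
    .reset_index()
)
print(female_per_year)
```

After `reset_index()`, the `Sex` column holds the count of female rows for each year.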
Participants across season
[ ]: Diff_seasons = df.Season.value_counts()
Diff_seasons
[ ]: Summer 222552
Winter 48564
Name: Season, dtype: int64
Now, let's explore the sports and events across the Winter and Summer Olympics.
[ ]: winter_olympic = df[df.Season=='Winter']
winter_sports = len(winter_olympic[['Sport']].drop_duplicates())
winter_events = len(winter_olympic[['Event']].drop_duplicates())
print(f'Sports Played: {winter_sports}, Events held: {winter_events}')
[ ]: summer_olympic = df[df.Season=='Summer']
summer_sports = len(summer_olympic[['Sport']].drop_duplicates())
summer_events = len(summer_olympic[['Event']].drop_duplicates())
print(f'Sports Played: {summer_sports}, Events held: {summer_events}')
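The two cells above repeat the same counting logic for each season. As an alternative sketch (not the method used in this report), both seasons can be counted in one step with `groupby` and `nunique`; the toy rows are made up.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
    "Sport": ["Athletics", "Athletics", "Judo", "Luge", "Luge"],
    "Event": ["100m", "200m", "Judo Open", "Luge Singles", "Luge Doubles"],
})

# Distinct sports and events per season
per_season = toy.groupby("Season")[["Sport", "Event"]].nunique()
print(per_season)
```

`nunique` counts distinct values per group, which is equivalent to taking `len` of `drop_duplicates()` on each season's subset.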
Asking and Answering Questions
We’ve already gained several insights about the participants involved in the Olympics. Let’s ask some
specific questions and try to answer them using data frame operations and visualizations.
Q: Which countries won the most gold medals in the most recent Olympic Games?
[ ]: max_year = df.Year.max()
team_list.value_counts().head(10)
[ ]: United States 137
Great Britain 64
Russia 50
Germany 47
China 44
Brazil 34
Australia 23
Argentina 21
France 20
Japan 17
Name: Team, dtype: int64
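The definition of `team_list` is not shown above. A minimal sketch of how such a series might be built, assuming it selects the `Team` of every gold-medal row in the latest year (the toy data and the filter are assumptions):

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Year": [2012, 2016, 2016, 2016, 2016],
    "Team": ["China", "United States", "United States", "Russia", "Brazil"],
    "Medal": ["Gold", "Gold", "Gold", "Gold", "Silver"],
})

# Latest Games year in the data
max_year = toy["Year"].max()

# Team of every gold-medal row from that year
is_latest_gold = (toy["Year"] == max_year) & (toy["Medal"] == "Gold")
team_list = toy.loc[is_latest_gold, "Team"]

gold_counts = team_list.value_counts()
print(gold_counts)
```

Because each row is one athlete in one event, a gold in a team event is counted once per team member, so counts obtained this way are higher than a country's official gold medal tally.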
[ ]: sns.barplot(x=team_list.value_counts().head(20), y=team_list.value_counts().head(20).index)
# plt.title(schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Countrywise Medals for the year 2016');
The US leads the gold medal chart for the most recent Olympics, held in 2016.
US_Gold = US_Gold[['Sport','Medal']].groupby('Sport').count()
US_Gold.reset_index(inplace=True)
Top_sports = US_Gold.sort_values('Medal', ascending=False)
Top_sports.head()
[ ]: Sport Medal
8 Swimming 48
0 Athletics 27
1 Basketball 24
10 Water Polo 13
6 Rowing 9
It seems that Swimming brought the US the most gold medals. Below is a visual representation
of the same.
Q: Which top 10 individuals won the maximum number of Olympic medals for their
country?
[ ]:
df_medal_holders.head()
Season = 'Summer'
group by Year, Medal, Team, Name
order by Year desc, Highest_Number_Of_Medals_per_Year)
group by Team, Year
order by Year, Highest_Number_Of_Medals_per_Year desc)
group by Year
order by Year desc
''')
[ ]: output
Q: Countries with the highest medals?
‹→sort_values('Count_Of_Medals', ascending=False)
df_highest_medals = df_highest_medals.groupby(['Name','Team']).sum().sort_values('Count_Of_Medals', ascending=False).head(10)
df_highest_medals.reset_index(inplace=True)
[ ]: df_highest_medals
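The start of this cell is cut off, so how `Count_Of_Medals` was created is not visible. A minimal sketch of a per-athlete medal tally, assuming the column is a constant 1 on every medal row (the toy data and that assumption are illustrative only):

```python
import pandas as pd

# Hypothetical toy rows; 'None' marks a row without a medal
toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B", "C"],
    "Team": ["X", "X", "X", "Y", "Y", "Y", "X"],
    "Medal": ["Gold", "Silver", "None", "Gold", "Gold", "Gold", "None"],
})

# Keep medal rows and give each one a count of 1 (assumed helper column)
medal_rows = toy[toy["Medal"] != "None"].copy()
medal_rows["Count_Of_Medals"] = 1

# Sum per athlete and team, highest first
top_medalists = (
    medal_rows.groupby(["Name", "Team"], as_index=False)["Count_Of_Medals"]
    .sum()
    .sort_values("Count_Of_Medals", ascending=False)
    .head(10)
)
print(top_medalists)
```

Selecting only `Count_Of_Medals` before `sum()` avoids summing unrelated columns such as `ID` or `Year`.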
Q: Spread of Medals based on Age, Height and Weight
Age and Height
[ ]: df1 = df[(df.Age != 0) & (df.Height != 0.0) & (df.Medal!='None') & (df.Season=='Summer')]
An interesting observation is that athletes with a height of less than 140 cm and an age of less
than 20 are also winning medals at the Olympics. Let us see in which sports they won these medals.
It is clearly visible that athletes of lesser height and age have done well in gymnastics.
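The cell behind this observation is not shown. A minimal sketch of the filter it describes, with made-up rows and the thresholds taken from the text (height below 140, age below 20):

```python
import pandas as pd

# Hypothetical toy rows; 0 marks an unknown age or height
toy = pd.DataFrame({
    "Age": [15, 16, 25, 19, 0],
    "Height": [136.0, 139.0, 190.0, 138.0, 0.0],
    "Sport": ["Gymnastics", "Gymnastics", "Basketball", "Diving", "Gymnastics"],
    "Medal": ["Gold", "Bronze", "Gold", "Silver", "Gold"],
})

# Medal rows with known age and height
known = toy[(toy["Age"] != 0) & (toy["Height"] != 0.0) & (toy["Medal"] != "None")]

# Short and young medalists, counted per sport
small_young = known[(known["Height"] < 140) & (known["Age"] < 20)]
by_sport = small_young["Sport"].value_counts()
print(by_sport)
```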
Let us have a look at the medalists with high weight and also the ones with low weight.
[ ]: df2[(df2['Weight']>=170) | (df2['Weight']<=30)][['Name','Weight','Sport']]
plt.ylabel("Weight");
[ ]: df3[(df3.Weight>160)]
[ ]: ID … Medal
39181 20144 … Gold
39182 20144 … Bronze
124420 62843 … Silver
173166 87041 … Bronze
237040 118869 … Bronze
268659 134407 … Gold
268660 134407 … Gold
[7 rows x 15 columns]
Q: Women's participation at the Olympics
sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=Women_In_Olympics)
plt.title('Women medals per edition of the Games');
As the trend shows, women's participation has on average been increasing over the years.
df_medal_holders_above50.head(10)
[ ]: ID Name … Medal Count_Of_Medals
3680 2112 Abdullah Al-Rashidi … Bronze 1
5077 2894 Derek Swithin Allhusen … Silver 1
5078 2894 Derek Swithin Allhusen … Gold 1
7961 4404 Johan August Anker … Gold 1
13393 7272 Nikolaus "Klaus" Balkenhol … Bronze 1
13394 7272 Nikolaus "Klaus" Balkenhol … Gold 1
13396 7272 Nikolaus "Klaus" Balkenhol … Gold 1
14364 7744 Ernest Barberolle … Silver 1
17552 9349 Ludger Beerbaum … Bronze 1
21999 11599 Rudolf Georg Binding … Silver 1
[ ]: df_medal_holders_above50_list = df_medal_holders_above50.groupby(['Sport']).sum().sort_values('Count_Of_Medals', ascending=False).drop(['ID','Age','Height','Weight','Year'], axis = 1)
df_medal_holders_above50_reset_index = df_medal_holders_above50_list.reset_index()
df_medal_holders_above50_reset_index.head()
[ ]: Sport Count_Of_Medals
0 Equestrianism 53
1 Shooting 50
2 Sailing 46
3 Art Competitions 37
4 Archery 34
Individuals above 50 have been doing well in Equestrianism, Shooting, Sailing, Art Competitions and
Archery. These sports seem to require more mental strength than physical strength.
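The cell that creates `df_medal_holders_above50` is not shown. A minimal sketch of the whole step on made-up rows, assuming the frame holds medal rows with `Age` above 50 and a `Count_Of_Medals` column of ones (both are assumptions):

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Age": [52, 61, 55, 30, 58],
    "Sport": ["Shooting", "Equestrianism", "Equestrianism", "Swimming", "Sailing"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "None"],
})

# Medalists older than 50 (assumed threshold: strictly greater than 50)
above50 = toy[(toy["Age"] > 50) & (toy["Medal"] != "None")].copy()
above50["Count_Of_Medals"] = 1

# Medals per sport, highest first
by_sport = (
    above50.groupby("Sport")["Count_Of_Medals"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(by_sport)
```

Summing only `Count_Of_Medals` gives the same two-column result as the cell above without needing to drop `ID`, `Age`, `Height`, `Weight` and `Year` afterwards.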
INFERENCES AND CONCLUSIONS
We’ve drawn many inferences from the analysis. Here’s a summary of a few of them:
• The US dominates in terms of gold medals won as well as overall participation in the Games.
• We observe athletes from the age of 12 to the age of 58 winning medals.
• The Summer Olympics have a higher number of events and sports than the Winter Olympics.
• In the 120-year history of the Olympics, Michael Fred Phelps, II has won the most medals
for his country, i.e. 28 medals.
• We see an upward trend in women's participation across the years.
• Participants with high weight (e.g. > 150 kg) seem to have done well in Wrestling, Weightlifting
and Judo.