Python Project 1
Life Expectancy
1.0
Trần Nguyễn Quỳnh Anh
1.0
Nguyễn Trần Hoàng Phúc
Đỗ Lê Huy
1.0
Plot 5
Trần Nguyễn Quỳnh Anh Design Report
Write description
Plot 1
Design Report
Nguyễn Trần Hoàng Phúc
Check content
Plot 6
Write description
Nguyễn Vương Minh
Check grammar
Re-check plot 1-3
Plot 7
Đỗ Lê Huy Write description
Re-check plot 9-10
Plot 8 - 10
Hoàng Xuân Phước Check grammar + content
Re-check plot 6 - 8
Vietnamese - German University
LIFE EXPECTANCY
&
SOCIO-ECONOMIC
WORLD BANK
DATASET
BASIC
INFORMATION
- SHRITEJ SHRIKANT CHAVAN -
Introduction
The "Life Expectancy & Socio-Economic" dataset provides information
on various socio-economic factors and their impact on life expectancy
across different countries and regions. With 16 columns and 3307 rows
of data, it offers a multifaceted view of the factors influencing
human health and well-being.
Link
Life expectancy & Socio-Economic (world bank) dataset link:
https://www.kaggle.com/datasets/mjshri23/life-expectancy-and-socio-economic-world-bank/data
Why we chose this dataset
The "Life Expectancy & Socio-Economic" dataset offers a rich and
comprehensive exploration of the interplay between various socio-
economic indicators and life expectancy across different countries
and regions. This dataset is particularly intriguing due to its breadth,
covering aspects such as income groups, health expenditure,
education expenditure, unemployment rates, and prevalence of
undernourishment, among others.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
Socio_Economic_and_Life_expectancy = pd.read_csv('C:\\Users\\Phucn\\Documents\\Python\\1.csv')

# Keep only rows that have both an income group and a life expectancy value
Plot1 = Socio_Economic_and_Life_expectancy.dropna(subset=["IncomeGroup", "Life Expectancy World Bank"])

# Set the style before drawing so that it applies to this figure
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
# One violin per income group, with quartile lines drawn inside
sns.violinplot(data=Plot1, x='IncomeGroup', y='Life Expectancy World Bank',
               scale='width', inner='quartile', palette='PuBu')
plt.legend([], [], frameon=False)
plt.show()
LIFE EXPECTANCY BY
INCOME GROUP
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

Socio_Economic_and_Life_expectancy = pd.read_csv('C:\\Users\\HP\\Desktop\\VGU\\Python\\Project_1\\life expectancy.csv')

# avg_diseases_and_injuries holds one row per IncomeGroup and DiseaseType with an
# AverageCount column; the step that builds it is not shown on this slide.
income_levels = ["Low income", "Lower middle income", "Upper middle income", "High income"]
avg_diseases_and_injuries['IncomeGroup'] = pd.Categorical(avg_diseases_and_injuries['IncomeGroup'],
                                                          categories=income_levels, ordered=True)
sns.set_theme(style="whitegrid")
palette = {
    'Communicable': 'red',
    'NonCommunicable': 'blue',
    'Injuries': 'yellow'
}
# One bar-chart panel per disease type, each with its own y-axis scale
g = sns.catplot(
    data=avg_diseases_and_injuries,
    x='IncomeGroup', y='AverageCount', hue='DiseaseType',
    kind='bar', col='DiseaseType', col_wrap=3, sharey=False,
    palette=palette
)
for ax in g.axes.flat:
    # Write the value above every non-empty bar
    for p in ax.patches:
        if p.get_height() > 0:
            ax.annotate(f'{p.get_height():.2f}',
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center', xytext=(0, 10), textcoords='offset points')
    # Thousands separators on the y-axis and rotated x-axis labels
    ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{int(x):,}'))
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
# Remove the legend only if catplot actually drew one
if g._legend is not None:
    g._legend.remove()
plt.show()
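The table avg_diseases_and_injuries is used in the code above, but the step that builds it is not shown. The following is only a sketch of one way such a table could be produced. It assumes the dataset has one numeric DALY column per cause named Communicable, NonCommunicable and Injuries (names taken from the palette keys above); the helper name average_dalys_by_income_group and the toy rows are hypothetical and invented for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration only.
toy = pd.DataFrame({
    "IncomeGroup": ["Low income", "Low income", "High income", "High income"],
    "Communicable": [100.0, 300.0, 10.0, 30.0],
    "NonCommunicable": [50.0, 150.0, 200.0, 400.0],
    "Injuries": [20.0, 40.0, 15.0, 25.0],
})


def average_dalys_by_income_group(df):
    """Return one row per (IncomeGroup, DiseaseType) with the mean DALY count."""
    # Wide to long: one row per record and disease type
    long_df = df.melt(
        id_vars="IncomeGroup",
        value_vars=["Communicable", "NonCommunicable", "Injuries"],
        var_name="DiseaseType",
        value_name="Count",
    )
    # Average the counts within each income group and disease type
    return (
        long_df.groupby(["IncomeGroup", "DiseaseType"], as_index=False)["Count"]
        .mean()
        .rename(columns={"Count": "AverageCount"})
    )


avg_diseases_and_injuries = average_dalys_by_income_group(toy)
print(avg_diseases_and_injuries)
```

The long format (one row per income group and disease type) is what lets catplot split the panels by DiseaseType.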
Average DALYs due to various factors
As seen from the plot, DALYs (disability-adjusted life years) caused by
injuries are less prevalent than DALYs caused by diseases. The highest
average number of healthy years lost to injuries is around 2,200,000,
in the Lower middle income group.
Overall, the plot shows that lower income countries are affected by
both communicable and non-communicable diseases, whereas higher
income countries lose most DALYs to non-communicable diseases
alone, suggesting that these countries have a better quality of life
than poorer ones. Moreover, injuries account for fewer DALYs than
diseases in every income group.
The percentage of Income Groups
of different Regions
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Socio_Economic_and_Life_expectancy is the DataFrame loaded with pd.read_csv earlier.
# Count the rows for every Region / IncomeGroup combination
region_income_counts = pd.crosstab(Socio_Economic_and_Life_expectancy['Region'],
                                   Socio_Economic_and_Life_expectancy['IncomeGroup'])
# Reshape to long format: one row per Region and IncomeGroup
region_income_dataframe = region_income_counts.reset_index().melt(id_vars='Region',
                                                                  var_name='IncomeGroup', value_name='Count')
# Drop combinations that never occur
region_income_dataframe = region_income_dataframe[region_income_dataframe['Count'] != 0]
# Share of each income group within its region, in percent
region_income_dataframe['Percentage'] = region_income_dataframe.groupby('Region')['Count'].transform(lambda x: x / x.sum() * 100)
income_group_colors = {
    'High income': 'red',
    'Upper middle income': 'blue',
    'Lower middle income': 'green',
    'Low income': 'yellow'
}
sns.set(style="whitegrid")
# pie_plot is the helper that draws one pie chart per facet; its definition is not shown on this slide.
g = sns.FacetGrid(region_income_dataframe, col="Region", col_wrap=2, sharex=False, sharey=False)
g.map_dataframe(pie_plot)
g.set_titles("{col_name}")
g.fig.suptitle("The percentage of Income Groups of different Regions", y=1.05)
plt.subplots_adjust(top=0.90, right=0.85)
plt.show()
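The helper pie_plot is called above but never defined in the slides. The following is only a sketch of what such a helper might look like, not the authors' version. It assumes the helper receives each region's rows through the data keyword (which is how FacetGrid.map_dataframe passes them) and colours the wedges with income_group_colors; the toy counts are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Same colour mapping as in the slide code
income_group_colors = {
    "High income": "red",
    "Upper middle income": "blue",
    "Lower middle income": "green",
    "Low income": "yellow",
}


def pie_plot(data, color=None, **kwargs):
    """Hypothetical helper: draw one pie chart from the rows of a single facet."""
    ax = plt.gca()  # FacetGrid makes the facet's axes current before calling the helper
    ax.pie(
        data["Percentage"],
        labels=data["IncomeGroup"],
        colors=[income_group_colors[name] for name in data["IncomeGroup"]],
        autopct="%1.1f%%",
    )


# Toy counts, invented for illustration only
toy = pd.DataFrame({
    "Region": ["North America", "South Asia", "South Asia"],
    "IncomeGroup": ["High income", "Lower middle income", "Low income"],
    "Count": [3, 6, 2],
})
toy["Percentage"] = toy.groupby("Region")["Count"].transform(lambda x: x / x.sum() * 100)

g = sns.FacetGrid(toy, col="Region", col_wrap=2, sharex=False, sharey=False)
g.map_dataframe(pie_plot)
g.set_titles("{col_name}")
```

The color argument is accepted but ignored, because map_dataframe always passes one and the wedge colours come from the mapping instead.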
The percentage of Income Groups
of different Regions
From the plot, we can see that East Asia & Pacific, Europe & Central
Asia, Latin America & Caribbean and Middle East & North Africa have
no Low income countries, while North America has 100% of its
countries in the High income group.
Overall, the plot shows that most regions have no Low income
countries, and that North America consists only of High income
countries. Low income countries are most prevalent in Sub-Saharan
Africa, and Lower middle income countries in South Asia.
OCCURRENCES OF CORRUPTION
OF DIFFERENT INCOME GROUPS
BY YEARS
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Socio_Economic_and_Life_expectancy = pd.read_csv('C:\\Users\\HP\\Desktop\\VGU\\Python\\Project_1\\life expectancy.csv')
# Show missing corruption ratings as their own 'N/A' category
Socio_Economic_and_Life_expectancy['Corruption'] = Socio_Economic_and_Life_expectancy['Corruption'].fillna('N/A')
# Order the income groups from lowest to highest
income_order = ["Low income", "Lower middle income", "Upper middle income", "High income"]
Socio_Economic_and_Life_expectancy['IncomeGroup'] = pd.Categorical(
    Socio_Economic_and_Life_expectancy['IncomeGroup'],
    categories=income_order, ordered=True)
# One count plot per income group
g = sns.catplot(
    data=Socio_Economic_and_Life_expectancy,
    x='Corruption',
    hue='IncomeGroup',
    kind='count',
    palette='viridis',
    col='IncomeGroup',
    col_wrap=2,
    height=4,
    aspect=1,
    legend=False
)
# Label every bar with its count
for ax in g.axes.flatten():
    for c in ax.containers:
        labels = [f'{int(v.get_height())}' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge', padding=2, fontsize=10)
# Label the x-axis ticks with the corruption categories, rotated for readability
corruption_levels = Socio_Economic_and_Life_expectancy['Corruption'].unique()
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=45)
    ax.set_xticks(range(len(corruption_levels)))
    ax.set_xticklabels(corruption_levels, rotation=45)
plt.show()
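The bar heights in the count plot can also be checked as a plain table. The following is a minimal sketch, assuming the same Corruption and IncomeGroup columns as above; the toy rows and the CorruptionLabel column are invented for illustration.

```python
import pandas as pd

income_order = ["Low income", "Lower middle income", "Upper middle income", "High income"]

# Toy stand-in for the real dataset; the values are invented for illustration only.
toy = pd.DataFrame({
    "IncomeGroup": ["Low income", "Low income", "Lower middle income",
                    "High income", "High income"],
    "Corruption": [2.5, 3.0, 2.5, None, None],
})

# Turn every rating into a text label so that missing values become 'N/A'
toy["CorruptionLabel"] = toy["Corruption"].map(
    lambda v: "N/A" if pd.isna(v) else str(v)
)

# Rows: income groups in the chosen order, columns: corruption labels, cells: record counts
counts = pd.crosstab(toy["IncomeGroup"], toy["CorruptionLabel"]).reindex(
    income_order, fill_value=0
)
print(counts)
```

Each cell of counts corresponds to one bar in the matching income-group panel, and a group with no records appears as a row of zeros.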
OCCURRENCES OF CORRUPTION
OF DIFFERENT INCOME GROUPS
BY YEARS
Overall, the plot shows that countries in the higher income groups
have little to no recorded corruption. Lower income groups,
however, are more prone to corruption at various levels.
HEALTH AND EDUCATION EXPENDITURE OF
COUNTRIES IN DIFFERENT REGIONS
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

file_path = 'C:\\Users\\HP\\Desktop\\VGU\\Python\\Project_1\\life expectancy.csv'
Socio_Economic_and_Life_expectancy = pd.read_csv(file_path)
# Keep the three columns needed and drop rows with missing values
filtered_data = Socio_Economic_and_Life_expectancy[['Health Expenditure %', 'Education Expenditure %', 'Region']].dropna()
# Give every region its own colour
regions = filtered_data['Region'].unique()
palette = dict(zip(regions, sns.color_palette("tab10", len(regions))))
# One scatter plot per region, each with its own axis ranges
g = sns.FacetGrid(filtered_data, col="Region", col_wrap=3,
                  height=4, sharex=False, sharey=False)
g.map_dataframe(sns.scatterplot, x="Health Expenditure %",
                y="Education Expenditure %", hue="Region", palette=palette,
                legend=False)
g.set_titles(col_template="{col_name}")
plt.subplots_adjust(top=0.9)
g.fig.suptitle("Health and Education Expenditure of Countries in Different Regions", fontsize=16)
plt.show()
HEALTH AND EDUCATION EXPENDITURE OF
COUNTRIES IN DIFFERENT REGIONS
This scatter plot visualizes the relationship between Health
Expenditure % and Education Expenditure % across different
regions. Each point corresponds to a specific country within a
region and shows that country's health and education
expenditure as a percentage of GDP.
From the plot above, we can see that most countries spend
between 2% and 10% of GDP on health and on education. This
holds for Sub-Saharan Africa, East Asia & Pacific, Europe &
Central Asia and Latin America & Caribbean. South Asia and
Middle East & North Africa, however, show more varied health
and education expenditure.
Socio_Economic_and_Life_expectancy = pd.read_csv('C:\\Users\\HP\\Desktop\\VGU\\Python\\Project_1\\life expectancy.csv')
# Column name kept exactly as used by the authors
filtered_data = Socio_Economic_and_Life_expectancy.dropna(subset=['Prevelance of Undernourishment'])
income_group_order = filtered_data['IncomeGroup'].unique()
custom_palette = {"Low income": "red", "Lower middle income": "blue",
                  "Upper middle income": "yellow", "High income": "green"}
# The call that creates the facet grid g is not shown on this slide.
for ax in g.axes.flat:
    ax.set_xlabel('Prevalence of Undernourishment (%)')
    ax.set_xlim(0, 60)
    ax.set_title('')
g.set_axis_labels("", "")
g.fig.suptitle("Prevalence of Undernourishment across various income groups", y=0.99)
g.add_legend(title='Income Group')
plt.show()
Prevalence of Undernourishment
across various income groups
Socio_Economic_and_Life_expectancy = pd.read_csv(r"C:\Users\Admin\OneDrive\Documents\Python\life expectancy.csv")
# Keep the columns needed and drop rows with missing values
filtered_data = Socio_Economic_and_Life_expectancy[['Year', 'Unemployment',
                                                    'Education Expenditure %', 'Region']].dropna()
# The averaging step and the call that creates the facet grid g are not shown on this slide.
plt.subplots_adjust(top=0.92)
g.fig.suptitle('Average Education Expenditure vs Average Unemployment Rate by Region', fontsize=16)
plt.show()
Average Education Expenditure vs
Average Unemployment Rate by Region
The plot illustrates how the average unemployment rate relates to the
average education expenditure percentage across different regions over
several years. Each point represents a specific year within a region.
The plot shows that the relationship between average education
expenditure and average unemployment is positive in some regions and
negative in others, while East Asia & Pacific displays almost no
correlation between unemployment and education.
For Sub-Saharan African countries, the general trend is that years with
higher education spending are also years with lower unemployment. Some
years with an average education expenditure between 3.25% and 4.25%
show unemployment from 9% to nearly 11%, but most years in the same
expenditure range show only around 6% to 8% unemployment. The same
trend also appears for South Asia and Latin America & Caribbean.
Countries in Europe & Central Asia, Middle East & North Africa and North
America, however, show a positive correlation between average education
expenditure and average unemployment, most notably Middle East & North
Africa. In the years with an average education expenditure between 4.25%
and 4.75%, the unemployment rate hovers around 6% to 7%. The years that
spend more than 4.75%, however, also tend to have a higher unemployment
rate, with one instance of 6% education expenditure alongside 11%
unemployment.
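The positive and negative relationships described above can be expressed as one number per region. The following is a minimal sketch, assuming the same column names as in filtered_data and using the Pearson correlation as the measure (a choice made here, not one stated in the slides); the region labels A and B and the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for filtered_data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2001, 2002, 2003],
    "Region": ["A", "A", "A", "B", "B", "B"],
    "Education Expenditure %": [3.0, 4.0, 5.0, 3.0, 4.0, 5.0],
    "Unemployment": [9.0, 8.0, 7.0, 5.0, 6.0, 7.0],
})

# Step 1: average both indicators per region and year (one point per year, as in the plot)
yearly = toy.groupby(["Region", "Year"], as_index=False)[
    ["Education Expenditure %", "Unemployment"]
].mean()

# Step 2: Pearson correlation between the two yearly averages within each region
corr_by_region = {
    region: grp["Education Expenditure %"].corr(grp["Unemployment"])
    for region, grp in yearly.groupby("Region")
}
print(corr_by_region)
```

A value near -1 matches the pattern described for Sub-Saharan Africa (more spending, less unemployment), a value near +1 the pattern described for Middle East & North Africa, and a value near 0 the one described for East Asia & Pacific.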
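from matplotlib.ticker import FuncFormatter

Socio_Economic_and_Life_expectancy = pd.read_csv(r"C:\Users\Admin\OneDrive\Documents\Python\life expectancy.csv")
# Keep the columns needed and drop rows with missing values
filtered_data = Socio_Economic_and_Life_expectancy[['Year', 'CO2', 'Region']].dropna()
# The averaging step and the call that creates the facet grid g are not shown on this slide.
# Show y-axis values with thousands separators
for ax in g.axes.flat:
    ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{int(x):,}'))
plt.subplots_adjust(left=0.1, top=0.92)
plt.show()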
AVERAGE CO2 EMISSIONS OF DIFFERENT
REGIONS FROM 2001 TO 2019
The plot visualizes the average level of CO2 emissions over time
across different regions. Each line represents the average CO2
level for a specific region.
Looking at the graph, we can see that almost all regions display a
steep increase in CO2 emissions over the years, with only North
America and Europe & Central Asia showing a downward trend.
Among the regions with rising emissions, East Asia & Pacific has
the highest level, reaching about 650,000 kilotons in 2019.
Sub-Saharan Africa, on the other hand, has the lowest: even on an
upward trend, it reaches only about 18,500 kilotons.
North America and Europe & Central Asia differ from the other
regions because their CO2 emissions are decreasing. North America
reported a CO2 level of around 3,100,000 kilotons in 2015, but by
the end of 2019 the figure had fallen to 2,700,000 kilotons. The
same goes for Europe & Central Asia, whose reported CO2 emissions
fell from 105,000 kilotons to around 87,000 kilotons in 2019.
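The upward and downward trends described above can be checked numerically. The following is a minimal sketch, assuming the Year, Region and CO2 columns used in filtered_data and comparing only the first and last year on record (a simplification chosen here); the region labels A and B and the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for filtered_data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Year": [2001, 2001, 2019, 2019, 2001, 2019],
    "Region": ["A", "A", "A", "A", "B", "B"],
    "CO2": [100.0, 300.0, 400.0, 600.0, 900.0, 700.0],
})

# Average CO2 per region and year, with one column per year
avg_co2 = toy.groupby(["Region", "Year"])["CO2"].mean().unstack("Year")

# Change between the first and the last year on record
first_year, last_year = avg_co2.columns.min(), avg_co2.columns.max()
change = avg_co2[last_year] - avg_co2[first_year]

# Classify each region by the sign of the change
trend = change.apply(lambda d: "up" if d > 0 else "down")
print(pd.DataFrame({"change": change, "trend": trend}))
```

Comparing only the end points ignores what happens in between, so it complements the line plot rather than replacing it.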
# R (ggplot2) code for the box plot of Sanitation by Region and IncomeGroup
library(ggplot2)
library(dplyr)
filtered_data <- na.omit(Socio_Economic_and_Life_expectancy[c("Sanitation", "Region", "IncomeGroup")])
ggplot(filtered_data, aes(x = Region, y = Sanitation, fill = IncomeGroup)) +
  geom_boxplot() +
  labs(y = "Sanitation %", fill = "Income Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.title.x = element_blank()) +
  scale_fill_viridis_d(option = "C")

# Python code for the unemployment histograms
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Socio_Economic_and_Life_expectancy = pd.read_csv('C:\\Users\\huydo\\OneDrive\\Desktop\\Python\\life expectancy.csv')
filtered_data = Socio_Economic_and_Life_expectancy[['Year', 'Unemployment', 'Region']].dropna()
# Average unemployment per year and region
avg_unemployment = (filtered_data.groupby(['Year', 'Region'], as_index=False)
                    .agg({'Unemployment': 'mean'})
                    .rename(columns={'Unemployment': 'avg_unemployment'}))
# FacetGrid creates its own figure, so no separate plt.figure() call is needed
g = sns.FacetGrid(avg_unemployment, col="Region", col_wrap=4, sharey=False)
g.map_dataframe(sns.histplot, x='avg_unemployment', binwidth=0.5, kde=False, alpha=0.7)
g.set_titles("{col_name}")
plt.subplots_adjust(top=0.9)
g.fig.suptitle("Distribution of average unemployment of different years in different regions", fontsize=16)
num_regions = avg_unemployment['Region'].nunique()
colors = sns.color_palette("viridis", num_regions)
# Replace the per-panel axis labels with one shared label per axis
for ax in g.axes.flatten():
    ax.set_xlabel('')
    ax.set_ylabel('')
g.fig.text(0.5, 0.02, 'Average Unemployment Rate', ha='center', fontsize=12)
g.fig.text(0.02, 0.5, 'Number of Years', va='center', rotation='vertical', fontsize=12)
plt.show()
Distribution of average Unemployment of
different years in different regions
Upon examination, we can see that the East Asia & Pacific
region has the lowest average unemployment rate, with most
years averaging between 2.6% and nearly 4% of the labor force.
South Asia also exhibits a low average unemployment rate,
concentrated around 5%.
# Same column selection as in the R version: drop rows with missing values
filtered_data = Socio_Economic_and_Life_expectancy[['Sanitation', 'Region', 'IncomeGroup']].dropna()
income_groups = filtered_data['IncomeGroup'].unique()
colors = ['red', 'blue', 'green', 'yellow']
if len(income_groups) > len(colors):
    raise ValueError("Not enough colors defined for the number of unique IncomeGroups")
# Assign one colour to each income group
palette_dict = dict(zip(income_groups, colors[:len(income_groups)]))
# Set the theme before drawing so that it applies to this figure
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 8))
sns.boxplot(data=filtered_data, x='Region', y='Sanitation', hue='IncomeGroup',
            palette=palette_dict)
plt.xlabel('')
plt.ylabel('Sanitation %')
plt.title('Average Sanitation across different regions and Income Groups')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Income Group')
plt.tight_layout()
plt.show()
Average Sanitation across different
regions and income groups
The plot visualizes sanitation levels across regions and income
groups. The x-axis shows the regions, while the y-axis shows the
percentage of the population with access to safe sanitation
services.
Starting with the broad overview, it is clear that High income
countries have the highest sanitation coverage.
In conclusion, the plot shows that the higher a country's income
group, the larger the share of its population with access to safe
sanitation services. It also highlights the severe lack of sanitation
in countries throughout Sub-Saharan Africa.
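The comparison read off the box plot can also be summarized as a table of medians. The following is a minimal sketch, assuming the Sanitation, Region and IncomeGroup columns used above; the toy values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for filtered_data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Region": ["Sub-Saharan Africa"] * 3 + ["Europe & Central Asia"] * 3,
    "IncomeGroup": ["Low income", "Low income", "Upper middle income",
                    "Upper middle income", "High income", "High income"],
    "Sanitation": [10.0, 20.0, 40.0, 70.0, 90.0, 98.0],
})

# Median sanitation per region (rows) and income group (columns);
# combinations that never occur show up as NaN
median_sanitation = (
    toy.groupby(["Region", "IncomeGroup"])["Sanitation"]
    .median()
    .unstack("IncomeGroup")
)
print(median_sanitation)
```

The median is the centre line of each box in the plot, so the table gives one number per box.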