230103-ECON209 S2025 Lab 2.ipynb-Colab

The document outlines a Jupyter notebook for conducting exploratory data analysis (EDA) using Python libraries such as pandas, seaborn, and matplotlib. It includes steps for importing a dataset from Google Drive, checking for missing values, and performing descriptive statistics on various columns including age, income, and education level. The analysis reveals insights into the dataset, including means, medians, and frequency counts for different variables.

Uploaded by

mthunguyen.work

2/15/25, 10:23 PM [230103] ECON209_S2025__Lab_2.ipynb - Colab

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Writing LaTeX in Colab

It's the same as what we do on Overleaf, but here is the guide by Colab for your convenience.
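For example, a text cell containing the following markup renders as a display equation (the sample-mean formula here is just an illustration; any LaTeX between the dollar signs works):

```latex
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
```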

Importing external dataset

Option 1
#Upload from drive
#Remember to upload your file ONTO GOOGLE DRIVE and paste the file path EXACTLY!
from google.colab import drive
drive.mount('/content/drive')
dfAd = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/PAPI2018_sample_clean.csv') #paste the file path exactly


Exploratory Data Analysis (EDA)

Basic EDA
print(dfAd.head())

   Unnamed: 0     id  urban  female   age  time_in_commune_or_ward  \
0           0   7014      1       1  56.0                     20.0
1           1   7003      1       0  37.0                     37.0
2           2   3780      1       1  34.0                     34.0
3           3  13742      1       1  36.0                     36.0
4           4  11886      0       0  61.0                     61.0

   time_in_province  lv_educ  no_family_members  party_member      income
0                20      4.0                  2             0   5000000.0
1                37      6.0                  5             0   7000000.0
2                34      8.0                  4             1  15000000.0
3                36      6.0                  3             1  15000000.0
4                61      0.0                  3             0   5000000.0

dfAd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5000 non-null int64
1 id 5000 non-null int64
2 urban 5000 non-null int64
3 female 5000 non-null int64
4 age 4994 non-null float64
5 time_in_commune_or_ward 5000 non-null float64
6 time_in_province 5000 non-null int64
7 lv_educ 4997 non-null float64
8 no_family_members 5000 non-null int64
9 party_member 5000 non-null int64
10 income 4563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 429.8 KB

# Check missing values


print('Missing values: %i' % dfAd.isnull().sum().sum())

Missing values: 446
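The double `.sum()` above collapses everything to one grand total. Dropping the second `.sum()` gives per-column missing counts, which is usually more informative. A minimal sketch on a toy frame (hypothetical values, standing in for `dfAd`, which is not reproduced here):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for dfAd (hypothetical values, not the PAPI data)
toy = pd.DataFrame({
    'age':    [56.0, np.nan, 34.0],
    'income': [5e6, 7e6, np.nan],
})

# Per-column missing counts: one .sum() instead of two
per_column = toy.isnull().sum()
print(per_column)

# Grand total, as in the notebook
print('Missing values: %i' % per_column.sum())
```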

https://colab.research.google.com/drive/1wdiU4z6X7O8LtE1Khs6hwkfgz2s2I-lf#scrollTo=EV2ke4YlNOp1 1/10
2/15/25, 10:23 PM [230103] ECON209_S2025__Lab_2.ipynb - Colab
# Drop missing data if any
dfAd = dfAd.dropna()

# We can quickly get descriptive table


dfAd.describe()

       Unnamed: 0           id        urban       female          age  time_in_commune_or_ward  time_in_province     lv_educ
count  4557.000000  4557.000000  4557.000000  4557.000000  4557.000000              4557.000000       4557.000000  4557.000…
mean   2508.703533  7147.316656     0.600176     0.523371    48.914198                38.429339         45.693878     4.478…
std    1443.518452  4141.753328     0.489916     0.499508    11.582921                57.320103         50.448311     2.240…
min       0.000000     1.000000     0.000000     0.000000    18.000000                 1.000000          2.000000     0.000…
25%    1261.000000  3603.000000     0.000000     0.000000    40.000000                22.000000         33.000000     3.000…
50%    2509.000000  7064.000000     1.000000     1.000000    50.000000                35.000000         43.000000     4.000…
75%    3760.000000 10684.000000     1.000000     1.000000    58.000000                48.000000         54.000000     6.000…
max    4998.000000 14445.000000     1.000000     1.000000    95.000000               888.000000        888.000000     9.000…

Looking at 'count', we see that no missing values remain after the dropna() above.


Comparing the median (50%) with the mean suggests whether the distribution is likely left-skewed or right-skewed: if the median is less than the mean, the distribution is likely right-skewed; if the median is greater than the mean, it is likely left-skewed.
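This rule of thumb can be wrapped in a small helper. A sketch (the function name `skew_direction` is ours, not part of the lab):

```python
import pandas as pd

def skew_direction(s: pd.Series) -> str:
    """Rough skew diagnosis by comparing mean and median."""
    mean, median = s.mean(), s.median()
    if median < mean:
        return 'right-skewed'
    if median > mean:
        return 'left-skewed'
    return 'approximately symmetric'

# A long right tail pulls the mean (21.6) above the median (2)
s = pd.Series([1, 2, 2, 3, 100])
print(skew_direction(s))  # right-skewed
```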

Age
# or we can get more details for a given column
age_mean = dfAd['age'].mean()
age_var = dfAd['age'].var()
age_std = dfAd['age'].std()
age_median = dfAd['age'].median()
[q1, q3] = dfAd['age'].quantile([.25, .75]).values
age_min = dfAd['age'].min()
age_max = dfAd['age'].max()

print('mean:', age_mean)
print('var:', age_var)
print('stdev:', age_std)
print('median:', age_median)
print('q1:', q1)
print('q3:', q3)
print('min:', age_min)
print('max:', age_max)

mean: 48.914197937239415
var: 134.16405869039997
stdev: 11.582920991287127
median: 50.0
q1: 40.0
q3: 58.0
min: 18.0
max: 95.0

time_in_province
time_in_province_mean = dfAd['time_in_province'].mean()
time_in_province_var = dfAd['time_in_province'].var()
time_in_province_std = dfAd['time_in_province'].std()
time_in_province_median = dfAd['time_in_province'].median()
[q1, q3] = dfAd['time_in_province'].quantile([.25, .75]).values
time_in_province_min = dfAd['time_in_province'].min()
time_in_province_max = dfAd['time_in_province'].max()

print('mean:', time_in_province_mean)
print('var:', time_in_province_var)
print('stdev:', time_in_province_std)

print('median:', time_in_province_median)
print('q1:', q1)
print('q3:', q3)
print('min:', time_in_province_min)
print('max:', time_in_province_max)

mean: 45.69387755102041
var: 2545.0320366952697
stdev: 50.448310543518396
median: 43.0
q1: 33.0
q3: 54.0
min: 2
max: 888
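Note the max of 888 against a mean of about 46: 888 looks like a codebook sentinel (e.g. "whole life" or "don't know") rather than a real number of years. That is an assumption to verify against the PAPI 2018 codebook before acting on it. If it holds, such codes should be masked before computing statistics; a sketch on toy values:

```python
import pandas as pd
import numpy as np

# Hypothetical sentinel check -- treating 888 as a missing-data code is an
# assumption; confirm it in the PAPI 2018 codebook first.
s = pd.Series([20, 37, 888, 61, 888, 43])

n_sentinel = (s == 888).sum()
cleaned = s.replace(888, np.nan)   # NaN is ignored by mean/var/etc.

print('sentinel rows:', n_sentinel)         # 2
print('mean with 888:', s.mean())
print('mean without 888:', cleaned.mean())  # (20+37+61+43)/4 = 40.25
```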

no_family_members
no_family_members_mean = dfAd['no_family_members'].mean()
no_family_members_var = dfAd['no_family_members'].var()
no_family_members_std = dfAd['no_family_members'].std()
no_family_members_median = dfAd['no_family_members'].median()
[q1, q3] = dfAd['no_family_members'].quantile([.25, .75]).values
no_family_members_min = dfAd['no_family_members'].min()
no_family_members_max = dfAd['no_family_members'].max()

print('mean:', no_family_members_mean)
print('var:', no_family_members_var)
print('stdev:', no_family_members_std)
print('median:', no_family_members_median)
print('q1:', q1)
print('q3:', q3)
print('min:', no_family_members_min)
print('max:', no_family_members_max)

mean: 4.393899495281984
var: 3.0232551373942944
stdev: 1.7387510280066822
median: 4.0
q1: 3.0
q3: 5.0
min: 1
max: 20

income
income_mean = dfAd['income'].mean()
income_var = dfAd['income'].var()
income_std = dfAd['income'].std()
income_median = dfAd['income'].median()
[q1, q3] = dfAd['income'].quantile([.25, .75]).values
income_min = dfAd['income'].min()
income_max = dfAd['income'].max()

print('mean:', income_mean)
print('var:', income_var)
print('stdev:', income_std)
print('median:', income_median)
print('q1:', q1)
print('q3:', q3)
print('min:', income_min)
print('max:', income_max)

mean: 10212420.452051789
var: 131753814091836.88
stdev: 11478406.426496532
median: 7000000.0
q1: 5000000.0
q3: 13000000.0
min: 1000000.0
max: 200000000.0

lv_educ

# or we can get more details for a given column


lv_educ_mean = dfAd['lv_educ'].mean()
lv_educ_var = dfAd['lv_educ'].var()
lv_educ_std = dfAd['lv_educ'].std()
lv_educ_median = dfAd['lv_educ'].median()
[q1, q3] = dfAd['lv_educ'].quantile([.25, .75]).values
lv_educ_min = dfAd['lv_educ'].min()
lv_educ_max = dfAd['lv_educ'].max()

print('mean:', lv_educ_mean)
print('var:', lv_educ_var)
print('stdev:', lv_educ_std)
print('median:', lv_educ_median)
print('q1:', q1)
print('q3:', q3)
print('min:', lv_educ_min)
print('max:', lv_educ_max)

mean: 4.47816545973228
var: 5.019990663574289
stdev: 2.2405335667144755
median: 4.0
q1: 3.0
q3: 6.0
min: 0.0
max: 9.0

urban
# or we can get more details for a given column
urban_mean = dfAd['urban'].mean()
urban_var = dfAd['urban'].var()
urban_std = dfAd['urban'].std()
urban_median = dfAd['urban'].median()
[q1, q3] = dfAd['urban'].quantile([.25, .75]).values
urban_min = dfAd['urban'].min()
urban_max = dfAd['urban'].max()

print('mean:', urban_mean)
print('var:', urban_var)
print('stdev:', urban_std)
print('median:', urban_median)
print('q1:', q1)
print('q3:', q3)
print('min:',urban_min)
print('max:', urban_max)

mean: 0.6001755540926048
var: 0.24001752843651028
stdev: 0.48991583811559947
median: 1.0
q1: 0.0
q3: 1.0
min: 0
max: 1

party_member
# or we can get more details for a given column
party_member_mean = dfAd['party_member'].mean()
party_member_var = dfAd['party_member'].var()
party_member_std = dfAd['party_member'].std()
party_member_median = dfAd['party_member'].median()
[q1, q3] = dfAd['party_member'].quantile([.25, .75]).values
party_member_min = dfAd['party_member'].min()
party_member_max = dfAd['party_member'].max()

print('mean:', party_member_mean)
print('var:', party_member_var)
print('stdev:', party_member_std)
print('median:', party_member_median)

print('q1:', q1)
print('q3:', q3)
print('min:',party_member_min)
print('max:', party_member_max)

# Check the frequency of some variable


dfAd['age'].value_counts()

count
age
60.0    175
55.0    169
58.0    166
50.0    156
56.0    153
...     ...
19.0      3
78.0      2
77.0      1
80.0      1
95.0      1

62 rows × 1 columns

dtype: int64
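`value_counts` can also report relative frequencies instead of raw counts via `normalize=True`. A sketch on a toy series (not the PAPI data):

```python
import pandas as pd

s = pd.Series([1, 1, 0, 1, 0])

print(s.value_counts())                # absolute counts: 1 -> 3, 0 -> 2
print(s.value_counts(normalize=True)) # shares: 1 -> 0.6, 0 -> 0.4
```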

# We can select specific columns to be analysed


dfAd[['age', 'time_in_province']].head(10)

     age  time_in_province
0   56.0                20
1   37.0                37
2   34.0                34
3   36.0                36
4   61.0                61
5   40.0                40
7   47.0                47
8   63.0                63
9   55.0                55
10  41.0                41

Univariate Analysis & Plotting

Univariate analysis: descriptive analysis for a single variable

Histograms - Age
#Histograms, distribution plots, boxplots are all good univariate analysis tools
figure, axes = plt.subplots(1, 2, figsize=(20,10)) #Create a grid with multiple sub-plots if you want to display all plots together

#Histograms
sns.histplot(ax = axes[0], data = dfAd['age'], discrete=True)

#Distribution plots
sns.histplot(ax = axes[1], data = dfAd['age'], stat = 'probability', element = 'step')

<Axes: xlabel='age', ylabel='Probability'>

#In case you are using the free version of Colab with around 12GB of RAM, it might not be able to handle the seaborn/sns plots.
#Matplotlib/plt is an alternative in these cases, although the plots might not look as good.

plt.hist(dfAd['age'], bins = 20)


plt.show()

Histograms - Income
plt.hist(dfAd['income'], bins = 20)
plt.show()
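Because income here spans 1,000,000 to 200,000,000, evenly spaced bins lump almost all observations into the first one or two bars. A common fix is logarithmically spaced bins with a log x-axis. A sketch with synthetic right-skewed incomes (hypothetical values, since `dfAd` is not loaded here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed incomes standing in for dfAd['income']
rng = np.random.default_rng(0)
income = rng.lognormal(mean=16, sigma=0.8, size=1000)

# Logarithmically spaced bin edges spread the long right tail over many bins
bins = np.logspace(np.log10(income.min()), np.log10(income.max()), 20)
plt.hist(income, bins=bins)
plt.xscale('log')
plt.xlabel('income (log scale)')
plt.show()
```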


Histograms - lv_educ
plt.hist(dfAd['lv_educ'], bins = 20)
plt.show()

Box plots - income of party-member vs non-party-member respondents


import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.boxplot(x='party_member', y='income', data=dfAd)

plt.xlabel('Party Member (0 = Non-member, 1 = Member)')


plt.ylabel('Income')
plt.title('Box Plot of Income by Party Membership')

plt.show()
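The boxplot comparison can be backed with grouped descriptive statistics via `groupby(...).describe()`. A sketch on a toy frame standing in for `dfAd[['party_member', 'income']]` (hypothetical values):

```python
import pandas as pd

# Toy stand-in for dfAd[['party_member', 'income']] -- hypothetical values
toy = pd.DataFrame({
    'party_member': [0, 0, 0, 1, 1],
    'income': [5e6, 7e6, 6e6, 15e6, 13e6],
})

# One row of summary statistics per group
summary = toy.groupby('party_member')['income'].describe()
print(summary[['count', 'mean', '50%']])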


Bivariate Analysis & Plotting


Bivariate analysis: descriptive analysis to see relationship among different variables

#A bivariate analysis tool is the scatterplot


sns.scatterplot(x = 'lv_educ', y = 'income', data = dfAd)

<Axes: xlabel='lv_educ', ylabel='income'>
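A scatterplot shows the relationship visually; a correlation coefficient quantifies it. A sketch on toy data (hypothetical values, not the PAPI sample):

```python
import pandas as pd

# Toy stand-in for dfAd[['lv_educ', 'income']] -- hypothetical values
toy = pd.DataFrame({
    'lv_educ': [1, 2, 3, 4, 5],
    'income':  [4e6, 5e6, 7e6, 8e6, 12e6],
})

# Pearson correlation matrix for the two plotted variables
print(toy[['lv_educ', 'income']].corr())
```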

Random sampling


#Let's draw 1 sample of size 100 from dfAd. Pandas has a built-in sample function, so it's pretty easy
n_size = 100 #define the sample size
dfAd_sample = dfAd.sample(n = n_size, replace = False)
dfAd_sample


      Unnamed: 0    id  urban  female   age  time_in_commune_or_ward  time_in_province  lv_educ  no_family_members  part…
2099        2099  2472      1       0  52.0                     20.0                20      4.0                  4  …
2568        2568  1806      1       0  58.0                     58.0                58      4.0                  3  …
744          744  3536      0       0  66.0                     66.0                66      6.0                  9  …
1148        1148   859      1       1  41.0                     10.0                41      6.0                  4  …
1538        1538  3878      0       0  56.0                     56.0                56      6.0                  5  …
...          ...   ...    ...     ...   ...                      ...               ...      ...                ...  …
334          334  4488      0       0  62.0                     62.0                62      0.0                  2  …
290          290  8000      1       1  60.0                     40.0                60      6.0                  2  …
1678        1678  3793      1       0  53.0                     40.0                47      4.0                  3  …
851          851   315      1       0  47.0                     25.0                47      5.0                  5  …
4034        4034  9254      0       1  42.0                     42.0                42      3.0                  5  …

100 rows × 11 columns
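Note that `sample` draws a different random sample on every run. Passing `random_state` fixes the seed so the draw is reproducible, which matters when you want classmates (or your future self) to get the same t-test results. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# Fixing random_state makes the draw reproducible across runs
a = df.sample(n=10, replace=False, random_state=42)
b = df.sample(n=10, replace=False, random_state=42)

print(a.index.equals(b.index))  # True: same rows both times
```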

Hypothesis testing with t-test


from scipy import stats

One-sample t-test


Suppose we want to test whether the mean income in dfAd_sample is consistent with the full-sample mean of income (treated here as the population mean). The confidence level chosen here is
95%, meaning the significance level $\alpha = 0.05$.

t_res = stats.ttest_1samp(dfAd_sample["income"], popmean=income_mean)


t_res

TtestResult(statistic=-0.187535686179706, pvalue=0.8516244720942965, df=99)

Construct Confidence Interval


The bounds of the 95% confidence interval are the minimum and maximum values of the parameter popmean for which the p-value of the
test would be 0.05. The logic is the same for other confidence levels.

ci = t_res.confidence_interval(confidence_level=0.95)
ci

ConfidenceInterval(low=8447318.963773292, high=11672681.036226708)

import numpy as np
import scipy.stats as stats

t_res = stats.ttest_1samp(dfAd_sample["income"], popmean=income_mean)

sample_mean = dfAd_sample["income"].mean()
sample_std = dfAd_sample["income"].std(ddof=1)
n = len(dfAd_sample)

alpha = 0.05
t_critical = stats.t.ppf(1 - alpha / 2, df=n-1)

margin_of_error = t_critical * (sample_std / np.sqrt(n))


ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error


print(f"T-statistic: {t_res.statistic}")
print(f"P-value: {t_res.pvalue}")
print(f"95% CI: [{ci_lower}, {ci_upper}]")

