230103-ECON209 S2025 Lab 2.ipynb-Colab

The document outlines a Jupyter notebook for conducting exploratory data analysis (EDA) using Python libraries such as pandas, seaborn, and matplotlib. It includes steps for importing a dataset from Google Drive, checking for missing values, and performing descriptive statistics on various columns including age, income, and education level. The analysis reveals insights into the dataset, including means, medians, and frequency counts for different variables.

Uploaded by

mthunguyen.work

2/15/25, 10:23 PM [230103] ECON209_S2025__Lab_2.ipynb - Colab

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Writing LaTeX in Colab

It's the same as what we do on Overleaf, but here is the guide by Colab for your convenience.
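For example, a text cell containing the following markup renders as a display equation (the sample-mean formula here is just an illustration; any LaTeX between the dollar signs works):

```latex
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
```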

Importing external dataset

Option 1
#Upload from drive
#Remember to upload your file ONTO GOOGLE DRIVE and paste the file path EXACTLY!
from google.colab import drive
drive.mount('/content/drive')
dfAd = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/PAPI2018_sample_clean.csv') #paste the file path exactly


Exploratory Data Analysis (EDA)

Basic EDA
print(dfAd.head())

   Unnamed: 0     id  urban  female   age  time_in_commune_or_ward  \
0           0   7014      1       1  56.0                     20.0
1           1   7003      1       0  37.0                     37.0
2           2   3780      1       1  34.0                     34.0
3           3  13742      1       1  36.0                     36.0
4           4  11886      0       0  61.0                     61.0

   time_in_province  lv_educ  no_family_members  party_member      income
0                20      4.0                  2             0   5000000.0
1                37      6.0                  5             0   7000000.0
2                34      8.0                  4             1  15000000.0
3                36      6.0                  3             1  15000000.0
4                61      0.0                  3             0   5000000.0

dfAd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5000 non-null int64
1 id 5000 non-null int64
2 urban 5000 non-null int64
3 female 5000 non-null int64
4 age 4994 non-null float64
5 time_in_commune_or_ward 5000 non-null float64
6 time_in_province 5000 non-null int64
7 lv_educ 4997 non-null float64
8 no_family_members 5000 non-null int64
9 party_member 5000 non-null int64
10 income 4563 non-null float64
dtypes: float64(4), int64(7)
memory usage: 429.8 KB

# Check missing values


print('Missing values: %i' % dfAd.isnull().sum().sum())

Missing values: 446
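The double `.sum()` above collapses everything to one grand total. Dropping the second `.sum()` gives per-column missing counts, which is usually more informative. A minimal sketch on a toy frame (hypothetical values, standing in for `dfAd`, which is not reproduced here):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for dfAd (hypothetical values, not the PAPI data)
toy = pd.DataFrame({
    'age':    [56.0, np.nan, 34.0],
    'income': [5e6, 7e6, np.nan],
})

# Per-column missing counts: one .sum() instead of two
per_column = toy.isnull().sum()
print(per_column)

# Grand total, as in the notebook
print('Missing values: %i' % per_column.sum())
```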

https://colab.research.google.com/drive/1wdiU4z6X7O8LtE1Khs6hwkfgz2s2I-lf#scrollTo=EV2ke4YlNOp1 1/10
2/15/25, 10:23 PM [230103] ECON209_S2025__Lab_2.ipynb - Colab
# Drop missing data if any
dfAd = dfAd.dropna()

# We can quickly get descriptive table


dfAd.describe()

       Unnamed: 0           id        urban       female          age  time_in_commune_or_ward  time_in_province     lv_educ
count  4557.000000  4557.000000  4557.000000  4557.000000  4557.000000              4557.000000       4557.000000  4557.000…
mean   2508.703533  7147.316656     0.600176     0.523371    48.914198                38.429339         45.693878     4.478…
std    1443.518452  4141.753328     0.489916     0.499508    11.582921                57.320103         50.448311     2.240…
min       0.000000     1.000000     0.000000     0.000000    18.000000                 1.000000          2.000000     0.000…
25%    1261.000000  3603.000000     0.000000     0.000000    40.000000                22.000000         33.000000     3.000…
50%    2509.000000  7064.000000     1.000000     1.000000    50.000000                35.000000         43.000000     4.000…
75%    3760.000000 10684.000000     1.000000     1.000000    58.000000                48.000000         54.000000     6.000…
max    4998.000000 14445.000000     1.000000     1.000000    95.000000               888.000000        888.000000     9.000…

Looking at 'count', we see that no missing values remain after the dropna() above.


Comparing the median (50%) with the mean suggests whether the distribution is likely left-skewed or right-skewed: if the median is less than the mean, the distribution is likely right-skewed; if the median is greater than the mean, it is likely left-skewed.
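This rule of thumb can be wrapped in a small helper. A sketch (the function name `skew_direction` is ours, not part of the lab):

```python
import pandas as pd

def skew_direction(s: pd.Series) -> str:
    """Rough skew diagnosis by comparing mean and median."""
    mean, median = s.mean(), s.median()
    if median < mean:
        return 'right-skewed'
    if median > mean:
        return 'left-skewed'
    return 'approximately symmetric'

# A long right tail pulls the mean (21.6) above the median (2)
s = pd.Series([1, 2, 2, 3, 100])
print(skew_direction(s))  # right-skewed
```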

Age
# or we can get more details for a given column
age_mean = dfAd['age'].mean()
age_var = dfAd['age'].var()
age_std = dfAd['age'].std()
age_median = dfAd['age'].median()
[q1, q3] = dfAd['age'].quantile([.25, .75]).values
age_min = dfAd['age'].min()
age_max = dfAd['age'].max()

print('mean:', age_mean)
print('var:', age_var)
print('stdev:', age_std)
print('median:', age_median)
print('q1:', q1)
print('q3:', q3)
print('min:', age_min)
print('max:', age_max)

mean: 48.914197937239415
var: 134.16405869039997
stdev: 11.582920991287127
median: 50.0
q1: 40.0
q3: 58.0
min: 18.0
max: 95.0

time_in_province
time_in_province_mean = dfAd['time_in_province'].mean()
time_in_province_var = dfAd['time_in_province'].var()
time_in_province_std = dfAd['time_in_province'].std()
time_in_province_median = dfAd['time_in_province'].median()
[q1, q3] = dfAd['time_in_province'].quantile([.25, .75]).values
time_in_province_min = dfAd['time_in_province'].min()
time_in_province_max = dfAd['time_in_province'].max()

print('mean:', time_in_province_mean)
print('var:', time_in_province_var)
print('stdev:', time_in_province_std)

print('median:', time_in_province_median)
print('q1:', q1)
print('q3:', q3)
print('min:', time_in_province_min)
print('max:', time_in_province_max)

mean: 45.69387755102041
var: 2545.0320366952697
stdev: 50.448310543518396
median: 43.0
q1: 33.0
q3: 54.0
min: 2
max: 888
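Note the max of 888 against a mean of about 46: 888 looks like a codebook sentinel (e.g. "whole life" or "don't know") rather than a real number of years. That is an assumption to verify against the PAPI 2018 codebook before acting on it. If it holds, such codes should be masked before computing statistics; a sketch on toy values:

```python
import pandas as pd
import numpy as np

# Hypothetical sentinel check -- treating 888 as a missing-data code is an
# assumption; confirm it in the PAPI 2018 codebook first.
s = pd.Series([20, 37, 888, 61, 888, 43])

n_sentinel = (s == 888).sum()
cleaned = s.replace(888, np.nan)   # NaN is ignored by mean/var/etc.

print('sentinel rows:', n_sentinel)         # 2
print('mean with 888:', s.mean())
print('mean without 888:', cleaned.mean())  # (20+37+61+43)/4 = 40.25
```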

no_family_members
no_family_members_mean = dfAd['no_family_members'].mean()
no_family_members_var = dfAd['no_family_members'].var()
no_family_members_std = dfAd['no_family_members'].std()
no_family_members_median = dfAd['no_family_members'].median()
[q1, q3] = dfAd['no_family_members'].quantile([.25, .75]).values
no_family_members_min = dfAd['no_family_members'].min()
no_family_members_max = dfAd['no_family_members'].max()

print('mean:', no_family_members_mean)
print('var:', no_family_members_var)
print('stdev:', no_family_members_std)
print('median:', no_family_members_median)
print('q1:', q1)
print('q3:', q3)
print('min:', no_family_members_min)
print('max:', no_family_members_max)

mean: 4.393899495281984
var: 3.0232551373942944
stdev: 1.7387510280066822
median: 4.0
q1: 3.0
q3: 5.0
min: 1
max: 20

income
income_mean = dfAd['income'].mean()
income_var = dfAd['income'].var()
income_std = dfAd['income'].std()
income_median = dfAd['income'].median()
[q1, q3] = dfAd['income'].quantile([.25, .75]).values
income_min = dfAd['income'].min()
income_max = dfAd['income'].max()

print('mean:', income_mean)
print('var:', income_var)
print('stdev:', income_std)
print('median:', income_median)
print('q1:', q1)
print('q3:', q3)
print('min:', income_min)
print('max:', income_max)

mean: 10212420.452051789
var: 131753814091836.88
stdev: 11478406.426496532
median: 7000000.0
q1: 5000000.0
q3: 13000000.0
min: 1000000.0
max: 200000000.0

lv_educ

# or we can get more details for a given column


lv_educ_mean = dfAd['lv_educ'].mean()
lv_educ_var = dfAd['lv_educ'].var()
lv_educ_std = dfAd['lv_educ'].std()
lv_educ_median = dfAd['lv_educ'].median()
[q1, q3] = dfAd['lv_educ'].quantile([.25, .75]).values
lv_educ_min = dfAd['lv_educ'].min()
lv_educ_max = dfAd['lv_educ'].max()

print('mean:', lv_educ_mean)
print('var:', lv_educ_var)
print('stdev:', lv_educ_std)
print('median:', lv_educ_median)
print('q1:', q1)
print('q3:', q3)
print('min:', lv_educ_min)
print('max:', lv_educ_max)

mean: 4.47816545973228
var: 5.019990663574289
stdev: 2.2405335667144755
median: 4.0
q1: 3.0
q3: 6.0
min: 0.0
max: 9.0

urban
# or we can get more details for a given column
urban_mean = dfAd['urban'].mean()
urban_var = dfAd['urban'].var()
urban_std = dfAd['urban'].std()
urban_median = dfAd['urban'].median()
[q1, q3] = dfAd['urban'].quantile([.25, .75]).values
urban_min = dfAd['urban'].min()
urban_max = dfAd['urban'].max()

print('mean:', urban_mean)
print('var:', urban_var)
print('stdev:', urban_std)
print('median:', urban_median)
print('q1:', q1)
print('q3:', q3)
print('min:',urban_min)
print('max:', urban_max)

mean: 0.6001755540926048
var: 0.24001752843651028
stdev: 0.48991583811559947
median: 1.0
q1: 0.0
q3: 1.0
min: 0
max: 1

party_member
# or we can get more details for a given column
party_member_mean = dfAd['party_member'].mean()
party_member_var = dfAd['party_member'].var()
party_member_std = dfAd['party_member'].std()
party_member_median = dfAd['party_member'].median()
[q1, q3] = dfAd['party_member'].quantile([.25, .75]).values
party_member_min = dfAd['party_member'].min()
party_member_max = dfAd['party_member'].max()

print('mean:', party_member_mean)
print('var:', party_member_var)
print('stdev:', party_member_std)
print('median:', party_member_median)

print('q1:', q1)
print('q3:', q3)
print('min:',party_member_min)
print('max:', party_member_max)

# Check the frequency of some variable


dfAd['age'].value_counts()

count
age
60.0    175
55.0    169
58.0    166
50.0    156
56.0    153
...     ...
19.0      3
78.0      2
77.0      1
80.0      1
95.0      1

62 rows × 1 columns

dtype: int64
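`value_counts` can also report relative frequencies instead of raw counts via `normalize=True`. A sketch on a toy series (not the PAPI data):

```python
import pandas as pd

s = pd.Series([1, 1, 0, 1, 0])

print(s.value_counts())                # absolute counts: 1 -> 3, 0 -> 2
print(s.value_counts(normalize=True)) # shares: 1 -> 0.6, 0 -> 0.4
```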

# We can select specific columns to be analysed


dfAd[['age', 'time_in_province']].head(10)

     age  time_in_province
0   56.0                20
1   37.0                37
2   34.0                34
3   36.0                36
4   61.0                61
5   40.0                40
7   47.0                47
8   63.0                63
9   55.0                55
10  41.0                41

Univariate Analysis & Plotting

Univariate analysis: descriptive analysis for a single variable

Histograms - Age
#Histograms, distribution plots, boxplots are all good univariate analysis tools
figure, axes = plt.subplots(1, 2, figsize=(20,10)) #Create a grid with multiple sub-plots if you want to display all plots together

#Histograms
sns.histplot(ax = axes[0], data = dfAd['age'], discrete=True)

#Distribution plots
sns.histplot(ax = axes[1], data = dfAd['age'], stat = 'probability', element = 'step')

<Axes: xlabel='age', ylabel='Probability'>

#In case you are using the free version of Colab with around 12GB of RAM, it might not be able to handle the seaborn/sns plots.
#Matplotlib/plt is an alternative in these cases, although the plots might not look as good.

plt.hist(dfAd['age'], bins = 20)


plt.show()

Histograms - Income
plt.hist(dfAd['income'], bins = 20)
plt.show()
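Because income here spans 1,000,000 to 200,000,000, evenly spaced bins lump almost all observations into the first one or two bars. A common fix is logarithmically spaced bins with a log x-axis. A sketch with synthetic right-skewed incomes (hypothetical values, since `dfAd` is not loaded here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed incomes standing in for dfAd['income']
rng = np.random.default_rng(0)
income = rng.lognormal(mean=16, sigma=0.8, size=1000)

# Logarithmically spaced bin edges spread the long right tail over many bins
bins = np.logspace(np.log10(income.min()), np.log10(income.max()), 20)
plt.hist(income, bins=bins)
plt.xscale('log')
plt.xlabel('income (log scale)')
plt.show()
```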


Histograms - lv_educ
plt.hist(dfAd['lv_educ'], bins = 20)
plt.show()

Box plots - income of party-member vs non-party-member respondents


import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.boxplot(x='party_member', y='income', data=dfAd)

plt.xlabel('Party Member (0 = Non-member, 1 = Member)')


plt.ylabel('Income')
plt.title('Box Plot of Income by Party Membership')

plt.show()
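The boxplot comparison can be backed with grouped descriptive statistics via `groupby(...).describe()`. A sketch on a toy frame standing in for `dfAd[['party_member', 'income']]` (hypothetical values):

```python
import pandas as pd

# Toy stand-in for dfAd[['party_member', 'income']] -- hypothetical values
toy = pd.DataFrame({
    'party_member': [0, 0, 0, 1, 1],
    'income': [5e6, 7e6, 6e6, 15e6, 13e6],
})

# One row of summary statistics per group
summary = toy.groupby('party_member')['income'].describe()
print(summary[['count', 'mean', '50%']])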


Bivariate Analysis & Plotting


Bivariate analysis: descriptive analysis to see relationship among different variables

#A bivariate analysis tool is the scatterplot


sns.scatterplot(x = 'lv_educ', y = 'income', data = dfAd)

<Axes: xlabel='lv_educ', ylabel='income'>
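A scatterplot shows the relationship visually; a correlation coefficient quantifies it. A sketch on toy data (hypothetical values, not the PAPI sample):

```python
import pandas as pd

# Toy stand-in for dfAd[['lv_educ', 'income']] -- hypothetical values
toy = pd.DataFrame({
    'lv_educ': [1, 2, 3, 4, 5],
    'income':  [4e6, 5e6, 7e6, 8e6, 12e6],
})

# Pearson correlation matrix for the two plotted variables
print(toy[['lv_educ', 'income']].corr())
```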

Random sampling


#Let's draw 1 sample of size 100 from dfAd. Pandas has a built-in sample function, so it's pretty easy
n_size = 100 #define the sample size
dfAd_sample = dfAd.sample(n = n_size, replace = False)
dfAd_sample


      Unnamed: 0    id  urban  female   age  time_in_commune_or_ward  time_in_province  lv_educ  no_family_members  part…
2099        2099  2472      1       0  52.0                     20.0                20      4.0                  4  …
2568        2568  1806      1       0  58.0                     58.0                58      4.0                  3  …
744          744  3536      0       0  66.0                     66.0                66      6.0                  9  …
1148        1148   859      1       1  41.0                     10.0                41      6.0                  4  …
1538        1538  3878      0       0  56.0                     56.0                56      6.0                  5  …
...          ...   ...    ...     ...   ...                      ...               ...      ...                ...  …
334          334  4488      0       0  62.0                     62.0                62      0.0                  2  …
290          290  8000      1       1  60.0                     40.0                60      6.0                  2  …
1678        1678  3793      1       0  53.0                     40.0                47      4.0                  3  …
851          851   315      1       0  47.0                     25.0                47      5.0                  5  …
4034        4034  9254      0       1  42.0                     42.0                42      3.0                  5  …

100 rows × 11 columns
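Note that `sample` draws a different random sample on every run. Passing `random_state` fixes the seed so the draw is reproducible, which matters when you want classmates (or your future self) to get the same t-test results. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# Fixing random_state makes the draw reproducible across runs
a = df.sample(n=10, replace=False, random_state=42)
b = df.sample(n=10, replace=False, random_state=42)

print(a.index.equals(b.index))  # True: same rows both times
```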

Hypothesis testing with t-test


from scipy import stats

One-sample t-test


Suppose we want to test whether the mean income in dfAd_sample is consistent with the full-sample mean of income (treated here as the population mean). The confidence level chosen here is
95%, meaning the significance level $\alpha = 0.05$.

t_res = stats.ttest_1samp(dfAd_sample["income"], popmean=income_mean)


t_res

TtestResult(statistic=-0.187535686179706, pvalue=0.8516244720942965, df=99)

Construct Confidence Interval


The bounds of the 95% confidence interval are the minimum and maximum values of the parameter popmean for which the p-value of the
test would be 0.05. The logic is the same for other confidence levels.

ci = t_res.confidence_interval(confidence_level=0.95)
ci

ConfidenceInterval(low=8447318.963773292, high=11672681.036226708)

import numpy as np
import scipy.stats as stats

t_res = stats.ttest_1samp(dfAd_sample["income"], popmean=income_mean)

sample_mean = dfAd_sample["income"].mean()
sample_std = dfAd_sample["income"].std(ddof=1)
n = len(dfAd_sample)

alpha = 0.05
t_critical = stats.t.ppf(1 - alpha / 2, df=n-1)

margin_of_error = t_critical * (sample_std / np.sqrt(n))


ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error


print(f"T-statistic: {t_res.statistic}")
print(f"P-value: {t_res.pvalue}")
print(f"95% CI: [{ci_lower}, {ci_upper}]")

