
SB Assignment 1 (Group 68)

The document outlines a group assignment cover sheet for a statistics project involving four students, detailing their personal information, unit details, and assignment specifics. It includes a comprehensive analysis of data cleaning, summary statistics, and visualizations such as histograms and scatter plots related to wage, age, and education of US workers, as well as household data from HCMC. The assignment emphasizes the importance of accurate data representation and analysis in understanding economic conditions and social dynamics.

Uploaded by

Hong Phuc Nguyen

GROUP ASSIGNMENT COVER SHEET


STUDENT DETAILS

Student name: Nguyễn Hồng Phúc Student ID number: 24007192

Student name: Nguyễn Vĩ Phong Student ID number: 23005714

Student name: Trương Trung Kiên Student ID number: 24006810

Student name: Nguyễn Trọng Bảo Student ID number: 24007721

Student name: Student ID number:


UNIT AND TUTORIAL DETAILS

Unit name: Statistics for Business Unit number: SB - T12425PWB - 2


Tutorial/Lecture: Statistic Report Class day and time: Wednesday, 12:00 - 15:15
Lecturer or Tutor name: Võ Đức Hoàng Vũ
ASSIGNMENT DETAILS

Title: Assignment 1 - Group project


Length: 1826 words Due date: 03/12/2024 Date submitted: 03/12/2024

DECLARATION
x I hold a copy of this assignment if the original is lost or damaged.

x I hereby certify that no part of this assignment or product has been copied from any other student’s work or
from any other source except where due acknowledgement is made in the assignment.
x I hereby certify that no part of this assignment or product has been submitted by me in another
(previous or current) assessment, except where appropriately referenced, and with prior permission
from the Lecturer / Tutor / Unit Coordinator for this unit.
x No part of the assignment/product has been written/ produced for me by any other person except
where collaboration has been authorised by the Lecturer / Tutor /Unit Coordinator concerned.
x I am aware that this work may be reproduced and submitted to plagiarism detection software programs for
the purpose of detecting possible plagiarism (which may retain a copy on its database for future
plagiarism checking).

Student’s signature: HONG PHUC


Student’s signature: VI PHONG
Student’s signature: TRUNG KIEN
Student’s signature: TRONG BAO
Student’s signature:

Note: An examiner or lecturer / tutor has the right to not mark this assignment if the above declaration has not
been signed.
ASSIGNMENT 1
Group 68
UEH - International School Business
SB-T12425PWB-2
Mr. Vo Duc Hoang Vu
December 03, 2024

Group Members
Nguyễn Hồng Phúc
Nguyễn Vĩ Phong
Trương Trung Kiên
Nguyễn Trọng Bảo
Question 1:
1.1 Clean data:
- First, copy the data to a new sheet to preserve the original data.
+ Step 1: Press the + next to the sheet tabs at the bottom of Excel to create a new worksheet and name it “Clean_Data”
+ Step 2: Sheet1 → Ctrl + A to select all → Ctrl + C to copy → Clean_Data → Ctrl + V to paste.
- Remove blanks:
+ Step 1: Select data → Ctrl + G to open the Go To dialog → Special → Blanks → OK
+ Step 2: Home tab → Cells group → Delete → Delete Sheet Rows


- Remove duplicates:
- There are more than 500 pairs of rows that are identical in every single column, which is highly implausible for separate respondents. Therefore, it is best to remove the duplicates to ensure the accuracy of the data.
+ Step 1: Select data → Data tab → Data Tools → Remove Duplicates → OK
- The aim of this data is most likely to survey US workers in order to understand the general job situation there. This would include people who work high-paying jobs (CEO, etc.) or people who stay and work for a single employer for a long time. Therefore, we should not eliminate the outliers, as doing so would bias our data towards middle- to low-income workers.
- After cleaning the data, 18271 entries were left.
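
For readers who want to reproduce the cleaning outside Excel, the two steps above (delete rows with blanks, then delete exact duplicates) can be sketched in Python. This is only a sketch under assumptions: the helper name `clean_rows` and the sample rows are hypothetical and are not taken from the real dataset.

```python
def clean_rows(rows):
    """Drop rows containing a blank cell, then drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Mirrors Go To Special -> Blanks -> Delete Sheet Rows.
        if any(cell is None or cell == "" for cell in row):
            continue
        # Mirrors Remove Duplicates: keep only the first copy of a row.
        if row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


# Hypothetical rows (illustrative values only).
raw = [
    (12.5, 30, 12),
    (12.5, 30, 12),   # exact duplicate of the first row
    (None, 41, 16),   # row with a blank cell
    (48.0, 52, 18),
]
cleaned = clean_rows(raw)
print(len(cleaned), "rows left after cleaning")
```

Outliers are deliberately kept, in line with the reasoning above.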

1.2 Make summary statistics for all continuous variables (mean, mode, median, min, max, standard deviation), using formulas to compute the statistical measures
- Continuous variables include: wage
- The statistical measures are computed with the following Excel formulas:
+ Mean (=AVERAGE(C2:C18272))
+ Median (=MEDIAN(C2:C18272))
+ Mode (=MODE.SNGL(C2:C18272))
+ Standard deviation (=STDEV.S(C2:C18272))
+ Minimum (=MIN(C2:C18272))
+ Maximum (=MAX(C2:C18272))
- Comments (see below for picture):

+ Wage:
● The mean wage is $20.11/hour, well above the US federal minimum wage (currently $7.25/hour), which suggests that the data set leans towards average- to high-income workers.
● The median wage is $15.63/hour, lower than the mean. This shows that the data is right-skewed, with fewer people making high wages and more people making low wages.
● Supporting the median is the mode of $10/hour, the single most common wage in the data set, which indicates that many workers earn only somewhat more than the minimum wage.
● The standard deviation of wages is $19.47/hour, almost as large as the mean, so wages vary widely between low-income and high-income jobs.
● The lowest wage in the data is $5/hour and the highest is $491/hour. This massive range is consistent with the high standard deviation. Furthermore, the data shows that some people earn less than the federal minimum wage, while a few work extremely high-paying jobs that pull the mean wage up.
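
The same six measures can be cross-checked outside Excel. The sketch below uses Python's standard `statistics` module. The `summarize` helper and the wage values are hypothetical, chosen only to show how a few high wages push the mean above the median.

```python
import statistics


def summarize(values):
    """Return the six measures computed with the Excel formulas above."""
    return {
        "mean": statistics.mean(values),      # =AVERAGE
        "median": statistics.median(values),  # =MEDIAN
        "mode": statistics.mode(values),      # =MODE.SNGL
        "stdev": statistics.stdev(values),    # =STDEV.S (sample st. dev.)
        "min": min(values),                   # =MIN
        "max": max(values),                   # =MAX
    }


# Hypothetical hourly wages, not the real data.
wages = [10.0, 10.0, 12.0, 15.0, 20.0, 53.0]
summary = summarize(wages)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy sample the mean exceeds the median, which is the same right-skew pattern described in the comments above.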
1.3. Create histogram for wage, edu, and age. Give your comments on the shape of these
histograms.

● Step 1: Select the "wage" column in the sheet "Clean_Data" → Click Insert → Click
the symbol "Statistical" → Choose the first chart "Histogram"
● Step 2: Right-click on the chart → Select Format Data Series → Adjust the bin width and the number of bins
● Step 3: Use Sturges’ Rule to choose the number of bins:
+ k (bins) = 1 + 3.3 x log(n) (where n is the sample size and log is the base-10 logarithm; in this dataset n = 18271 after the data-cleaning stage)
=> k = 1 + 3.3 x log(18271) = 15.06 ≈ 15
=> We choose 15 for the number of bins
+ Next, for the bin width, after choosing the appropriate k, we pick 5.0 as a convenient round value
Finally, we have the histogram for the wage of US workers
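
The Sturges' Rule arithmetic in Step 3 can be checked with a few lines of Python. This is a minimal sketch. The function name `sturges_bins` is our own, and it uses the rounded coefficient 3.3 from the report (the rule is often written as 1 + log2(n), which gives the same rounded result here).

```python
import math


def sturges_bins(n):
    """Number of bins k = 1 + 3.3 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.3 * math.log10(n))


k_wage = sturges_bins(18271)  # sample size after cleaning
print(k_wage)
```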

- Comments:
The distribution of wages in the United States shows that the majority of workers earn between $5 and $25 per hour. The histogram peaks at around $11 per hour, and only a small percentage of workers earn more than $50 per hour. The histogram is right-skewed, so most values are concentrated on the left side of the graph. The data comes from a survey of 18,271 American workers (after cleaning) and indicates a significant wage disparity in the United States: while most workers earn relatively modest wages, a small segment of the workforce commands exceptionally high hourly rates.
● We repeat the three steps above for ‘age’ and ‘edu’ respectively
● Histogram for age of US workers:

- We adjust the number of bins to 13


- Based on the formula for the bin width:
Bin width = (X(max) - X(min)) / k = (59 - 21) / 13 = 2.9 ≈ 3.0
- Comments:
The U.S. workforce age distribution primarily ranges from 21 to 51 years old, peaking
in the 36-46 age group. Workers in the 51-61 age group, which is the oldest, make up
a smaller portion of the workforce.
● Histogram for education of US workers:

- We adjust the number of bins to 17


- Based on the formula for the bin width:
Bin width = (X(max) - X(min)) / k = (17 - 0) / 17 = 1.0
- Comments:
The education levels of US workers typically range from 10 to 17 years. The distribution is left-skewed: most workers are concentrated at the higher education levels, with a thin tail towards fewer years of schooling. Though 12 years of schooling is the most frequent level, a large number of workers have pursued higher education. Only a small percentage of US employees have fewer than 10 years of schooling.
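
The bin-width calculations used for the age and education histograms can be reproduced as follows. This is a small sketch. The helper name `bin_width` is hypothetical, while the minimum, maximum, and bin counts are the ones quoted in this section.

```python
def bin_width(x_max, x_min, k):
    """Bin width = (X(max) - X(min)) / k."""
    return (x_max - x_min) / k


age_width = bin_width(59, 21, 13)  # age: 13 bins between 21 and 59
edu_width = bin_width(17, 0, 17)   # education: 17 bins between 0 and 17
print(round(age_width, 1), edu_width)
```

The age width of about 2.9 is rounded up to 3.0 in the chart settings so that the bins have whole-number edges.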
1.4. Make a table of frequency for each of the dummy variables (categorical variables)
defined above and give your comments.

- There are two dummy variables in this dataset: male and race. Therefore, we will
make a frequency table for each of them respectively.

+ Step 1: Select column D (male) → Click Insert → Click PivotChart to open the "Create PivotChart" box → Press OK.
+ Step 2: On the new sheet, in the PivotTable Fields pane located on the right, tick 'male' under the Field Name section. Then drag ‘male’ to the ‘Rows’ section.

+ Step 3: In the Values section, open the field's settings, select "Count”, and press OK.

● Finally, we have the frequency table for the dummy variable ‘male’

In this frequency table, there are 9,119 males and 9,152 females, so the two groups are almost equally represented.
● We do the same steps to create the frequency table for ‘race’

In this frequency table, white individuals form the largest group with 12,268 people, followed by black individuals with 4,097 people, while other races form the smallest group with 1,656 people.
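
A frequency table like the ones above can also be built with a simple counter. The sketch below is illustrative only. The `frequency_table` helper and the five sample values are hypothetical, and the 0/1 coding of `male` is an assumption about how the dummy variable is stored.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, share of the total)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total)
            for category, count in counts.items()}


# Hypothetical dummy column: 1 = male, 0 = female (assumed coding).
male = [1, 0, 1, 1, 0]
table = frequency_table(male)
for category, (count, share) in sorted(table.items()):
    print(category, count, f"{share:.1%}")

# Share of males implied by the counts reported above.
male_share = 9119 / (9119 + 9152)
print(f"{male_share:.1%}")
```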

Question 2 (50 marks):

Using mclfull.xlsx to answer this question. Data for this question is stored in mclfull.xlsx,
including 491 households in HCMC in 2020.

§ expense: household expenditure (mil. VND/month)

§ income: household monthly income (mil. VND/month)

§ age_wife: age of the wife (or female partner)

§ age_husband: age of the husband (or male partner)

§ occupation_wife: occupation of the wife (or female partner)

§ occupation_husband: occupation of the husband (or male partner)

§ hhsize: Household size (members)

§ children: % children in the household

a. make a table of summary statistics for continuous variables,

b. produce contingency tables for each of the categorical/dummy variables,

c. make a histogram of expense,

d. Make scatter plots for expense and each of the continuous variables. From the scatter
plots, point out the outliers.
A. Make a table of summary statistics for continuous variables

1/ Change “thunhap” to “income”.

2/ Go to “File” → “Options” → “Add-Ins” → In the “Manage” box, make sure “Excel Add-Ins” is selected → “Go” → Choose “Analysis ToolPak” → “OK”

3/ Select “income”; “age_husband”; “expense”; “children” columns.

4/ Choose: Data tab → “Data Analysis” → Select “Descriptive Statistics” → Enter “$A$1:$D$492” as the Input Range; tick “Labels in first row” and “Summary statistics”; choose any empty cell for the Output Range (e.g. $G$1) → Click “OK”
⇒ A table of summary statistics for continuous variables:
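
As a cross-check of the Excel output, several of the measures reported by the Descriptive Statistics tool can be recomputed per column in Python. This is a sketch under assumptions: the helper `descriptive_statistics` is our own, the column values are hypothetical, and only a subset of the tool's measures is covered (skewness and kurtosis are omitted).

```python
import math
import statistics


def descriptive_statistics(column):
    """Compute a subset of the Descriptive Statistics output for one column."""
    values = [v for v in column if v is not None]  # skip blank cells
    n = len(values)
    sd = statistics.stdev(values)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / math.sqrt(n),
        "Median": statistics.median(values),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


# Hypothetical columns in mil. VND/month, not the real mclfull.xlsx data.
data = {
    "income": [10, 20, 30, 40, 50],
    "expense": [5, 10, None, 20, 25],  # one blank cell, as in the real file
}
table = {name: descriptive_statistics(col) for name, col in data.items()}
for name, stats in table.items():
    print(name, stats)
```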

B. Produce contingency tables for each of the categorical/dummy variables

1/ Choose “Insert tab” → “PivotChart” → “PivotTable” → “OK”

2/ In PivotChart Fields, tick “occupation_husband” and “occupation_wife”. Drag one variable to Columns/Legend (Series) and to Values to obtain its table; once the table for that variable appears, do the same for the other variable.
⇒ Contingency tables for each of the categorical/dummy variables:

Comments:

The contingency table shows how husbands' and wives' occupations are paired. Notably, a considerable number of couples share the same occupation, such as "Nhan vien van phong" (office worker), which points to a prevalence of dual-income households. Such pairings suggest greater economic stability, as both partners contribute to household income. The data may also reflect broader social dynamics, such as a link between educational attainment and employment opportunities, although education is not recorded in this dataset. Overall, the table underscores the importance of occupational pairings in understanding household financial security, and further research into their impact on overall economic well-being would be worthwhile.
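
To examine the husband-wife pairing directly, a two-way contingency table can be counted from (husband, wife) occupation pairs. The sketch below is a minimal illustration. The function name `contingency_table` and the sample pairs are hypothetical, and "Other" is a placeholder label rather than a category from the dataset.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (husband occupation, wife occupation) pairs as a nested dict."""
    counts = Counter(pairs)
    rows = sorted({husband for husband, _ in pairs})
    cols = sorted({wife for _, wife in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical couples; only "Nhan vien van phong" appears in the report.
pairs = [
    ("Nhan vien van phong", "Nhan vien van phong"),
    ("Nhan vien van phong", "Other"),
    ("Other", "Other"),
    ("Nhan vien van phong", "Nhan vien van phong"),
]
table = contingency_table(pairs)
for husband, row in table.items():
    print(husband, row)
```

Cells on the diagonal count couples with the same occupation, which is the pattern the comments above refer to.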

C. Make a histogram of expense


Step 1: Copy the expense column → Sort smallest to largest (cell 147 has no data).
Largest = 180, smallest = 1, n = 490 data values
Step 2: Determine the number of bins and the bin width using Sturges' Rule:
K = 1 + 3.3log(n) = 1 + 3.3log(490) = 9.88 ≈ 10
So we choose 10 bins
Bin width = (X(max) - X(min)) / K = (180 - 1) / 10 = 17.9 ≈ 18
So bin width = 18
Step 3: Go to Insert tab → Chart group → Choose Histogram → Right-click on the Horizontal Axis → Choose Format Axis → Set the number of bins to 10 and the bin width to 18 → Change the chart title to HISTOGRAM OF EXPENSE

• The histogram of Expense looks like this:
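
The bin counts behind such a histogram can be approximated by hand with the settings chosen in Step 2 (10 bins of width 18, starting at the minimum of 1). This is a sketch under assumptions: the `histogram_counts` helper and the expense values are hypothetical, and the bins here are closed on the left, so Excel may place values that fall exactly on a bin edge differently.

```python
def histogram_counts(values, start, width, bins):
    """Count values per bin; bin i covers [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        # Clamp so that anything outside the range lands in the end bins.
        index = min(max(index, 0), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical expenses in mil. VND/month, not the real data.
expenses = [1, 8, 15, 19, 22, 40, 75, 180]
counts = histogram_counts(expenses, start=1, width=18, bins=10)
print(counts)
```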

D. Make scatter plots for expense and each of the continuous variables. From the scatter
plots, point out the outliers.
Step 1: Define the continuous variables: income (originally named "thunhap"), age_wife, age_husband, expense
Step 2: Copy the expense and income columns → Go to Insert tab → Chart group → Choose Scatter. Do the same with the expense and age_wife columns, and the expense and age_husband columns.
● The Scatter Plot: Expense vs Income looks like this:

In the Scatter Plot of Expense and Income, most values of both variables lie around or below 50. There is only one outlier, where both the expense and income values exceed 150.

● The Scatter Plot: Expense vs Age of wife looks like this:

In the Scatter Plot of Expense and Age of wife, the values are fairly evenly distributed across the age range. On the age_wife axis, there is only one outlier; on the expense axis, there are two.
● The Scatter Plot: Expense vs Age of husband looks like this:

In the Scatter Plot of Expense and Age of husband, the points are spread along the x-axis. As in the Scatter Plot of Expense and Age of wife, there are two outliers among the expense values.
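
The outliers above were identified by eye from the scatter plots. As a complementary check, which is not the method used in this report, values could be flagged with the common 1.5 x IQR rule. The sketch below assumes that rule. The helper name `iqr_outliers` and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical expenses in mil. VND/month, not the real data.
expenses = [10, 12, 14, 15, 16, 18, 20, 22, 25, 180]
outliers = iqr_outliers(expenses)
print(outliers)
```

A rule like this can flag more or fewer points than a visual reading of the plot, so the two approaches need not agree exactly.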
