SB Assignment 1 (Group 68)
DECLARATION
x I hold a copy of this assignment if the original is lost or damaged.
x I hereby certify that no part of this assignment or product has been copied from any other student’s work or
from any other source except where due acknowledgement is made in the assignment.
x I hereby certify that no part of this assignment or product has been submitted by me in another
(previous or current) assessment, except where appropriately referenced, and with prior permission
from the Lecturer / Tutor / Unit Coordinator for this unit.
x No part of the assignment/product has been written/produced for me by any other person except
where collaboration has been authorised by the Lecturer / Tutor / Unit Coordinator concerned.
x I am aware that this work may be reproduced and submitted to plagiarism detection software programs for
the purpose of detecting possible plagiarism (which may retain a copy on its database for future
plagiarism checking).
Note: An examiner or lecturer / tutor has the right to not mark this assignment if the above declaration has not
been signed.
ASSIGNMENT 1
Group 68
UEH - International School Business
SB-T12425PWB-2
Mr. Vo Duc Hoang Vu
December 03, 2024
Group Members
Nguyễn Hồng Phúc
Nguyễn Vĩ Phong
Trương Trung Kiên
Nguyễn Trọng Bảo
Question 1:
1.1 Clean data:
- First, copy the data to a new sheet so that the original data is preserved.
+ Step 1: Press the + next to the sheet tabs at the bottom of the Excel window to create a new
worksheet and name it “Clean_Data”
- Remove duplicates:
- More than 500 pairs of rows are identical in every single column, which is
implausible for survey data, so we remove the duplicates to keep the data
accurate.
+ Step 1: Select data → Data tab → Data Tools → Remove Duplicates → OK
- The aim of this data is most likely to survey US workers in order to understand
the general job situation there. This includes people who work
high-paying jobs (CEOs, etc.) and people who stay with a single employer
for a long time. Therefore, we do not remove the outliers, since doing so would
bias the data towards middle- and low-income workers.
- After cleaning the data, 18,271 entries were left.
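The duplicate-removal step above can also be cross-checked outside Excel. The following is a minimal Python sketch using only the standard library; the column layout and the sample rows are hypothetical and are not taken from the actual dataset.

```python
def remove_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order.

    This mirrors Excel's Remove Duplicates with every column selected:
    a row is dropped only if it matches an earlier row in all columns.
    """
    # dict keys are unique and keep insertion order, so the first copy survives
    return list(dict.fromkeys(rows))


# Hypothetical rows (wage, male, race); not real survey data
rows = [
    (10.0, 1, "white"),
    (15.5, 0, "black"),
    (10.0, 1, "white"),  # exact duplicate of the first row, so it is dropped
    (10.0, 0, "white"),  # differs in one column, so it is kept
]
clean_rows = remove_duplicate_rows(rows)
print(len(rows), "rows before,", len(clean_rows), "rows after")
```

Rows must be hashable (tuples here) for this approach to work.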
1.2 Make summary statistics for all continuous variables (mean, mode, median, min,
max, standard deviation), using formulas to compute the statistical measures
- Continuous variables include: wage
- Compute the statistical measures using the following Excel
formulas:
+ Mean (=AVERAGE(C2:C18272))
+ Median (=MEDIAN(C2:C18272))
+ Mode (=MODE.SNGL(C2:C18272))
+ Standard deviation (=STDEV.S(C2:C18272))
+ Minimum (=MIN(C2:C18272))
+ Maximum (=MAX(C2:C18272))
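For readers who want to reproduce these measures outside Excel, the same six statistics can be sketched with Python's standard `statistics` module. This is only an illustration: the `wages` list is hypothetical sample data, and `statistics.stdev` is the sample standard deviation, which corresponds to STDEV.S.

```python
import statistics


def summarize(values):
    """Return the six summary measures used in this section."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # If several values tie, statistics.mode returns the first one met
        "mode": statistics.mode(values),
        # Sample standard deviation (n - 1 in the denominator), like STDEV.S
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical hourly wages; not the actual survey data
wages = [10, 10, 12, 15, 20, 50]
summary = summarize(wages)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Even in this small sample, the single high value pulls the mean above the median, which is the same pattern discussed in the comments on wage.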
- Comments (see below for picture):
+ Wage:
● The mean wage is $20.11/hour, well above the US federal minimum wage (currently
$7.25/hour), which suggests that the data set leans towards average- to high-income
workers.
● The median wage is $15.63/hour, lower than the mean. This shows that the data is
right-skewed, with fewer people earning high wages and more people earning low
wages.
● Consistent with the median, the mode is $10/hour, the wage with the most
earners. It indicates that the most common wage in the data set is only
slightly above the minimum wage.
● The standard deviation of wages is 19.47, nearly as large as the mean, so wages vary
widely between people in low-income jobs and people in high-income jobs.
● The lowest wage in the data is $5/hour and the highest is $491/hour, a
massive gap that is consistent with the high standard
deviation. The data also shows that some people earn less than the federal
minimum wage while others hold extremely high-paying jobs, which pulls the
mean wage up.
1.3. Create histogram for wage, edu, and age. Give your comments on the shape of these
histograms.
● Step 1: Select the "wage" column in the sheet "Clean_Data" → Click Insert → Click
the "Statistical" chart symbol → Choose the first chart, "Histogram"
● Step 2: Right-click on the chart → Select Format Data Series → Change the
bin width and the number of bins
● Step 3: Use Sturges' Rule to choose the number of bins:
+ k (bins) = 1 + 3.3 × log10(n), where n is the sample size (in this datasheet n = 18271
after the data cleaning stage)
=> k = 1 + 3.3 × log10(18271) = 15.06 ≈ 15
=> We choose 15 for the number of bins
+ Next, for the bin width, after choosing the appropriate k, we pick 5.0 as a convenient
bin width
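The Sturges' Rule calculation in Step 3 can be checked with a few lines of Python. This is a minimal sketch; the function name `sturges_bins` is hypothetical, and the coefficient 3.3 is the rounded value used in this report.

```python
import math


def sturges_bins(n, coefficient=3.3):
    """Return (raw k, rounded k) for k = 1 + coefficient * log10(n)."""
    raw_k = 1 + coefficient * math.log10(n)
    return raw_k, round(raw_k)


# n = 18271 is the sample size after cleaning, as stated in section 1.1
raw_k, k = sturges_bins(18271)
print(f"raw k = {raw_k:.2f}, chosen number of bins = {k}")
```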
Finally, we have the histogram for the wage of US workers
- Comments:
The distribution of wages in the United States shows that the majority of workers
earn between $5 and $25 per hour. The modal wage is around $10 per hour (the mode
computed in 1.2), with only a small percentage of workers earning more than $50 per hour.
The histogram is right-skewed, so most values are concentrated on the left side of the graph.
The data covers 18,271 American workers after
cleaning and indicates a significant wage disparity in the United States: while most
workers earn relatively modest wages, a small segment of the workforce commands
exceptionally high hourly rates.
● We repeat the same three steps for ‘age’ and ‘edu’ respectively
● Histogram for age of US workers:
- There are two dummy variables in this dataset: male and race. Therefore, we
make a frequency table for each of them.
+ Step 1: Select column D (male) → Click Insert → Click PivotChart to open the
"Create PivotChart" box → Press OK.
+ Step 2: On the new sheet, in the PivotTable Fields pane on the right, tick 'male'
in the field list, then drag ‘male’ to the ‘Rows’ area.
+ Step 3: In the Values area, click the field, select "Count", and press OK.
● Finally, we have the frequency table for the dummy variable ‘male’
In this frequency table, there are 9,119 males and 9,152 females, a nearly
equal split.
● We follow the same steps to create the frequency table for ‘race’
In this frequency table, white individuals make up the largest group with 12,268 people,
followed by black individuals with 4,097 people, while other races make up the smallest group with
1,656 people.
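The frequency tables above can also be built without a pivot table. The sketch below uses `collections.Counter` from the Python standard library; the `male` list is hypothetical sample data, and `frequency_table` is an illustrative helper name rather than anything used in the assignment.

```python
from collections import Counter


def frequency_table(values):
    """Map each distinct value to a (count, share of total) pair."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (count, count / total) for value, count in counts.items()}


# Hypothetical dummy variable (1 = male, 0 = female); not the survey data
male = [1, 0, 0, 1, 1, 0, 0, 0]
table = frequency_table(male)
for value, (count, share) in sorted(table.items()):
    print(f"male = {value}: count = {count}, share = {share:.1%}")
```

The same function works unchanged for a text column such as race, since `Counter` accepts any hashable values.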
Use mclfull.xlsx to answer this question. The data for this question is stored in mclfull.xlsx
and covers 491 households in HCMC in 2020.
A. Make a table of summary statistics for continuous variables
4/ Data tab → Choose “Data Analysis” → Select “Descriptive Statistics” → Enter
“$A$1:$D$492” as the Input Range; tick “Labels in first row” and “Summary statistics”; choose any
empty cell for the Output Range (e.g. $G$1) → Click “OK”
⇒ A table of summary statistics for continuous variables:
Comments:
The contingency table shows the relationship between husbands' and wives' occupations
and reveals clear patterns. Notably, a considerable number of couples share the same
occupation, such as "Nhan vien van phong" (office staff), indicating a prevalent trend of
dual-income households. This pairing suggests enhanced economic stability, as both partners
contribute to household income. The data also reflects broader social dynamics, where
educational attainment correlates with employment opportunities. Overall, the table
underscores the importance of occupational relationships in understanding household
financial security, highlighting the need for further research into the impact of these pairings
on overall economic well-being.
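A contingency table like the one discussed above counts how often each pair of categories occurs together. The following minimal Python sketch shows the idea; the two occupation lists are hypothetical, and "Other" is a placeholder label that does not come from the dataset.

```python
from collections import Counter


def contingency_table(row_values, col_values):
    """Count each (row category, column category) pair."""
    return Counter(zip(row_values, col_values))


# Hypothetical occupations for four couples; not the actual household data
husband_job = ["Nhan vien van phong", "Other", "Nhan vien van phong", "Other"]
wife_job = ["Nhan vien van phong", "Nhan vien van phong", "Nhan vien van phong", "Other"]

table = contingency_table(husband_job, wife_job)

# Couples on the diagonal of the table share the same occupation
same_occupation = sum(count for (h, w), count in table.items() if h == w)
print("Couples with the same occupation:", same_occupation)
```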
D. Make scatter plots for expense and each of the continuous variables. From the scatter
plots, point out the outliers.
Step 1: Identify the continuous variables: thunhap (income), age_wife, age_husband, expense
Step 2: Select the expense and income (thunhap) columns → Go to Insert tab → Charts group → Choose
Scatter. Do the same with the expense and age_wife columns, and the expense and
age_husband columns.
● The Scatter Plot: Expense vs Income looks like this:
In the scatter plot of expense and income, most values lie around or
below 50. There is only one outlier, where both expense and income exceed 150.
In the scatter plot of expense and age of wife, the values are fairly evenly
distributed across the age range. There is one outlier on the
age_wife axis and two on the expense axis.
● The Scatter Plot: Expense vs Age of husband looks like this:
In the scatter plot of expense and age of husband, the values are spread along the x-axis. As in
the scatter plot of expense and age of wife, there are two outliers on the expense axis.
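The outliers above were identified by eye from the scatter plots. As a complementary numerical check, one common convention (an assumption here, not the method used in this report) is to flag values lying more than 1.5 × IQR outside the quartiles. The sketch below applies that rule to a hypothetical `expense` list.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical expense values; not the actual household data
expense = [20, 25, 30, 35, 40, 45, 50, 160]
outliers = iqr_outliers(expense)
print("Flagged as outliers:", outliers)
```

A rule like this may flag more or fewer points than a visual inspection, so it is best used alongside the scatter plots rather than in place of them.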