SB Assignment 1 (Group 68)

The document outlines a group assignment cover sheet for a statistics project involving four students, detailing their personal information, unit details, and assignment specifics. It includes a comprehensive analysis of data cleaning, summary statistics, and visualizations such as histograms and scatter plots related to wage, age, and education of US workers, as well as household data from HCMC. The assignment emphasizes the importance of accurate data representation and analysis in understanding economic conditions and social dynamics.

Uploaded by

Hong Phuc Nguyen

GROUP ASSIGNMENT COVER SHEET

STUDENT DETAILS

Student name: Nguyễn Hồng Phúc Student ID number: 24007192

Student name: Nguyễn Vĩ Phong Student ID number: 23005714

Student name: Trương Trung Kiên Student ID number: 24006810

Student name: Nguyễn Trọng Bảo Student ID number: 24007721

Student name: Student ID number:

UNIT AND TUTORIAL DETAILS

Unit name: Statistics for Business Unit number: SB - T12425PWB - 2

Tutorial/Lecture: Statistic Report Class day and time: Wednesday, 12:00 - 15:15
Lecturer or Tutor name: Võ Đức Hoàng Vũ
ASSIGNMENT DETAILS

Title: Assignment 1 - Group project

Length: 1826 words Due date: 03/12/2024 Date submitted: 03/12/2024

DECLARATION
x I hold a copy of this assignment if the original is lost or damaged.

x I hereby certify that no part of this assignment or product has been copied from any other student’s work or
from any other source except where due acknowledgement is made in the assignment.
x I hereby certify that no part of this assignment or product has been submitted by me in another
(previous or current) assessment, except where appropriately referenced, and with prior permission
from the Lecturer / Tutor / Unit Coordinator for this unit.
x No part of the assignment/product has been written/ produced for me by any other person except
where collaboration has been authorised by the Lecturer / Tutor /Unit Coordinator concerned.
x I am aware that this work may be reproduced and submitted to plagiarism detection software programs for
the purpose of detecting possible plagiarism (which may retain a copy on its database for future
plagiarism checking).

Student’s signature: HONG PHUC

Student’s signature: VI PHONG
Student’s signature: TRUNG KIEN
Student’s signature: TRONG BAO
Student’s signature:

Note: An examiner or lecturer / tutor has the right to not mark this assignment if the above declaration has not
been signed.
ASSIGNMENT 1
Group 68
UEH - International School Business
SB-T12425PWB-2
Mr. Vo Duc Hoang Vu
December 03, 2024

Group Members
Nguyễn Hồng Phúc
Nguyễn Vĩ Phong
Trương Trung Kiên
Nguyễn Trọng Bảo
Question 1:
1.1 Clean data:
- First, copy the data to a new sheet so that the original data is preserved.
+ Step 1: Click the + next to the sheet tabs at the bottom of the Excel window to
create a new worksheet and name it “Clean_Data”

+ Step 2: Sheet1 → Ctrl + A to select all → Ctrl + C to copy → Clean_Data →
Ctrl + V to paste.
- Remove blanks:
+ Step 1: Select data → Ctrl + G to open the Go To dialog → Special → Blanks → OK

+ Step 2: Home tab → Cells group → Delete → Delete Sheet Rows

- Remove duplicates:
- There are upwards of 500 pairs of rows that are identical in every single
column, which is highly implausible for independently surveyed workers, so we
remove the duplicates to keep the data accurate.
+ Step 1: Select data → Data tab → Data Tools → Remove Duplicates → OK
- The aim of this data is most likely to survey US workers in order to describe
the general job situation there. This includes people who work high-paying jobs
(CEOs, etc.) and people who have stayed with a single employer for a long time.
Therefore, we do not remove the outliers, as doing so would bias the data
towards middle- and low-income workers.
- After cleaning the data, 18271 entries were left.
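The two cleaning steps above (delete rows with blanks, then remove exact duplicates) can also be sketched outside Excel. The following is a minimal Python sketch, not the procedure the group actually ran; the function name `clean_rows` and the sample rows are hypothetical, and it assumes a row is dropped if any of its cells is blank and that the first copy of each duplicate is kept.

```python
def clean_rows(rows):
    """Drop rows containing a blank cell, then drop exact duplicates (keep first)."""
    seen = set()
    cleaned = []
    for row in rows:
        # A cell counts as blank if it is None or an empty/whitespace-only string.
        if any(cell is None or str(cell).strip() == "" for cell in row):
            continue
        key = tuple(row)
        if key in seen:
            # Identical in every column to an earlier row: skip it.
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows, not taken from the assignment data.
sample = [
    (10.0, 12, 30),
    (10.0, 12, 30),    # exact duplicate of the first row
    (25.5, None, 41),  # contains a blank cell
    (15.0, 16, 28),
]
print(clean_rows(sample))
```

Keeping the first copy of each duplicate mirrors how a "remove duplicates" step normally behaves; outliers are deliberately left in, as argued above.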

1.2 Make summary statistics for all continuous variables (mean, mode, median, min,
max, standard deviation), using formulas to compute the statistical measures
- Continuous variables include: wage
- Compute the statistical measures using the following Excel formulas:
+ Mean (=AVERAGE(C2:C18272))
+ Median (=MEDIAN(C2:C18272))
+ Mode(=MODE.SNGL(C2:C18272))
+ Standard deviation (=STDEV.S(C2:C18272))
+ Minimum (=MIN(C2:C18272))
+ Maximum (=MAX(C2:C18272))
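For readers who want to cross-check the Excel formulas above, here is a rough Python equivalent using only the standard `statistics` module. The `wages` list is a hypothetical stand-in for the wage column (C2:C18272), and `statistics.stdev` is used because it computes the sample standard deviation, which is what STDEV.S computes. When several values tie for the mode, the value returned may differ from Excel's.

```python
import statistics

# Hypothetical stand-in for the wage column; not the assignment data.
wages = [5.0, 10.0, 10.0, 12.5, 15.0, 20.0, 48.0]

summary = {
    "mean": statistics.mean(wages),      # =AVERAGE(...)
    "median": statistics.median(wages),  # =MEDIAN(...)
    "mode": statistics.mode(wages),      # =MODE.SNGL(...)
    "stdev": statistics.stdev(wages),    # =STDEV.S(...), sample standard deviation
    "min": min(wages),                   # =MIN(...)
    "max": max(wages),                   # =MAX(...)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```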
- Comments (see below for picture):

+ Wage:
● The mean wage is $20.11/hour, well above the US federal minimum wage (currently
$7.25/hour), which suggests that the data set is weighted towards average- to
high-income workers.
● The median wage is $15.63/hour, lower than the mean. This shows that the data is
right-skewed, with fewer people making high wages and more people making low
wages.
● Supporting the median is the mode wage of $10/hour, the wage with the most
earners, which indicates that the most common wage in the data set is only slightly
above the minimum wage.
● The standard deviation for wages is $19.47/hour, nearly as large as the mean, so
wages vary widely between people in low-income jobs and people in high-income jobs.
● The lowest wage in the data set is $5/hour and the highest is $491/hour, a
massive spread that is consistent with the high standard deviation. Furthermore,
the data shows that some people earn less than the minimum wage, while a few work
extremely high-paying jobs that pull the mean wage up.
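The claim that a mean above the median signals right skew can be put into a single number. The sketch below uses Pearson's second (median) skewness coefficient, 3 × (mean − median) / standard deviation. This measure is not part of the original analysis; it is added here only as an illustration, using the three wage statistics reported above.

```python
def pearson_median_skewness(mean, median, stdev):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    return 3 * (mean - median) / stdev


# Wage statistics reported in section 1.2.
skew = pearson_median_skewness(mean=20.11, median=15.63, stdev=19.47)
print(round(skew, 2))  # positive value -> right-skewed
```

A positive coefficient agrees with the reading above: a long tail of high wages pulls the mean above the median.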
1.3. Create histogram for wage, edu, and age. Give your comments on the shape of these
histograms.

● Step 1: Select the "wage" column in the sheet "Clean_Data" → Click Insert → Click
the "Statistical" chart symbol → Choose the first chart, "Histogram"
● Step 2: Right-click the horizontal axis → Select Format Axis → Change the
bin width or the number of bins
● Step 3: Use Sturges’ Rule to choose the number of bins:
+ k (bins) = 1 + 3.3 x log(n) (where n is the sample size; in this data sheet
n = 18271 after the ‘clean data’ stage)
=> k = 1 + 3.3 x log(18271) = 15.06 ≈ 15
=> We choose 15 for the number of bins
+ Next, for the bin width, we pick 5.0 as a convenient round value
Finally, we have the histogram for the wage of US workers
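The Sturges' Rule arithmetic above can be checked with a few lines of Python. This is only a sketch; the function name `sturges_bins` is made up for illustration, and it assumes the base-10 logarithm and the 3.3 multiplier used in the formula above.

```python
import math


def sturges_bins(n):
    """Number of bins by the rule used above: k = 1 + 3.3 * log10(n), rounded."""
    return round(1 + 3.3 * math.log10(n))


n = 18271  # sample size after the cleaning stage
print(1 + 3.3 * math.log10(n))  # unrounded value of k
print(sturges_bins(n))          # rounded number of bins
```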

- Comments:
The distribution of wages in the United States shows that the majority of workers
earn between $5 and $25 per hour. The modal wage is around $11 per hour, with only
a small percentage of workers earning more than $50 per hour. Because the
histogram is right-skewed, most values are concentrated on the left side of the
graph. The data comes from a survey of 18,271 American workers (after cleaning)
and indicates a significant wage disparity in the United States: while most
workers earn relatively modest wages, a small segment of the workforce commands
exceptionally high hourly rates.
● We repeat the three steps above for ‘age’ and ‘edu’ respectively
● Histogram for age of US workers:

- We adjust the number of bins to 13

- Based on the formula for bin width:
Bin width = (X(max) - X(min)) / k = (59 - 21) / 13 = 2.9 ≈ 3.0
- Comments:
The U.S. workforce age distribution primarily ranges from 21 to 51 years old, peaking
in the 36-46 age group. Workers in the 51-61 age group, which is the oldest, make up
a smaller portion of the workforce.
● Histogram for education of US workers:

- We adjust the number of bins to 17

- Based on the formula for bin width:
Bin width = (X(max) - X(min)) / k = (17 - 0) / 17 = 1.0
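The two bin-width calculations for age and education follow the same formula, (max − min) / k. A short sketch, with a hypothetical helper name `bin_width` and the minimum, maximum and bin counts taken from the text above:

```python
def bin_width(x_max, x_min, k):
    """Width of each bin when the range [x_min, x_max] is split into k bins."""
    return (x_max - x_min) / k


age_width = bin_width(x_max=59, x_min=21, k=13)  # age: 13 bins
edu_width = bin_width(x_max=17, x_min=0, k=17)   # education: 17 bins

print(round(age_width, 1))  # rounded up to 3.0 in the report
print(edu_width)
```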
- Comments:
US workers' education levels typically range from 10 to 17 years, with a
left-skewed distribution, which means that the majority of workers have completed
more than 12 years of education. Although 12 years of schooling is the most
frequent level, a large number of workers have pursued higher education. Only a small
percentage of US employees have fewer than 10 years of schooling.
1.4. Make a table of frequency for each of the dummy variables (categorical variables)
defined above and give your comments.

- There are two dummy variables in this dataset: male and race. Therefore, we will
make a frequency table for each of them respectively.

+ Step 1: Select column D (male) → Click Insert → Click PivotChart to open the
"Create PivotChart" box → Press OK.
+ Step 2: On the new sheet, in the PivotTable Fields pane on the right, tick 'male'
in the field list. Then drag ‘male’ to the ‘Rows’ section.

+ Step 3: In the Values section, click the ‘male’ field, open Value Field Settings,
select "Count", and press OK.

● Finally, we have the frequency table for the dummy variable ‘male’

In this frequency table, there are 9,119 males and 9,152 females, so the two groups make up
nearly equal proportions of the sample.
● We do the same steps to create the frequency table for ‘race’

In this frequency table, white individuals have the highest proportion with 12,268 people,
followed by black individuals with 4,097 people, and other races have the smallest proportion with
1,656 people.
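The PivotTable count used for both frequency tables amounts to counting how often each category appears. A minimal Python sketch with `collections.Counter` is shown below; the helper `frequency_table` and the short `male` list are hypothetical, while the 9,119 and 9,152 counts are the ones reported above.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, share of total)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}


# Hypothetical dummy column: 1 = male, 0 = female.
male = [1, 0, 0, 1, 1, 0]
table = frequency_table(male)
print(table)

# Share of males using the counts reported in the frequency table above.
male_share = 9119 / (9119 + 9152)
print(round(male_share, 3))
```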

Question 2 (50 marks):

Use mclfull.xlsx to answer this question. The data for this question is stored in
mclfull.xlsx and covers 491 households in HCMC in 2020.

§ expense: household expenditure (mil. VND/month)

§ income: household monthly income (mil. VND/month)

§ age_wife: age of the wife (or female partner)

§ age_husband: age of the husband (or male partner)

§ occupation_wife: occupation of the wife (or female partner)

§ occupation_husband: occupation of the husband (or male partner)

§ hhsize: Household size (members)

§ children: % children in the household

a. make a table of summary statistics for continuous variables,

b. produce contingency tables for each of the categorical/dummy variables,

c. make a histogram of expense,

d. Make scatter plots for expense and each of the continuous variables. From the scatter
plots, point out the outliers.
A. Make a table of summary statistics for continuous variables

1/ Rename the column “thunhap” (Vietnamese for “income”) to “income”.

2/ Go to “File” → “Options” → “Add-Ins” → In the “Manage” box, make sure “Excel Add-ins”
is selected → “Go” → Tick “Analysis ToolPak” → “OK”

3/ Select the “income”, “age_husband”, “expense”, and “children” columns.

4/ Data tab → “Data Analysis” → Select “Descriptive Statistics” → Enter
“$A$1:$D$492” as the Input Range; tick “Labels in first row” and “Summary statistics”;
choose any empty cell for the Output Range (e.g. $G$1) → Click “OK”
⇒ A table of summary statistics for continuous variables:

B. Produce contingency tables for each of the categorical/dummy variables

1/ Choose the “Insert” tab → “PivotChart” → “PivotTable” → “OK”

2/ In PivotChart Fields, tick “occupation_husband” and “occupation_wife”. Drag
“occupation_wife” to Columns / Legend (Series) and then to Values; once the table for
this first categorical variable appears, do the same with “occupation_husband”.
⇒ Contingency tables for each of the categorical/dummy variables:

Comments:

The contingency table illustrates the relationship between husbands' and wives' occupations.
Notably, a considerable number of couples share the same occupation, such as
"Nhan vien van phong" (office staff), which points to a prevalence of
dual-income households. Such pairings suggest greater economic stability, as both partners
contribute to household income. The data may also reflect broader social dynamics, such as
a link between educational attainment and employment opportunities, although the data set
contains no education variable to confirm this. Overall, the table underscores the
importance of occupational pairings in understanding household financial security, and
further research into the impact of these pairings on overall economic well-being
would be useful.
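A two-way contingency table of the kind discussed in these comments can be built by counting (husband, wife) occupation pairs. The sketch below is illustrative only: the helper `contingency_table` is hypothetical, and apart from "Nhan vien van phong", which appears in the data, the label "Other" and the four sample couples are invented.

```python
from collections import Counter


def contingency_table(row_values, col_values):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(row_values, col_values))


office = "Nhan vien van phong"
other = "Other"  # hypothetical placeholder category

# Hypothetical couples, listed in the same household order.
occupation_husband = [office, office, office, other]
occupation_wife = [office, office, other, other]

table = contingency_table(occupation_husband, occupation_wife)
for (husband, wife), count in sorted(table.items()):
    print(f"husband={husband!r}, wife={wife!r}: {count}")
```

The diagonal cells (same occupation for both partners) are the ones the comments above refer to when discussing couples who share a profession.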

C. Make a histogram of expense

Step 1: Copy the expense column → Sort smallest to largest (cell 147 has no data).
Largest = 180, smallest = 1, n = 490 data values
Step 2: Determine the number of bins and the bin width using Sturges' Rule:
K = 1 + 3.3 x log(n) = 1 + 3.3 x log(490) = 9.88 ≈ 10
So we choose 10 bins
Bin width = (X(max) - X(min)) / K = (180 - 1) / 10 = 17.9 ≈ 18
So bin width = 18
Step 3: Go to Insert tab → Charts group → Choose Histogram → Right-click the horizontal
axis → Choose Format Axis → Set the number of bins to 10 and the bin width to 18 → Rename
the chart to HISTOGRAM OF EXPENSE
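To see what the chosen settings (10 bins of width 18 starting at the minimum of 1) do to the data, the binning can be sketched in Python. The function `histogram_counts` and the `expenses` list are hypothetical, and the sketch assumes bins that include their lower edge, which may not match Excel's own edge convention exactly.

```python
def histogram_counts(values, x_min, width, k):
    """Count values per bin; bin i covers [x_min + i*width, x_min + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        index = int((v - x_min) // width)
        # Clamp so that anything outside the range lands in the first or last bin.
        index = min(max(index, 0), k - 1)
        counts[index] += 1
    return counts


# Hypothetical expense values (mil. VND/month), not the assignment data.
expenses = [1, 5, 18, 19, 40, 180]
counts = histogram_counts(expenses, x_min=1, width=18, k=10)
print(counts)
```

With these settings the largest reported expense, 180, falls in the last bin (163 up to 181), so all values are covered by the 10 bins.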

• The histogram of Expense looks like this:

D. Make scatter plots for expense and each of the continuous variables. From the scatter
plots, point out the outliers.
Step 1: Define continuous variables: income (originally “thunhap”), age_wife, age_husband, expense
Step 2: Select the expense and income columns → Go to Insert tab → Charts group → Choose
Scatter. Do the same with the expense and age_wife columns, and the expense and
age_husband columns.
● The Scatter Plot: Expense vs Income looks like this:

In the Scatter Plot of Expense and Income, most values are distributed around or
under 50. Only 1 outlier exists where both the expense and income values exceed 150.

● The Scatter Plot: Expense vs Age of wife looks like this:

In the Scatter Plot of Expense and Age of wife, the values are fairly evenly
distributed across the age range. On the age_wife axis, there is only one outlier;
on the expense axis, there are two.
● The Scatter Plot: Expense vs Age of husband looks like this:

In the Scatter Plot of Expense and Age of husband, the values are spread along the x-axis. As in
the Scatter Plot of Expense and Age of wife, there are two outliers for the expense values.
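The outliers above were identified by eye from the scatter plots. As a complementary check, a common numerical convention is the 1.5 × IQR rule, sketched below. This rule is not the method used in this report; the helper `iqr_outliers` and the sample values are hypothetical and only show how such a check could be run on one variable at a time.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical expense values (mil. VND/month), not the assignment data.
expense = [10, 12, 14, 15, 16, 18, 20, 180]
outliers = iqr_outliers(expense)
print(outliers)
```

Running the same function on each continuous variable would give a list of candidate outliers to compare against the points that stand out visually in the scatter plots.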
