We Have Read and Understand Academic Integrity Policies and Practices and Our Assessment Does Not Violate These
Cover Page
a. What type of survey method could the researcher use, and why?
The research question is whether more education leads to higher income. This belief may be true or false,
and testing it requires in-depth analysis. The researcher can use a questionnaire as the survey method,
because different people may hold different views on this question and may weigh different attributes or
factors when answering it. In general, a questionnaire is a formal method of data acquisition in which a
set of questions is prepared to assess attributes of the respondents such as opinion, behaviour, attitude
and experience. A questionnaire can gather qualitative as well as quantitative data, as described in
(Bhandari, 2021, p1(1)). Nowadays, thanks to technological innovation, a questionnaire can be prepared
online and sent to many users for data collection.
b. What sampling method could the researcher use to select the sample and why?
Once data collection is complete, the population data usually contains a large number of observations.
Although the number of records is finite, it may still be too large to analyse conveniently, so the
researcher can select an appropriate sample, that is, a subset of the population collected through the
survey. Simple random sampling can be used: a random value is assigned to each observation in the
population, and the observations with, say, the thirty smallest random values are selected. The researcher
can use a computer-generated random number procedure (rand) to perform the sampling, as described in
(QuestionPro, 2023, p3(1)). Excel provides a powerful library of functions for this kind of requirement.
Sampling must be performed carefully, because otherwise a biased sample may result. A sample is unbiased
when the expected value of the sample mean equals the population mean.
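The sampling procedure described above can be sketched in Python as follows. This is only an illustrative sketch and not the Excel workbook used in this report: the function name simple_random_sample, the respondent IDs and the seed value are assumptions made for the example.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Tag every observation with a random key and keep the smallest keys."""
    rng = random.Random(seed)
    # Pair each observation with a uniform random number in [0, 1),
    # similar to filling a helper column with random values in a spreadsheet.
    keyed = [(rng.random(), item) for item in population]
    # Sort on the random key only and keep the first sample_size rows.
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:sample_size]]


# Hypothetical population: respondent IDs 1 to 1200.
population = list(range(1, 1201))
sample = simple_random_sample(population, 30, seed=42)
print(len(sample))  # 30
```

Because every observation receives its key independently, each one has the same chance of being among the thirty smallest, which is what makes the sample a simple random sample.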
c. What are the most important variables the researcher should consider collecting data for the purpose
of this analysis and why? Identify the data type(s) for the variables.
To judge a candidate's calibre, qualification alone is not enough; other valuable attributes also need to
be considered. Together these attributes give a detailed picture of the candidate's capacity and help to
determine income, as described in (Foundit, 2023, p2(1)).
a. Qualifications – This is the primary factor that makes a person eligible for a particular profile, but
it does not completely determine the candidate's income. The value is a list, i.e. it can contain several
values.
b. Skills – The list of skills in which the candidate has strong expertise.
c. Specialized Skill – The skill in which the candidate has a higher level of expertise.
d. Expertise Level – A value on a given range indicating how much expertise the candidate has in the
specialized skill, for example from 0 (minimum) to 10 (maximum), where 5 means intermediate level.
e. Experience – A value indicating the candidate's previous years of experience.
f. Certifications – Whether the candidate holds certifications in the desired field.
g. Estimated Salary – The estimated salary based on the above factors.
For more detail, refer to the article (University of Wisconsin-Madison, 2023, p1(3)).
d. What kinds of issues may the researcher face in this data collection?
During data collection the researcher may face several issues, some of which are explored below. For more
information, refer to (Stedman, 2023, p4(1)).
a. What to collect – Deciding what data is required is a major issue; in other words, what needs to be
collected or asked from the audience.
b. Building the questionnaire – The second major issue is preparing a relevant question set. The questions
must be simple, precise and easy to understand. Assistance from an expert will be a great advantage.
c. Sufficient number of observations – The researcher may fail to collect a sufficient amount of data. In
this case, the process will be repeated with another target audience.
d. Quality of data – The researcher may face data-quality problems: some observations may have missing
attribute values, which need to be cleaned up first.
e. Real-time response – Respondents may not respond at all or may not respond on time. Such delays
increase the time needed to collect data and require reminders to be sent to complete the survey.
f. Structuring or formalizing data sets – Another major challenge is to put the recorded raw data into a
structured form so that it can be used for analysis.
Part – II
The histogram shown above is negatively skewed: most observations lie on the right-hand side of the
histogram, with the tail extending to the left. Most of the population lies between 10 and 24. The
histograms are drawn using Excel's histogram chart type; for more detail, refer to (Bansal, 2023, p3(1)).
An additional series is added to show the line segment that depicts the shape. Thus, we can say that most
of the population has ten or more years of education.
The histogram above depicts the distribution of income values. In contrast to the education graph, it is
positively skewed: most values lie on the left side of the graph, with the tail extending to the right.
Most of the observations fall in the class interval from 15000 to 90000.
b. Descriptive Analysis
Education Variable Descriptive Analysis - The distribution graph of education values is shown below and is
negatively skewed; the skewness value of the data is also negative. In other words, most of the values lie
on the right-hand side of the graph, with the tail on the left. The first quartile (Q1) is 14, so 25% of
the values fall below 14. Likewise, the second quartile (Q2), also known as the median, is 15, so 50% of
the values fall below it. The third quartile (Q3) is 18, so 75% of the values fall below it. Other measures
such as minimum, maximum, count and kurtosis are also shown. Excel's Data Analysis ToolPak is used to
compute the descriptive statistics, as described in (Excel Easy, 2023, p1(1)).
[Figure: Education (Frequency Distribution Graph); y-axis: No. of Occurrences (Frequency), 0 to 400; x-axis: 0 to 26]
Income Descriptive Analysis
The above distribution graph of income values shows positive skewness, i.e. most of the values lie on the
left-hand side of the graph, and the distribution is not symmetric. The quartiles of the distribution are:
Q1 = 51171 (25% of the values fall below it), Q2 = 58817.5 (50% fall below it) and Q3 = 67175.5 (75% fall
below it).
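As a rough illustration of how such quartiles and the sign of the skewness can be obtained outside Excel, a minimal Python sketch is given below. The income values are hypothetical and are not taken from the data set; the "inclusive" method of statistics.quantiles is assumed here to be the closest match to the interpolation Excel uses for quartiles.

```python
import statistics

# Hypothetical income values, for illustration only (not the report's data).
incomes = [42000, 48000, 51000, 55000, 58000, 61000, 66000, 70000, 95000]

# Quartiles with linear interpolation between the minimum and the maximum.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

# Moment-based skewness: the mean cubed deviation divided by the cubed
# population standard deviation. A positive value means a longer right tail.
mean = statistics.fmean(incomes)
spread = statistics.pstdev(incomes)
skewness = sum((x - mean) ** 3 for x in incomes) / len(incomes) / spread ** 3

print(q1, q2, q3)    # 51000.0 58000.0 66000.0
print(skewness > 0)  # True
```

Here the single large value (95000) pulls the mean above the median, which is the same pattern as the positively skewed income distribution described above.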
c. Box Plot of Education and Income variables
Boundary Detection
Measure          Education    Income
IQR              4            16004.5
Upper Whisker    24           91182.25
Lower Whisker    8            27164.25
In the box plot above, five measures are used: the minimum, the maximum and the quartiles Q1, Q2 and Q3,
where Q2 is the median. IQR stands for interquartile range, which is the difference Q3 - Q1. The
differences between these measures are calculated to find the regions from which the box plot is drawn.
The whisker limits are calculated as Upper = Q3+1.5*IQR and Lower = Q1-1.5*IQR, as described in
(Six-Sigma, 2023, p2(5)). To plot the box plots, we used Excel's stacked column chart and added as many
series as required to the stacked columns. For more detail about the calculations, refer to
(Microsoft, 2023, p1(1)) onwards.
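The whisker equations can be checked with a short Python sketch. The quartile values are the ones reported in this section; the function name whisker_limits is an assumption made for the example and is not part of the Excel workbook.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return (IQR, lower limit, upper limit) for the given quartiles."""
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return iqr, lower, upper


# Quartiles reported above: education Q1 = 14, Q3 = 18;
# income Q1 = 51171, Q3 = 67175.5.
education = whisker_limits(14, 18)
income = whisker_limits(51171, 67175.5)

print(education)  # (4, 8.0, 24.0)
print(income)     # (16004.5, 27164.25, 91182.25)
```

The results agree with the IQR, lower whisker and upper whisker values listed under Boundary Detection.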
d. Identifying outliers from education and income variables
The points which lie above the upper whisker or below the lower whisker are treated as outliers, as given
in (Six-Sigma, 2023, p2(1)). The outliers are identified in Excel using a conditional expression in which
H$44 represents the upper limit and H$45 the lower limit of the education whiskers, calculated as described
in the previous task. Likewise, I$44 represents the upper whisker and I$45 the lower whisker of income.
Below are some of the outliers identified:
91250
99325
110350
151500
A total of seventeen outliers are detected in the education variable, several of which occur repeatedly in
the data set. Likewise, twenty-one outliers are detected in the income variable. Some outliers overlap
because their values are close to each other or duplicated, so they appear as overlapping points in the box
plot. The same effect appears for the income variable: some values lie close to the lower whisker and some
overlap. The above tables and graphs show the outliers for each variable.
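The outlier rule can also be written as a small Python sketch. This is an assumed equivalent of the conditional check and not the Excel formula itself; the function name find_outliers is hypothetical, and the list mixes the four income outliers listed above with three made-up non-outlier values.

```python
def find_outliers(values, lower, upper):
    """Return the values that fall outside the whisker limits."""
    return [v for v in values if v < lower or v > upper]


# Income whisker limits from the Boundary Detection table.
LOWER_WHISKER = 27164.25
UPPER_WHISKER = 91182.25

# 58000, 45000 and 67000 are hypothetical values inside the whiskers;
# the other four are outliers reported above.
incomes = [58000, 91250, 45000, 99325, 110350, 67000, 151500]

print(find_outliers(incomes, LOWER_WHISKER, UPPER_WHISKER))
# [91250, 99325, 110350, 151500]
```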
Part III
a. Scatter Plot
Below is the scatter plot, with education in years as the x variable and income in $ as the y variable.
x is treated as the independent variable and y as the dependent variable. In other words, we plot the
income values (y) against the education values (x) in order to determine the relationship, as described in
(Wolber, 2023, p1(1)). The statement “more education leads to higher income” fits this data: as the years
of education increase, the corresponding incomes also increase. However, there is considerable variation at
some data points. For example, at seven and at fifteen years of education there are several different
incomes, and the same effect appears at other data points. So income cannot be determined by years of
education alone; it also depends on other factors.
Regression Equation
Predicted y=mx+b
Slope Value
m = 3686.455
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.92967
R Square 0.8642863
Adjusted R Square 0.864173
Standard Error 4455.7252
Observations 1200
Coefficients: slope m = 3686.4547, intercept b = 445.16194
Regression Equation: y = mx + b
Example: predicted y value from the linear regression for x = 17
y = 3686.4547 * 17 + 445.16194
y = 63114.892
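The worked example can be reproduced with a few lines of Python. The slope and intercept are the values used in the example equation; the function name predict_income is an assumption made for illustration.

```python
# Coefficients taken from the example equation in this section.
SLOPE = 3686.4547       # m: change in income per extra year of education
INTERCEPT = 445.16194   # b: predicted income when education is 0 years


def predict_income(education_years):
    """Apply the fitted line y = m*x + b."""
    return SLOPE * education_years + INTERCEPT


print(round(predict_income(17), 3))  # 63114.892
```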
The regression output in the Excel sheet reports the correlation coefficient as the Multiple R value. Here
the correlation is positive and close to 1; a value above 0.92 indicates a strong relationship. So we can
say that in this data set education has a strong positive relationship with income. For more detail, refer
to (Cheusheva, 2023, p8(1)).
R Square 0.864286272
The value R2 represents the proportion of the variance in the dependent variable that is explained by the
independent variable. It indicates the goodness of fit of the line, i.e. how well the line fits the data
points. In other words, R2 tells how well the independent variable explains the variation in the dependent
variable, as described in (Cheusheva, 2023, p9(1)). So, in general terms, R2 provides a measure of how well
the statistical model predicts. In our model R2 is about 0.864, so education explains roughly 86% of the
variation in income. Most points lie near the regression line, as shown in the graph above, and hence we
can say the model predicts the relationship between the independent and dependent variables well.
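To make the link between the slope, the correlation coefficient and R2 concrete, a minimal least-squares sketch in Python follows. The five data points are hypothetical and much smaller than the 1200 observations of the report, and the function name fit_line is an assumption; the last line only checks that squaring the reported Multiple R gives the reported R Square.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical education (years) and income ($) pairs.
xs = [10, 12, 14, 16, 18]
ys = [38000, 44000, 53000, 58000, 67000]

slope, intercept, r = fit_line(xs, ys)
r_squared = r ** 2

# R2 is also 1 minus the share of variance left in the residuals.
mean_y = sum(ys) / len(ys)
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)

print(slope, intercept)
print(abs(r_squared - (1 - ss_res / ss_tot)) < 1e-9)  # True
print(round(0.92967 ** 2, 4))  # 0.8643, matching the reported R Square
```

For a regression with a single independent variable, the two ways of computing R2 shown here always agree, which is why the Excel output lists R Square as the square of Multiple R.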
Part - IV
This assignment requires an understanding of the strength of the relationship between two variables,
education and income. Part one relates to the data collection activities required for the question of
whether more education leads to higher income. This belief can be true or false, depending on the data set
collected. The method that can be used to collect data for this hypothesis is the survey questionnaire.
The questionnaire must be carefully designed to capture the behaviour or interest of the respondents, and
its questions must be accurate and relevant to our requirement. The survey questionnaire can be
distributed online to reach a large target audience. For more detail, refer to (Bhandari, 2021, p1(1)).
The sampling method that can be used to select a subset of the data set is simple random sampling. This
method generates a random number for each observation and then selects the observations with the smallest
numbers (for example, the smallest 30). It is useful when a large number of observations is available,
since it can be very difficult for analysts to handle a large amount of data. The rand procedure can be
used to perform this sampling, as published in (QuestionPro, 2023, p3(1)). To answer the question of
whether more education leads to higher income, many other factors may be required, such as qualification,
expertise level, specialized skills and the others described earlier. Together these may help to produce a
significant statistical model or analysis, and they help to predict better outcomes, as given in
(University of Wisconsin-Madison, 2023, p1(3)). In addition, the researcher may face many challenges
during data collection. Such challenges may create barriers in the research and need to be examined
carefully. Examples are the quality of the collected data, building the questionnaire, real-time response
from respondents, recording a sufficient number of responses, and structuring and formalizing data sets.
These issues are difficult but not impossible to handle; they may take time but can be handled
successfully. For more detail, refer to (Stedman, 2023, p4(1)).
The histograms of the available data set are explored in the report above. The histogram of the education
variable shows most of the values on the right-hand side of the graph, with the tail on the left, which
means the shape is negatively skewed. The graph shows that many people have ten or more years of
education, and that the interval 10-24 covers most of the population. Likewise, the graph of the income
variable is positively skewed; in other words, most of the values are on its left side, and the interval
45000-90000 covers most of the population. Descriptive analysis of both variables has been carried out to
understand their distributions before examining the relationship between them. Descriptive statistics show
the mean, median, mode, skewness and other relevant details of the data, as described in
(Excel Easy, 2023, p1(1)). The skewness of the education distribution is negative, while the skewness of
the income variable is positive. We have plotted the distributions of these variables separately, in
addition to the histograms. Box plots depict quartile ranges and outliers; we plotted box plots for both
variables by first calculating the three quartiles of each, using Excel's stacked column chart type as a
base. The box plots were then extended to show the outliers, which are identified using the lower and upper
whiskers. To plot the box plots we used the five measures described above, and then calculated the
interquartile range and the differences between the measures to identify the regions covered. The upper
whisker limit is Q3+1.5*IQR, where IQR is the interquartile range, and the lower whisker limit is
Q1-1.5*IQR, as given in (Six-Sigma, 2023, p2(5)). Q2 is the median, which is also a significant measure in
the box plot. The points above the upper whisker or below the lower whisker are treated as outliers. Note
that some data values are identical, so such outliers overlap in the plots.
To extend the research and find the relationship between education and income in the data set, a scatter
plot was drawn as described above in Part III, following (Wolber, 2023, p1(1)). The plot shows that as the
years of education increase, income also increases. So we can say that most of the points support the
statement “more education leads to higher income”. There is, however, some variation at the same data
point (education year). For example, education year fifteen has several different income values, although
they are generally higher than those of previous years. For a few data points the assumption does not
hold. In this graph, education is the independent variable and income is the dependent variable; the
former is represented by x and the latter by y. A more advanced statistical model, a linear regression
model, was then built; the model predicts the behaviour of one variable from the other, as given in
(Cheusheva, 2023, p9(1)). The model shows that many of the data points are close to the regression line.
We used Excel's data analysis package to complete this. The slope gives the steepness of the line, i.e.
the change in y for a one-unit increase in x. The intercept is the value of y at which the line crosses
the y-axis (x = 0): a positive intercept means the line crosses the y-axis above the origin, and a
negative intercept means it crosses below the origin. The correlation coefficient (Multiple R value) of
more than 0.92 shows a strong positive relationship. Likewise, R2 shows how well one variable explains the
variation in the other, as described in (Cheusheva, 2023, p9(1)); R2 is the square of the correlation. The
equation of the line is y = mx + b, where b is the intercept and m is the slope, x is the original value
and y is the predicted value. Adding the residual error to the predicted value gives back the original
value. As per the model outcome, the predicted values are close to the original values.
References
1. Bansal, S., 2023. How to Make a Histogram in Excel (Step-by-Step Guide). [Online]
Available at: https://trumpexcel.com/histogram-in-excel/
[Accessed 10 Oct 2023].
2. Bhandari, P., 2021. Questionnaire Design | Methods, Question Types & Examples. [Online]
Available at: https://www.scribbr.com/methodology/questionnaire/
[Accessed 11 Oct 2023].
3. Cheusheva, S., 2023. Linear regression analysis in Excel. [Online]
Available at: https://www.ablebits.com/office-addins-blog/linear-regression-analysis-excel/
[Accessed 11 Oct 2023].
4. Excel Easy, 2023. Descriptive Statistics. [Online]
Available at: https://www.excel-easy.com/examples/descriptive-statistics.html
[Accessed 09 Oct 2023].
5. Foundit, T., 2023. Salary Secrets: How Companies determine Your Value. [Online]
Available at: https://www.foundit.in/career-advice/salary-secrets-how-companies-determine-what-you-are-worth/
[Accessed 11 Oct 2023].
6. Microsoft, 2023. Create a box plot. [Online]
Available at: https://support.microsoft.com/en-au/office/create-a-box-plot-10204530-8cdf-40fe-a711-2eb9785e510f
[Accessed 10 Oct 2023].
7. QuestionPro, 2023. Simple Random Sampling: Definition and Examples. [Online]
Available at: https://www.questionpro.com/blog/simple-random-sampling/
[Accessed 11 Oct 2023].
8. Six-Sigma, 2023. Box Plot. [Online]
Available at: https://www.six-sigma-material.com/Box-Plot.html#:~:text=The%20lowest%20point%20of%20the%20lower%20whisker%20is%20called%20the,*%20(Q3%2DQ1).
[Accessed 09 Oct 2023].
9. Stedman, G., 2023. Data collection. [Online]
Available at: https://www.techtarget.com/searchcio/definition/data-collection
[Accessed 10 Oct 2023].
10. Thomas, L., 2020. Simple Random Sampling | Definition, Steps & Examples. [Online]
Available at: https://www.scribbr.com/methodology/simple-random-sampling/
[Accessed 11 Oct 2023].
11. University of Wisconsin-Madison, 2023. Manager & Supervisor Job Aid: Determining starting salary
of new Hire. [Online]
Available at: https://hr.wisc.edu/docs/recruitment/determining-starting-salary-of-a-new-hire.pdf
[Accessed 11 Oct 2023].
12. Wolber, A., 2023. How to Create a Scatter Plot in Excel. [Online]
Available at: https://www.lifewire.com/create-scatter-plot-in-excel-4689096
[Accessed 10 Oct 2023].