MATH8009 Assessment 2
Project Description: As a data science consultant, you have been asked by an automotive company
to review a dataset related to vehicle engine specifications and then write a report on your findings.
The dataset contains data on 500 vehicle engines. The dataset variables are summarised in the table
below.
Variable Name    Description (Units of Measurement)
ID               Vehicle identification number
Power            Power of vehicle's engine (horsepower)
Hours            Time to manufacture vehicle (hours)
Exp              Years of experience of vehicle designer (years)
Eff              Efficiency of the vehicle's fuel consumption (miles per gallon)
Type             Vehicle type (0 = jeep, 1 = car)
The company is interested in the relationship (if any) between the power of the engines and the
other continuous variables.
Report Guidelines
• Use the subsequent questions to structure your report and address all questions clearly and
concisely. You may include other findings if they are relevant to your discussion.
• There is no minimum page requirement, but your report should address all questions.
• A small percentage of the total marks are allocated for presentation and clarity.
• Where appropriate, use graphs and tables to assist in answering the questions. All figures and
tables should be discussed. Label each table and figure so that they can be referred to in the
text.
• Do not include any R code in your report.
Part 1: Introduction
• Provide a brief description of the aim of your analysis. Identify the independent variables
and dependent variable. Describe what you are investigating in the context of the variables.
Part 2: Descriptive Statistics
(a) Provide summary statistics for the power and hours variables. Based on their values, which
measures of centrality and variability would you use to describe the distributions of the two
variables? Explain your answer.
(b) Provide a histogram and boxplot for the experience variable. Based on the shapes of the
graphs, which measures of centrality and variability would you use to describe the
distribution of the experience variable? Explain your answer.
(c) What proportion of vehicles are jeeps? What proportion of vehicles have an engine with
over 300 horsepower? What proportion of jeeps have an engine with over 300 horsepower?
(d) Provide a boxplot for horsepower by vehicle type and comment on the relationship (if any)
between horsepower and vehicle type.
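For orientation, the Part 2 tasks might be approached in R along the following lines. This is a minimal sketch only: the data frame name `engines` and all of its values are assumptions (simulated stand-ins, not the company's data), and only the column names come from the variable table. Any output it produces says nothing about the real dataset.

```r
# Hypothetical stand-in for the real dataset: every value below is simulated.
# Only the column names are taken from the variable table.
set.seed(1)
n <- 500
engines <- data.frame(
  ID    = seq_len(n),
  Power = rnorm(n, mean = 250, sd = 50),
  Hours = rnorm(n, mean = 30, sd = 5),
  Exp   = rexp(n, rate = 1 / 8),
  Eff   = rnorm(n, mean = 25, sd = 4),
  Type  = rbinom(n, size = 1, prob = 0.6)  # 0 = jeep, 1 = car
)

# (a) Summary statistics for Power and Hours, plus spread measures
summary(engines[, c("Power", "Hours")])
sapply(engines[, c("Power", "Hours")],
       function(x) c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x)))

# (b) Histogram and boxplot for Exp
hist(engines$Exp, main = "Designer experience", xlab = "Experience (years)")
boxplot(engines$Exp, horizontal = TRUE, xlab = "Experience (years)")

# (c) Proportions: the mean of a logical vector is the share of TRUE values
prop_jeep          <- mean(engines$Type == 0)
prop_over_300      <- mean(engines$Power > 300)
prop_jeep_over_300 <- mean(engines$Power[engines$Type == 0] > 300)  # among jeeps only

# (d) Boxplot of Power split by vehicle type
boxplot(Power ~ Type, data = engines, names = c("Jeep", "Car"),
        xlab = "Vehicle type", ylab = "Power (horsepower)")
```

The third proportion in (c) is conditional: its denominator is the number of jeeps, not the number of vehicles. With the real data, the simulation step would be replaced by reading in the supplied file.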
Part 3: Regression Analysis
(a) Include the following scatter plots and comment on the relationships (if any) between power
and the variables:
1. Power versus Hours
2. Power versus Exp
3. Power versus Eff
(b) State the correlation coefficients for the above three relationships and interpret their values.
(c) State the following linear regression models for power including the appropriate regression
coefficients. Use appropriate variable names relevant to the data.
1. Model 1: Hours as the independent variable
2. Model 2: Exp as the independent variable
3. Model 3: Eff as the independent variable
(d) Show the regression lines for Model 1, Model 2, and Model 3 on the associated scatter plots.
(e) Use Model 3 to predict the horsepower of two engines with efficiencies of 15.5 miles per
gallon and 28.4 miles per gallon. Which of these two predictions of horsepower is likely to be
more accurate? Explain your answer.
(f) For Model 1 explain the regression intercept and slope in the context of the data.
(g) For Model 2 explain the hypothesis tests associated with the regression intercept and slope.
For both tests, state the null and alternative hypotheses. State the relevant p-values and
hence state the decisions of the hypothesis tests.
(h) For Model 3 explain the F-test in the associated R summary output. State the null and
alternative hypotheses. Explain the relevant p-value and hence state the decision of the
hypothesis test.
(i) Which of the three simple linear regression models (i.e., Model 1, Model 2, or Model 3)
would you use to predict horsepower? Provide at least three reasons to explain your answer.
Use summary output from R to justify your answer. It may be helpful to present your results
in a table.
(j) Create a multiple linear regression model that includes the hours, exp, and eff variables.
1. State the multiple linear regression model with the appropriate coefficients.
2. Use the multiple linear regression model to predict the horsepower of an engine
that took 30 hours to manufacture, was designed by a car designer with 10 years’
experience, and has an efficiency of 25 miles per gallon.
3. Is this multiple linear regression model better than the simple linear regression
model that you selected in part (i)? Use relevant R summary output to explain
your answer.
4. How could the proposed multiple linear regression model be improved?
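The Part 3 workflow can be sketched in R as follows. As before, this is an illustration under stated assumptions: the data frame `engines` is simulated, the coefficients used to generate it are arbitrary, and object names such as `model1` and `comparison` are hypothetical. None of the numbers it produces reflect the real dataset.

```r
# Hypothetical stand-in data: simulated values with arbitrary coefficients.
set.seed(1)
n <- 500
Hours <- rnorm(n, mean = 30, sd = 5)
Exp   <- runif(n, min = 1, max = 30)
Eff   <- rnorm(n, mean = 25, sd = 4)
Power <- 400 + 2 * Hours + 1 * Exp - 8 * Eff + rnorm(n, sd = 20)
engines <- data.frame(Power, Hours, Exp, Eff)

# (b) Correlation of Power with each continuous predictor
sapply(engines[c("Hours", "Exp", "Eff")], cor, y = engines$Power)

# (c) The three simple linear regression models
model1 <- lm(Power ~ Hours, data = engines)
model2 <- lm(Power ~ Exp,   data = engines)
model3 <- lm(Power ~ Eff,   data = engines)

# (d) Scatter plot with the fitted regression line (shown for Model 3)
plot(Power ~ Eff, data = engines,
     xlab = "Efficiency (miles per gallon)", ylab = "Power (horsepower)")
abline(model3)

# (e) Predictions with prediction intervals; compare the new values
#     with the observed range of Eff to spot extrapolation
pred3 <- predict(model3, newdata = data.frame(Eff = c(15.5, 28.4)),
                 interval = "prediction")
range(engines$Eff)

# (g) t-tests for intercept and slope; (h) overall F-test and its p-value
summary(model2)$coefficients
f <- summary(model3)$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# (i) Side-by-side comparison of the simple models
models <- list(model1, model2, model3)
comparison <- data.frame(
  model          = c("Model 1", "Model 2", "Model 3"),
  r_squared      = sapply(models, function(m) summary(m)$r.squared),
  residual_se    = sapply(models, sigma)
)

# (j) Multiple linear regression and a single prediction
model_all <- lm(Power ~ Hours + Exp + Eff, data = engines)
pred_all  <- predict(model_all,
                     newdata = data.frame(Hours = 30, Exp = 10, Eff = 25))
summary(model_all)$adj.r.squared
```

Two points are worth keeping in mind when interpreting such output. A prediction is generally more reliable when the predictor value lies inside the observed range and near its mean, which shows up as a narrower prediction interval. When comparing the multiple regression with a simple model, adjusted R-squared is the fairer measure, because ordinary R-squared cannot decrease when predictors are added.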
Part 4: Summary
• Provide a short summary highlighting the main results of your analysis. Discuss the
implications of your findings.