Week 1 - Introduction to Regression Analysis
In this section, we look at several cases that can be modelled through regression, and at the
insights that can be obtained from each of them. We go through the cases one by one, each with
illustrative data and guiding questions. At the end of the section, we use these cases to establish
the data types and data collection methods relevant to regression.
Consider the following cases. By the end of this section, you should be able to explain what
regression analysis is.
Case 1. How much is the salary of a freelance corrector?
An undergraduate student in her third year applied for a part-time job as a corrector at a book
publisher. She was told that the salary is based on the number of pages she checks. However, she
was not told the average payment per page. She then asked her seniors who had done the same work
before, and here is the data she obtained.
a) How can she find out the payment per page?
b) How much can she expect to earn if she checks 200 pages?
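One way to approach both questions is to fit a straight line, payment = intercept + slope × pages, by least squares. The slope then estimates the payment per page, and the fitted line gives the expected payment for 200 pages. The sketch below is only an illustration: the numbers in `pages` and `payment` are invented (they are not the data the student collected), and `fit_line` is a helper name introduced here.

```python
# Hypothetical records from the seniors: pages checked and payment received (IDR).
# These numbers are invented for illustration; they are not the data from the case.
pages = [40, 65, 90, 120, 150]
payment = [210_000, 320_000, 455_000, 590_000, 760_000]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(pages, payment)
print(f"Estimated payment per page: IDR {slope:,.0f}")
print(f"Expected payment for 200 pages: IDR {intercept + slope * 200:,.0f}")
```

Note that 200 pages lies outside the range of the hypothetical records, so the second number is an extrapolation and should be read with care.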
Case 2. How much is the salary of a freelance corrector? (Case 1 continued)
After gathering this information, the part-time student tried to process the data. When she
plotted the data, she obtained something like this.
The student then thought that she might have missed some details. She went back to the seniors who
had provided the earlier information. It turned out that the payment differs depending on the stage
of checking. For the first-draft check, the payment is higher, since the draft is usually still very
rough. Later stages are more refined, so checking should take less time, and the payment is lower.
a) How can she find out the payment per page for each stage?
b) How much can she expect to earn if she checks 200 pages, for each stage?
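Once the stage is known, one simple option is to fit a separate line for each stage, so that each stage gets its own estimated payment per page. The sketch below assumes two hypothetical stage labels ("first draft" and "later stage") and invented, deliberately clean numbers; the actual stages and payments in the case are not given here.

```python
from collections import defaultdict

# Hypothetical records: (checking stage, pages checked, payment in IDR).
# Stage names and numbers are invented for illustration only.
records = [
    ("first draft", 40, 280_000),
    ("first draft", 80, 560_000),
    ("first draft", 120, 840_000),
    ("later stage", 50, 200_000),
    ("later stage", 100, 400_000),
    ("later stage", 150, 600_000),
]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Group the records by stage, then fit one line per stage.
by_stage = defaultdict(list)
for stage, n_pages, paid in records:
    by_stage[stage].append((n_pages, paid))

rates = {}
for stage, rows in by_stage.items():
    x = [row[0] for row in rows]
    y = [row[1] for row in rows]
    intercept, slope = fit_line(x, y)
    rates[stage] = slope
    print(f"{stage}: about IDR {slope:,.0f} per page, "
          f"IDR {intercept + slope * 200:,.0f} for 200 pages")
```

Fitting a single line to all records at once would mix the two payment rates, which is one reason a plot of the pooled data can look puzzling.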
A study was conducted on students’ performance, as measured by their GPA. It is reasonable to assume
that GPA might be explained by the amount of study time. Data from a pilot study of 20 students were
collected, as follows. Study time is the average number of hours spent studying per week; GPA is the
student’s Grade Point Average at the time of observation.
The visualization of the data is as follows.
a) What do you think is the relationship between study time and GPA?
b) How can you write the relationship in mathematical form?
c) How can you plan your target GPA based on the information contained in the graph?
d) If you want to achieve a GPA of 3.5 or higher, how many hours per week (on average) do you need to study?
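If the relationship looks roughly linear, it can be written as GPA = intercept + slope × study hours, and question d) amounts to solving that equation for the study hours. The sketch below uses six invented data points (not the 20 students of the pilot study); `fit_line` and `hours_needed` are helper names introduced here for illustration.

```python
# Hypothetical pilot data (invented for illustration, not the data from the case).
study_hours = [5, 10, 15, 20, 25, 30]
gpa = [2.4, 2.7, 2.9, 3.2, 3.4, 3.7]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hours_needed(target_gpa, intercept, slope):
    """Invert GPA = intercept + slope * hours to get the hours for a target GPA."""
    return (target_gpa - intercept) / slope


intercept, slope = fit_line(study_hours, gpa)
print(f"GPA = {intercept:.2f} + {slope:.3f} * study hours")
print(f"Hours per week for a GPA of 3.5: {hours_needed(3.5, intercept, slope):.1f}")
```

The result describes the average pattern in the data, so it is a planning guide rather than a guarantee for any individual student.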
Thinking further about factors that might affect GPA: people nowadays, especially the younger
generation (students included), can hardly be separated from social media. It is reasonable to expect
that they spend much of their time on social media activities, which may affect their study. Thus,
additional data on the time they spend on social media were recorded, giving the updated data below.
The relationship between time spent on social media and GPA is visualized as follows.
a) What do you think is the relationship between time spent on social media and GPA?
b) What is your strategy if you aim for a high GPA?
c) What is your guess for the GPA of a student who spends 20 hours a week studying and 20 hours a week on social media?
d) How can you get a close (“accurate”) value for the GPA? How can you model this problem mathematically?
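With two explanatory variables, one possible model is GPA = b0 + b1 × study hours + b2 × social media hours, fitted by least squares. The sketch below solves the normal equations with plain Gaussian elimination so that it needs no external library. The data are invented for illustration, and `solve` and `fit_two_predictors` are helper names introduced here.

```python
# Hypothetical data (invented for illustration): weekly hours and GPA.
study = [10, 15, 20, 25, 30, 20]
social = [30, 25, 10, 20, 5, 20]
gpa = [2.4, 2.8, 3.3, 3.3, 3.9, 3.1]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


b0, b1, b2 = fit_two_predictors(study, social, gpa)
print(f"GPA = {b0:.2f} + {b1:.3f} * study + {b2:.3f} * social media")
print(f"Predicted GPA for 20 h study and 20 h social media: {b0 + b1 * 20 + b2 * 20:.2f}")
```

In this kind of model, b1 is read as the change in GPA per extra study hour with social media time held fixed, and b2 likewise for social media time.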
Case 5. How many eggs can you get?
A farmer was experimenting with the treatment of the chickens on his farm. He wanted to know which
nutrients are best to give to the chickens in order to produce more eggs. Among the chickens on the
farm, he chose some in similar condition with respect to age. He then combined the nutrients for the
chickens, using certain amounts of nutrient 1 and nutrient 2. Nutrient 1 is somewhat expensive, while
nutrient 2 is easy to obtain at a lower price.
He further separated the chickens into four groups and exposed them to different environment set-ups.
The set-ups varied in the lighting in the barn and in the time spent in the barn versus out in the
open space, giving four types of environment exposure. After two weeks, he recorded the number of
eggs that each chicken laid every 2 days. The data are as follows.
The data are visualized as follows.
A data type is an attribute of the data that tells how the data are measured. It determines how the
data will be treated in the modelling process, and eventually how the results of the data processing
will be interpreted.
In general, data are divided into two types: numerical and categorical. Numerical data are also
known as quantitative data, and categorical data as qualitative data.
It’s simple. If you can meaningfully perform mathematical operations on the data, such as addition,
subtraction, multiplication, and division, then your data are numerical. Otherwise, your data are categorical.
Let’s say we go shopping and buy two shirts: a blue one priced at IDR 150,000 and a dark red one
priced at IDR 175,000. If we identify the shirts by two attributes, price and colour, then price is
the numerical variable and colour is the categorical variable.
But in data processing, where all data must be stored as numbers, how do we treat colour? In this
case, we pick a number to represent each colour. Although you can pick any numbers, we usually use
simple ones, for example 1 for blue and 2 for dark red.
Now that colour is represented by numbers, can’t we do mathematical operations on those numbers?
Technically, you can. For example, 1 + 2 = 3, or 2 = 2 × 1. But how are you going to interpret
these results?
What is the meaning of 1 + 2 = 3 for the shirts? Does a blue shirt plus a dark red shirt result in
another type of shirt? What is the meaning of 2 = 2 × 1? Does it mean the dark red shirt is worth
twice as much as the blue shirt? Moreover, what if other people record the same data but code the
colours the other way around, i.e. 1 for dark red and 2 for blue? So, contextually, you cannot do
mathematical operations on numbers that merely represent a qualitative measurement.
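The point about arbitrary coding can be checked with a small sketch. Here the same hypothetical list of shirts is coded in the two ways mentioned above, and the "average colour" changes with the coding, while a simple count per colour does not. The list `shirts` and the helper `mean_code` are invented for illustration.

```python
from collections import Counter

# Hypothetical purchases, recorded by colour only.
shirts = ["blue", "dark red", "dark red", "blue", "dark red"]

# Two equally valid codings of the same colours.
coding_a = {"blue": 1, "dark red": 2}
coding_b = {"blue": 2, "dark red": 1}


def mean_code(items, coding):
    """Average of the numeric codes: computable, but not meaningful for colours."""
    codes = [coding[item] for item in items]
    return sum(codes) / len(codes)


print(mean_code(shirts, coding_a))  # 1.6
print(mean_code(shirts, coding_b))  # 1.4

# Counting how often each category occurs does not depend on the coding.
print(Counter(shirts))
```

A summary that changes when the labels are swapped says something about the coding, not about the shirts, which is why categorical data are summarised by counts or proportions instead.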
Furthermore, numerical data can be divided into two types: interval and ratio. The difference
between the two lies in the absolute zero: a ratio scale has an absolute zero, while an interval
scale does not. For example, body height is on a ratio scale, while temperature in degrees
Fahrenheit or Celsius is on an interval scale. Body height is never below zero, and a height of zero
means there is no height. For temperature, 0 degrees Fahrenheit does not mean there is no
temperature: 0 degrees Fahrenheit is -17.78 degrees Celsius, and both scales continue below zero,
down to about -273 degrees Celsius (0 Kelvin). Can you think of other examples of interval and
ratio scales?
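A practical consequence of the absolute zero is that ratios such as "twice as much" are meaningful on a ratio scale but not on an interval scale. The sketch below illustrates this with unit conversions: the ratio of two heights is the same in centimetres and inches, while the ratio of two temperatures changes between Celsius and Fahrenheit. The function names and example values are chosen here for illustration.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9


def cm_to_inch(cm):
    return cm / 2.54


# Ratio scale: 180 cm is twice 90 cm, and it is still twice as tall in inches.
print(180 / 90, cm_to_inch(180) / cm_to_inch(90))

# Interval scale: 20 degrees C is not "twice as hot" as 10 degrees C,
# because the ratio changes when the same temperatures are written in Fahrenheit.
print(20 / 10, celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10))

# The conversion quoted in the text: 0 degrees Fahrenheit in Celsius.
print(round(fahrenheit_to_celsius(0), 2))  # -17.78
```

Differences, on the other hand, remain comparable on an interval scale: a gap of 10 degrees Celsius always corresponds to a gap of 18 degrees Fahrenheit.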
Categorical data can also be further divided into two types: nominal and ordinal. The difference
between the two is in the ordering. A nominal scale has no order, while in an ordinal scale the
order matters. For example, eye colour is on a nominal scale, while an opinion on a certain topic
(strongly disagree, disagree, neutral, agree, strongly agree) is on an ordinal scale. Can you think
of other examples of nominal and ordinal scales?
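The distinction also affects how the two kinds of categorical data are commonly prepared for analysis. As a rough sketch, a nominal variable can be turned into 0/1 indicator values (one per category, with no order implied), while an ordinal variable can keep its order through ranks, which makes order-based summaries such as the median response possible. The eye colours, responses, and helper names below are invented for illustration.

```python
# Ordered response categories from the opinion example, lowest to highest.
LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]


def ordinal_rank(response):
    """Rank of a response (1 = strongly disagree, ..., 5 = strongly agree)."""
    return LIKERT.index(response) + 1


def indicators(value, categories):
    """One 0/1 indicator per category; no ordering between categories is implied."""
    return [1 if value == category else 0 for category in categories]


# Nominal: hypothetical eye colours, represented by indicators.
eye_colours = ["blue", "brown", "green"]
print(indicators("green", eye_colours))  # [0, 0, 1]

# Ordinal: hypothetical responses; the median uses only the order of the categories.
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]
ranks = sorted(ordinal_rank(r) for r in responses)
median_response = LIKERT[ranks[len(ranks) // 2] - 1]
print(median_response)  # agree
```

The ranks record order only; the gap between "neutral" and "agree" is not assumed to equal the gap between "agree" and "strongly agree".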
[Figure: data types, divided into categorical and numerical]
No   Measurements         Types
1    number of pages
2    stage
3    payment
4    study time
5    GPA
6    social media time
7    nutrient 1
8    nutrient 2
9    environment
10   number of eggs
Recall the cases discussed previously. The way the data were gathered is summarized in the
following table.
For cases 1-4, data were collected by questioning or observing the target subjects, while for
case 5, the farmer designed the conditions under which he wanted to observe the target measurement.
In cases 1-4, we could not assign the subjects to designated conditions; all we did was observe the
results under whatever conditions were available. We did not have the authority, for example, to
group students based on their intelligence, age, or semester, ask them to spend a certain amount of
time on study and on social media, and then record the target (i.e. GPA) at the end of the semester.
In case 5, by contrast, the farmer had the authority to control the conditions for the objects
(i.e. the chickens) under which he wanted to observe the results on the target (i.e. number of
eggs).
So, what can you infer about the difference between observational and experimental studies?
In cases 1-4, data were collected by observation. In case 5, data were collected by
experimentation. The basic difference between the two is in the control: in an observational study,
the researchers have no control over the conditions under which the data arise, while in an
experimental study the researchers can design, or manipulate, the conditions for data collection.
Furthermore, since there are no designated conditions in an observational study, we cannot establish
a cause-and-effect relationship among the variables (or measurements). In an experimental study, we
can change the conditions under which the data are measured, and we also control or minimize the
effect of other factors (for example, by choosing chickens of the same age in case 5). It is then
highly plausible that changes in the target response are due to the changes in the set-up
conditions. Thus, for an experimental study, a cause-and-effect relationship may be established.
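One common way for an experimenter to exercise this control is to assign the subjects to the conditions at random, so that other factors are spread evenly across groups on average. Case 5 does not say how the farmer formed his four groups, so the sketch below is only an assumption about how such an assignment could be done; the chicken labels, the group count of four, and the helper `assign_groups` are introduced here for illustration.

```python
import random


def assign_groups(units, n_groups, seed=0):
    """Randomly split the units into n_groups groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = [[] for _ in range(n_groups)]
    for i, unit in enumerate(shuffled):
        groups[i % n_groups].append(unit)
    return groups


# Hypothetical flock of 20 chickens of similar age.
chickens = [f"chicken-{i:02d}" for i in range(1, 21)]
groups = assign_groups(chickens, 4)
for number, members in enumerate(groups, start=1):
    print(f"Environment {number}: {members}")
```

In an observational study no such step is available: the students in cases 3 and 4 chose their own study and social media hours.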
Exercise:
2. For each of the studies below, explain how you would conduct the data collection, and
determine whether it is an observational or an experimental study.
a. A study on the relationship between the number of subjects taken in a semester and the
resulting GPA for that semester.
b. A study assessing the effectiveness of teaching in English versus teaching in Bahasa
Indonesia. What measurements will you record? Give an example of the first four rows of
your hypothetical data.
c. A study on advertising expenditure and a company’s income from some products.
d. A study on voucher offers and customer loyalty on an online transportation platform.
e. A study on breastfeeding and the intelligence of a child.