Lec 03 - Linear Regression (Optimized)
[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets (three panels).]
[Figure: Income plotted against years of education (two panels).]
FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve).
§ Which model is better: linear or non-linear?
Regression Analysis
[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets (three panels).]
FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
DEFINITION
Regression analysis: a statistical tool used to model the relationship between variables that are related non-deterministically.
More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \dots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \dots, X_p)$, which can be written
$Y = f(X) + \epsilon$
where $Y$ is the output variable, $X$ is the input variable, and $\epsilon$ is the (random) error.
For the Advertising data:
$Y = f(X) + \epsilon$
with $Y$ = Sales, $X = (X_1, X_2, X_3)$, and $\epsilon$ = error (random).
Terms for $Y$:
• Output variable
• Response variable
• Dependent variable
Terms for $X = (X_1, X_2, X_3)$:
• Predictors
• Features
• Independent variable
Simple Linear Regression
Simple?
§ Number of response variables = 1
§ Number of features/predictors = 1
Linear?
$f(X) = \beta_0 + \beta_1 X$
Linear equation / straight line
[Figure: Income plotted against Years of Education.]
In $f(X) = \beta_0 + \beta_1 X$:
§ $\beta_0$ is the intercept
§ $\beta_1$ is the slope
• Model
$Y = \beta_0 + \beta_1 X + \epsilon$
where $\epsilon$ is a random variable with mean 0 and variance $\sigma^2$.
[Figure: Points corresponding to observations from the simple linear regression model; the deviations $\epsilon_1$ and $\epsilon_2$ are marked at $x_1$ and $x_2$.]
• Observations
$y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1$
$y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2$
$\vdots$
$y_n = \beta_0 + \beta_1 x_n + \epsilon_n$
• Prediction / estimation model
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
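The observation equations can be illustrated with a small simulation. This is a minimal Python sketch, assuming normally distributed errors (the slide only states mean 0 and variance $\sigma^2$); the function name `simulate_observations` and every parameter value below are hypothetical, chosen purely for illustration.

```python
import random

def simulate_observations(beta0, beta1, sigma, xs, seed=0):
    """Draw y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ Normal(0, sigma^2).

    The normal distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical inputs: seven x values and made-up "true" parameters.
xs = [10, 12, 14, 16, 18, 20, 22]
ys = simulate_observations(beta0=-40.0, beta1=5.0, sigma=3.0, xs=xs)

# With sigma = 0 every point lies exactly on the line beta0 + beta1 * x.
noise_free = simulate_observations(beta0=-40.0, beta1=5.0, sigma=0.0, xs=xs)

for x, y, y_line in zip(xs, ys, noise_free):
    print(f"x={x:>2}  y={y:7.2f}  line={y_line:6.2f}  eps={y - y_line:6.2f}")
```

Each printed `eps` is one realization of the random error: the vertical distance between an observation and the true line.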
Simple Linear Regression (4)
[Figure: Income plotted against Years of Education, with the observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, the fitted line, and a residual marked.]
• Prediction / estimation model
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
• Observations
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i, \quad i = 1, 2, \dots, n$
• Residual
$e_i = y_i - \hat{y}_i$, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• Residual sum of squares (RSS)
$\mathrm{RSS} = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} e_i^2$
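The residual and RSS definitions translate directly into code. A minimal Python sketch follows; the data points and the two candidate lines are made-up values for illustration, and the helper names `residuals` and `rss` are hypothetical.

```python
def residuals(xs, ys, b0, b1):
    """Return e_i = y_i - (b0 + b1 * x_i) for every observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def rss(xs, ys, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum(e * e for e in residuals(xs, ys, b0, b1))

# Made-up observations and two candidate lines.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.5, 5.5]

rss_line_a = rss(xs, ys, b0=0.5, b1=1.75)
rss_line_b = rss(xs, ys, b0=0.0, b1=2.0)

print("RSS of line A:", rss_line_a)
print("RSS of line B:", rss_line_b)
```

The candidate with the smaller RSS lies closer to the observations in the least-squares sense, which is the criterion used to choose between lines.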
Simple Linear Regression (5)
The problem in simple linear regression:
§ Given the observations
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
and the prediction model
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
determine $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the RSS.
The hat symbol, ˆ, denotes the estimated value for an unknown parameter or coefficient, or the predicted value of the response.
[Figure: Income plotted against Years of Education, with the point $(x_n, y_n)$ marked.]
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$\bar{y} = \dfrac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
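The expression for $\hat{\beta}_0$ follows from setting the partial derivatives of the RSS to zero. The derivation below is a standard sketch added for reference; it is not taken from the slides, and it uses the same symbols as above.

```latex
\begin{align*}
\mathrm{RSS}(\hat{\beta}_0, \hat{\beta}_1)
  &= \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 \\[4pt]
\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_0}
  &= -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
  \;\Longrightarrow\;
  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\[4pt]
\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_1}
  &= -2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0 \\[4pt]
\text{Substituting } \hat{\beta}_0:\quad
  & \sum_{i=1}^{n} x_i (y_i - \bar{y})
    - \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0 \\[4pt]
\hat{\beta}_1
  &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
   = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last equality holds because $\sum_{i}(y_i - \bar{y}) = 0$ and $\sum_{i}(x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside either sum changes nothing.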
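Simple Linear Regression (7)
Alternative calculation:
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$
where
$S_{xy} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \dfrac{\left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n}$
$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \dfrac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}$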
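Both forms of $S_{xy}$ and $S_{xx}$ can be checked numerically. This is a minimal Python sketch with made-up data; the function names `s_centered` and `s_shortcut` are hypothetical.

```python
def s_centered(xs, ys):
    """S_xy and S_xx from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy, s_xx

def s_shortcut(xs, ys):
    """S_xy and S_xx from raw sums (the alternative calculation)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    return s_xy, s_xx

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

s_xy_a, s_xx_a = s_centered(xs, ys)
s_xy_b, s_xx_b = s_shortcut(xs, ys)

slope = s_xy_b / s_xx_b                                   # beta1_hat = S_xy / S_xx
intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)  # beta0_hat = y_bar - beta1_hat * x_bar

print("centered:", s_xy_a, s_xx_a)
print("shortcut:", s_xy_b, s_xx_b)
print("slope:", slope, "intercept:", intercept)
```

The raw-sum form is convenient for hand calculation, but it subtracts two large, nearly equal numbers, so the centered form is usually the numerically safer choice in code.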
Calculation example:
§ Relationship between the cetane number of biodiesel fuel and the iodine value (grams):
$x$ = iodine value
$y$ = cetane number
EXAMPLE 12.4 The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. The article “Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study” (J. of Automobile Engr., 2009: 565–583) included the following data on x = iodine value (g) and y = cetane number for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil. The article’s authors fit the simple linear regression model to this data, so let’s follow their lead.
x  132.0 129.0 120.0 113.2 105.0 92.0 84.0 83.2 88.4 59.0 80.0 81.5 71.0 69.2
y  46.0 48.0 51.0 52.1 54.0 52.0 59.0 58.7 61.6 64.0 61.4 54.6 58.8 58.0
[Figure: Scatterplot of cetane number (cet num) against iodine value (iod val).]
Figure 12.8 Scatterplot for Example 12.4 with least squares line superimposed, from Minitab
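The least-squares line for the cetane data can be reproduced in a few lines. This is a Python sketch using the 14 (x, y) pairs listed in the example; the function name `least_squares_line` is hypothetical, and the approximate values in the comments come from running this calculation, not from the source.

```python
# Data from Example 12.4: x = iodine value (g), y = cetane number.
iodine_value = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
                83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]
cetane_number = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
                 58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]

def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0_hat, b1_hat = least_squares_line(iodine_value, cetane_number)

# Residual sum of squares of the fitted line.
rss = sum((y - (b0_hat + b1_hat * x)) ** 2
          for x, y in zip(iodine_value, cetane_number))

print(f"intercept = {b0_hat:.3f}")  # approximately 75.212
print(f"slope     = {b1_hat:.4f}")  # approximately -0.2094
print(f"RSS       = {rss:.3f}")
```

The negative slope says that, in this sample, the fitted cetane number decreases as the iodine value increases.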
3.1.1 Estimating the Coefficients
In practice, $\beta_0$ and $\beta_1$ are unknown. So before we can use (3.1) to make predictions, we must use data to estimate the coefficients. Let
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. In the Advertising example, this data set consists of the TV advertising budget and product sales in n = 200 different markets. (Recall that the data are displayed in Figure 2.1.) Our goal is to obtain coefficient estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model (3.1) fits the available data well, that is, so that $y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, \dots, n$. In other words, we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the n = 200 data points. There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion, and we take that approach in this chapter. Alternative approaches will be considered in Chapter 6.
Least-Squares Estimation
§ Least-squares linear regression (line): determine $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize
$\mathrm{RSS} = e_1^2 + e_2^2 + \dots + e_n^2$
Linear Regression with MATLAB
§ Uses two functions, polyfit and polyval
§ The polyfit function is used to determine $\hat{\beta}_0$ and $\hat{\beta}_1$
§ The polyval function is used to determine predictions based on the values of $X$, $\hat{\beta}_0$, and $\hat{\beta}_1$
§ Experiment with the following code:
§ Group 1
[Textbook excerpt: p. 304, Chapter 7, Linear Regression. The exercise statement along the left margin is truncated in the source.]
Individual   Arm Strength, x   Dynamic Lift, y
1            17.3              71.7
2            19.3              48.3
3            19.5              88.3
4            19.7              75.0
5            22.9              91.7
6            23.1              100.0
7            26.4              73.3
8            26.8              65.0
9            27.6              75.0
10           28.1              88.3
11           28.2              68.3
12           28.7              96.7
13           29.0              76.7
14           29.6              78.3
15           29.9              60.0
16           29.9              71.7
17           30.3              85.0
18           31.3              85.0
19           36.0              88.3
20           39.5              100.0
21           40.4              100.0
22           44.3              100.0
23           44.6              91.7
24           50.4              100.0
25           55.9              71.7
Pressure, x (lb/sq in.)   Scale Reading, y
10                        13
10                        18
10                        16
10                        15
10                        20
50                        86
50                        90
50                        88
50                        88
50                        92
(a) Find the equation of the regression line.
(b) The purpose of calibration in this application is to estimate pressure from an observed scale reading. Estimate the pressure for a scale reading of 54 using $\hat{x} = (54 - b_0)/b_1$.
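The calibration formula $\hat{x} = (54 - b_0)/b_1$ inverts the fitted line: fit y on x as usual, then solve for x at the observed y. This Python sketch uses the pressure and scale-reading pairs from the table above; the function name `fit_line` is hypothetical, and the numbers in the comments come from running this calculation, not from the textbook.

```python
# Pressure (lb/sq in.) and scale reading, from the calibration table above.
pressure = [10, 10, 10, 10, 10, 50, 50, 50, 50, 50]
reading = [13, 18, 16, 15, 20, 86, 90, 88, 88, 92]

def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(pressure, reading)

# Inverse prediction: solve 54 = b0 + b1 * x for x.
observed_reading = 54
x_hat = (observed_reading - b0) / b1

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")          # approximately -1.70 and 1.81
print(f"estimated pressure = {x_hat:.2f}")      # approximately 30.77
```

Because x takes only two distinct values here, the fitted line passes through the mean reading at each pressure level, which makes the result easy to verify by hand.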
§ Group 2
7.2 The grades of a class of 9 students on a midterm report (x) and on the final examination (y) are as follows:
x (midterm grade)      77 50 71 72 81 94 96 99 67
y (final exam grade)   82 66 78 34 47 85 99 99 68
(a) Estimate the linear regression line.
(b) Estimate the final examination grade of a student who received a grade of 85 on the midterm report.
1. Make a scatter plot of the data above.
2. Determine the linear regression line equation for the data above ($\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$).
3. Draw that line on the scatter plot from task 1.
4. Estimate the final exam grade when the midterm grade is 75.
5. Determine the RSS from the observations and the estimates.
7.3 The amounts of a chemical compound y that dissolved in 100 grams of water at various temperatures x were recorded as follows:
x (°C)   y (grams)
0        8   6   8
15       12  10  14
30       25  21  24
45       31  33  28
60       44  39  42
75       48  51  44
7.5 A study was made on the amount of converted sugar in a certain process at various temperatures. The data were coded and recorded.
(a) Estimate the linear regression line.
(b) Estimate the mean amount of converted sugar produced when the coded temperature is 1[...] (value truncated in the source).
(c) Plot the residuals versus temperature.
Temperature, x   Converted Sugar, y
1.0              8.1
1.1              7.8
1.2              8.5
1.3              9.8
1.4              9.5
1.5              8.9
1.6              8.6
1.7              10.2
1.8              9.3
1.9              9.2
2.0              10.5
§ Group 5
§ Group 8
E11-2 House Data
y = Sale Price/1000   x = Taxes (local, school, county)/1000
y       x        |  y       x
25.9    4.9176   |  30.0    5.0500
29.5    5.0208   |  36.9    8.2464
27.9    4.5429   |  41.9    6.6969
25.9    4.5573   |  40.5    7.7841
29.9    5.0597   |  43.9    9.0384
29.9    3.8910   |  37.5    5.9894
30.9    5.8980   |  37.9    7.5422
28.9    5.6039   |  44.5    8.7951
35.9    5.8282   |  37.9    6.0831
31.5    5.3003   |  38.9    8.3607
31.0    6.2712   |  36.9    8.1400
30.9    5.9592   |  45.8    9.1416
1. Make a scatter plot of the data above.
2. Determine the linear regression line equation for the data above ($\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$).
3. Draw that line on the scatter plot from task 1.
(c) Calculate the fitted value of y corresponding to x = 5.8980. Find the corresponding residual.
(d) Calculate the fitted $\hat{y}_i$ for each value of $x_i$ used to fit the model. Then construct a graph of $\hat{y}_i$ versus the corresponding observed value $y_i$ and comment on what this plot would [...]
11-8. Table E11-3 presents the highway gasoline mileage performance and engine displacement for DaimlerChrysler vehicles for model year 2005 (U.S. Environmental Protection Agency).
(a) Fit a simple linear model relating highway miles per gallon (y) to engine displacement (x) in cubic inches using least squares.
(b) Find an estimate of the mean highway gasoline mileage performance for a car with 150 cubic inches engine displacement.
(c) Obtain the fitted value of y and the corresponding residual for a car, the Neon, with an engine displacement of 122 cubic inches.
11-9. An article in the Tappi Journal (March 1986) presented data on green liquor Na2S concentration (in grams per liter) and paper machine production (in tons per day). The data (read from a graph) follow:
y   40  42  49  46  44  48  46  43  53  52   54   57   58
x   825 830 890 895 890 910 915 960 990 1010 1012 1030 1050
(a) Fit a simple linear regression model with y = green liquor Na2S concentration and x = production. Find an estimate of $\sigma^2$. Draw a scatter diagram of the data and the resulting least [...]
§ Group 9
E11-4 Propellant Data
Observation Number   Strength y (psi)   Age x (weeks)
1                    2158.70            15.50
2                    1678.15            23.75
3                    2316.00            8.00
4                    2061.30            17.00
5                    2207.50            5.00
6                    1708.30            19.00
7                    1784.70            24.00
8                    2575.00            2.50
9                    2357.90            7.50
10                   2277.70            11.00
11                   2165.20            13.00
12                   2399.55            3.75
13                   1779.80            25.00
14                   2336.75            9.75
15                   1765.30            22.00
1. Make a scatter plot of the data above.
2. Determine the linear regression line equation for the data above ($\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$).
3. Draw that line on the scatter plot from task 1.
4. Estimate the strength when the age is 10.
5. Determine the RSS from the observations and the estimates.
(b) What is the estimate of expected BOD level when the time is 15 days?
(c) What change in mean BOD is expected when the time changes by three days?
(d) Suppose that the time used is six days. Calculate the fitted value of y and the corresponding residual.
(e) Calculate the fitted $\hat{y}_i$ for each value of $x_i$ used to fit the model. Then construct a graph of $\hat{y}_i$ versus the corresponding observed values $y_i$ and comment on what this plot would look like if the relationship between y and x was a deterministic (no random error) straight line. Does the plot actually obtained indicate that time is an effective regressor variable in predicting BOD?
11-16. An article in Wood Science and Technology [“Creep in Chipboard, Part 3: Initial Assessment of the Influence of Moisture Content and Level of Stressing on Rate of Creep and Time to Failure” (1981, Vol. 15, pp. 125–144)] reported a study of the deflection (mm) of particleboard from stress levels of relative humidity. Assume that the two variables are related according to the simple linear regression model. The data follow:
x = Stress level (%): 54 54 61 61 68 [...]
(a) Estimate the regression coefficients.
(b) Do the data support the claim that systolic blood pressure does not depend on an individual’s weight?
(c) If a large number of males weighing 182 pounds have their blood pressures taken, determine an interval that, with 95 percent confidence, will contain their average blood pressure.
(d) Analyze the standardized residuals.
(e) Determine the sample correlation coefficient.
1. Make a scatter plot of the data above.
2. Determine the linear regression line equation for the data above ($\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$).
3. Draw that line on the scatter plot from task 1.
4. Estimate the systolic blood pressure when the body weight is 230 pounds.
5. Determine the RSS from the observations and the estimates.
32. It has been determined that the relation between stress (S) and the number of cycles to failure (N) for a particular type of alloy is given by [...]