Linear Regression
f_{w,b}(x) = wx + b
\sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
This term is computed for every training example i in the training set, from i = 1 up to the total number of training examples m. Because the total squared error grows as more training examples are added, it is more meaningful to measure the average squared error rather than the total squared error. The cost function J, also called the squared error cost function, is defined as,
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
The cost function most engineers use actually divides by 2m, because the extra factor of two cancels out later when taking derivatives and keeps those calculations neater. Omitting this extra two is perfectly acceptable. While different applications use different types of cost functions, the squared error cost function is easily the most common choice for regression problems.
To get a better sense of the parameters that are adjusted in the cost function, it
can be written as,
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2
The only difference here is that the predicted output \hat{y}^{(i)} is written in terms of the model f_{w,b}(x^{(i)}). Because the model contains the only parameters that can be adjusted (the slope w and the y-intercept b), this is the complete cost function definition.
\min_{w,\,b} J(w, b)
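As a rough sketch of how this cost function could be computed with NumPy (the function name compute_cost and the example data below are made up for illustration):

import numpy as np

def compute_cost(x, y, w, b):
    # Squared error cost J(w, b) for the single-feature model f_{w,b}(x) = wx + b
    m = x.shape[0]                          # number of training examples
    predictions = w * x + b                 # f_{w,b}(x^(i)) for every example at once
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)   # divide by 2m, as in the definition above

# Made-up data: a perfect fit gives zero cost
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(x, y, w=2.0, b=0.0))     # 0.0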
The cost function J(w, b) can be visualized with 2D (contour) and 3D (surface) graphs.
Gradient descent minimizes J by repeatedly nudging each parameter in the direction of the negative derivative, scaled by the learning rate α. For the parameter b,

b = b - \alpha \frac{\partial}{\partial b} J(w, b) \;\Rightarrow\; \frac{\partial}{\partial b} J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)
Final gradient descent algorithm
w = w - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}

b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)
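A minimal sketch of this update loop in NumPy, assuming made-up data and an arbitrary learning rate and iteration count (gradient_descent and its arguments are illustrative names, not a prescribed implementation):

import numpy as np

def gradient_descent(x, y, w, b, alpha, num_iters):
    # Repeats the simultaneous updates for w and b shown above
    m = x.shape[0]
    for _ in range(num_iters):
        error = (w * x + b) - y           # f_{w,b}(x^(i)) - y^(i) for every example
        dj_dw = (error * x).sum() / m     # (1/m) * sum of error * x^(i)
        dj_db = error.sum() / m           # (1/m) * sum of error
        w = w - alpha * dj_dw             # update both parameters together
        b = b - alpha * dj_db
    return w, b

# Made-up data and hyperparameters
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b = gradient_descent(x, y, w=0.0, b=0.0, alpha=0.1, num_iters=1000)
print(w, b)   # w approaches 2, b approaches 0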
Notation
When adding more input features, slight modifications to the notation are
necessary,
x_j = j\text{-th feature}
n = \text{number of features}
\vec{x}^{(i)} = \text{features of the } i\text{-th training example}
x_j^{(i)} = \text{value of feature } j \text{ in the } i\text{-th training example}
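As a concrete, made-up illustration of this notation, the training set can be stored as an m × n NumPy array, so that row i holds \vec{x}^{(i)} and entry (i, j) holds x_j^{(i)} (keeping in mind that Python is 0-indexed, as discussed in the vectorization section below):

import numpy as np

# Made-up training set with m = 3 examples and n = 2 features
X = np.array([[2104.0, 5.0],    # x^(1)
              [1416.0, 3.0],    # x^(2)
              [852.0,  2.0]])   # x^(3)

m, n = X.shape      # m = 3 training examples, n = 2 features
x_2 = X[1]          # features of the 2nd training example (0-indexed row 1)
x_1_of_2 = X[1, 0]  # value of feature 1 in the 2nd training example

print(m, n, x_2, x_1_of_2)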
Model
With the addition of multiple variables comes a necessary update to the model function. Consider the following change, where n refers to the number of input features,

f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b

To simplify the notation, the multivariable model can be written with vectors. Notice how b is not included in the vectors, as it is a single number rather than a vector of per-feature weights.
\vec{w} = [w_1 \;\; w_2 \;\; w_3 \;\; \dots \;\; w_n]
\vec{x} = [x_1 \;\; x_2 \;\; x_3 \;\; \dots \;\; x_n]
f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b
Vectorization
Writing vectorized code is essential for turning these linear algebra equations into something a computer can read and process efficiently. It also happens that GPUs are very efficient at running vectorized code. Consider the following example, where the parameters and features are translated into code.
Notice how the linear algebra notation is 1-indexed, while the Python code is 0-indexed. The Python code uses the widely adopted NumPy package, which implements linear algebra tools among many others. Without NumPy, the dot product between w and x would have to be hardcoded, which becomes especially problematic when n is large.
With the parameters and features

\vec{w} = [w_1 \;\; w_2 \;\; w_3] \qquad \vec{x} = [x_1 \;\; x_2 \;\; x_3] \qquad b \text{ is a number}

the corresponding code is

import numpy as np

n = 3
w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

Without vectorization, the model

f_{\vec{w},b}(\vec{x}) = \left( \sum_{j=1}^{n} w_j x_j \right) + b

is computed with a loop:

f = 0
for j in range(n):  # 0 to n-1
    f += w[j] * x[j]
f += b
With vectorization, the model

f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b

is a single line of code:

f = np.dot(w, x) + b
The benefits of using vectorization are not only cleaner and simpler code, but also much faster performance. This is because NumPy takes advantage of parallel hardware in the computer.
Consider the example above. Without vectorization, the for loop runs sequentially, adding the product of w and x for each index j, step after step: f += w[j] * x[j].
However, with vectorization, the products of w and x for every index j are computed in parallel. Using specialized hardware, these products are then added together to give the final result for f.
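A rough way to see this difference is to time both versions on a large made-up vector; the exact numbers depend on the machine, but the gap is usually dramatic:

import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)
b = 4.0

start = time.time()
f = 0.0
for j in range(n):              # sequential: one multiply-add per step
    f += w[j] * x[j]
f += b
loop_seconds = time.time() - start

start = time.time()
f_vec = np.dot(w, x) + b        # vectorized: handled by NumPy's optimized routines
vec_seconds = time.time() - start

print(loop_seconds, vec_seconds)    # the vectorized version is typically far faster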
Gradient descent with multiple features updates each w_j and b as,

w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)

Notice how the update for b does not involve the n features, because b is a value that does not change based on which feature is being used. Also, the second x_j^{(i)} in the first equation for w does not use vector notation, because that value depends on the current feature j.
With regularization, a penalty term on the size of the parameters w_j (scaled by λ) is added to the cost function to discourage overfitting,

J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b) \qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)

where

\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j

\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)

Since b is not regularized, its update is unchanged,

b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)
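A vectorized sketch of these regularized updates, assuming X is the m × n feature matrix, y the targets, and lambda_ the regularization strength (all names here are illustrative):

import numpy as np

def regularized_gradient_step(X, y, w, b, alpha, lambda_):
    # One gradient descent step for regularized linear regression
    m = X.shape[0]
    error = X @ w + b - y                           # f_{w,b}(x^(i)) - y^(i) for all i
    dj_dw = (X.T @ error) / m + (lambda_ / m) * w   # includes the (lambda/m) * w_j term
    dj_db = error.sum() / m                         # b is not regularized
    w = w - alpha * dj_dw
    b = b - alpha * dj_db
    return w, b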