GR 1 Report Week 7
Contents
1. Introduction
2. The need for machine learning
3. Linear regression
   3.1. Cost function
   3.2. Gradient descent algorithm
   3.3. Multiple linear regression
4. Feature engineering
5. Polynomial regression
1. Introduction
This report covers the basics of supervised machine learning, focusing mainly on regression problems. It delves into fundamental concepts such as linear regression, the cost function, gradient descent, feature scaling and feature engineering.
3. Linear regression
The goal of machine learning is to find a model $f(x)$ that roughly fits the training data so that, from this model, the machine can predict future outcomes when it encounters data it has never seen before. A regression problem requires the machine to predict a value from an infinite range of possible values. The simplest method to tackle regression problems is called linear regression. In linear regression, the model takes the form:
$$f_{w,b}(x^{(i)}) = w x^{(i)} + b$$
Here, $w$ and $b$ are called the weight and bias of the model, while the superscript $(i)$ in $x^{(i)}$ denotes the $i$-th training example.
3.1. Cost function
To measure how well a particular choice of $w$ and $b$ fits the training data, a cost function $J(w, b)$ is used; for linear regression it is the squared-error cost over the $n$ training examples:
$$J(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$
Figure 1: The plot of the cost function $J(w, b)$
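As an illustration, here is a minimal NumPy sketch of how this squared-error cost could be computed for a set of training examples; the names compute_cost, x_train and y_train are placeholders, not part of the report.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) for single-feature linear regression."""
    n = x.shape[0]                 # number of training examples
    predictions = w * x + b        # f_{w,b}(x^(i)) for every example at once
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * n)

# Tiny illustrative training set
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost(x_train, y_train, w=2.0, b=0.0))   # perfect fit, prints 0.0
```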
3.2. Gradient descent algorithm
Since the cost function is convex, it has no local minima other than the global minimum. In linear regression, the minimum can be reached by starting at an arbitrary point $(w, b)$, calculating the partial derivatives $\frac{\partial}{\partial w} J(w, b)$ and $\frac{\partial}{\partial b} J(w, b)$ at that point, and nudging $w$ and $b$ in the direction in which the slope goes down, which means the cost function is heading toward the global minimum. This step is repeated until the cost is around the area of the global minimum, i.e. the partial derivatives approach 0. Here is how $w$ and $b$ are updated at each step:
$$w = w - \alpha \frac{\partial}{\partial w} J(w, b) = w - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w, b) = b - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$
The $\alpha$ here is the learning rate of the model, a hyperparameter that is set before training starts. Note that $\alpha$ needs to be chosen carefully: if $\alpha$ is too small, the model will take a long time to train; on the contrary, if $\alpha$ is too big, the model can actually get worse over time because the algorithm overshoots the minimum again and again.
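To make the update rule concrete, below is one possible NumPy implementation of batch gradient descent for the single-feature model; the function name, the learning rate and the synthetic data are illustrative assumptions, not values from the report.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for the single-feature model f_{w,b}(x) = w*x + b."""
    n = x.shape[0]
    w, b = 0.0, 0.0                      # arbitrary starting point
    for _ in range(num_iters):
        errors = (w * x + b) - y         # f_{w,b}(x^(i)) - y^(i) for every example
        dj_dw = np.sum(errors * x) / n   # partial derivative of J with respect to w
        dj_db = np.sum(errors) / n       # partial derivative of J with respect to b
        w -= alpha * dj_dw               # update both parameters using the same gradients
        b -= alpha * dj_db
    return w, b

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])    # generated from y = 2x + 1
w, b = gradient_descent(x_train, y_train, alpha=0.05, num_iters=5000)
print(w, b)                                  # should end up close to w = 2, b = 1
```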
3.3. Multiple linear regression
When each training example has multiple features, the input is represented as a vector $\vec{x}$ and the model uses a weight vector $\vec{w}$:
$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$
By representing the training data as vectors, the machine can use GPU parallel computing to calculate the dot product of $\vec{w}$ and $\vec{x}$ efficiently.
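For example, a single prediction for a multi-feature example can be computed with one dot product call, as in this small sketch (the weights and feature values are made up for illustration):

```python
import numpy as np

# Hypothetical weights and one training example with three features
w = np.array([0.5, -1.2, 3.0])
x = np.array([1200.0, 3.0, 2.0])
b = 10.0

# f_{w,b}(x) = w . x + b computed as a single vectorized dot product
prediction = np.dot(w, x) + b
print(prediction)    # 612.4
```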
For the gradient descent implementation, there is a slight difference with multiple linear regression: every feature gets its own weight update:
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b) = w_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j = 0, 1, \dots, m-1$$
$$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b) = b - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
where $m$ is the number of features.
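Below is one possible vectorized version of these updates, assuming the training examples are stored in an (n, m) NumPy matrix X; the function name and the synthetic data are placeholders.

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for f_{w,b}(x) = w . x + b, where X has shape (n, m)."""
    n, m = X.shape
    w = np.zeros(m)                  # one weight per feature
    b = 0.0
    for _ in range(num_iters):
        errors = X @ w + b - y       # shape (n,): f(x^(i)) - y^(i)
        dj_dw = X.T @ errors / n     # shape (m,): gradient for every w_j at once
        dj_db = np.sum(errors) / n
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

# Synthetic example with 2 features generated from w = [2, -1], b = 0.5
X_train = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_train = X_train @ np.array([2.0, -1.0]) + 0.5
w, b = gradient_descent_multi(X_train, y_train, alpha=0.05, num_iters=20000)
print(w, b)    # should approach [2, -1] and 0.5
```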
4. Feature engineering
Feature engineering is a crucial part of optimization in machine learning. It is the practice of organizing and processing the data before training so that the final input $X$ fits the model best. One key concept of feature engineering is feature scaling, where the features are rescaled so that they all have a similar impact on the final state of the model. For example, suppose a housing price input has 2 features: the size of the house in $m^2$, ranging from 500 to 2000, and the number of bedrooms, ranging from 1 to 5. In this case, the value of the first feature is much higher than the value of the second one. This leads to an uneven influence on the model, where even a small change in the size of the house already outweighs any change in the number of bedrooms. The model will try to compensate by choosing the weights $w_1$ and $w_2$ so that their influences balance out, leading to contour plots of the cost that are "skinny" in the $w_1$ direction compared to the $w_2$ direction and causing the gradient descent algorithm to run slower, since it has a longer path to travel before reaching the global minimum.
One common way to rescale a feature is mean normalization:
$$x_{\text{rescaled}} = \frac{x - \mu}{\max(x) - \min(x)}$$
where $\mu$ is the average value of the feature.
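As a sketch, mean normalization could be applied column by column to a feature matrix like this; the housing numbers below simply mirror the report's example, and the function name is hypothetical.

```python
import numpy as np

def mean_normalize(X):
    """Rescale every feature (column) so that all features span a similar range."""
    mu = X.mean(axis=0)                              # average value of each feature
    feature_range = X.max(axis=0) - X.min(axis=0)    # max(x) - min(x) per feature
    return (X - mu) / feature_range

# House size in m^2 (500-2000) and number of bedrooms (1-5)
X = np.array([[500.0, 1.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])
print(mean_normalize(X))    # both columns now lie roughly within [-0.5, 0.5]
```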
5. Polynomial regression
Sometimes, linear regression is not enough to fit the input data. In that case, polynomial regression can be used to fit a curve over the training data instead of just a straight line. Polynomial regression enables a model to fit the data better, but it might cause some issues such as calculation overflow, since polynomial regression introduces features whose values grow exponentially with the degree; this makes feature scaling even more important. A general form of polynomial regression:
$$f_{\vec{w},b}(x) = w_1 x + w_2 x^2 + w_3 x^3 + \dots + w_n x^n + b$$
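One way to realize this in practice is to build the powers of x as extra feature columns and then fit a linear model on them. The sketch below is illustrative: the names and data are made up, and for brevity it solves the fit with NumPy's least-squares routine instead of gradient descent.

```python
import numpy as np

def polynomial_features(x, degree):
    """Build the feature matrix [x, x^2, ..., x^degree] from a 1-D input array."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])

# Synthetic data following a cubic curve plus a little noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 0.5 * x**3 - x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Expand x into polynomial features, then fit a linear model on those features
X_poly = polynomial_features(x, degree=3)
X_design = np.column_stack([X_poly, np.ones_like(x)])   # last column plays the role of b
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w, b = coeffs[:-1], coeffs[-1]
print(w, b)    # w should be close to [-1, 0, 0.5] and b close to 1
```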