
APSC 258 Midterm Study Guide

Optimization & Gradient Descent


Readings: Chapter 3 Machine Learning Refined 2nd Edition.

Video: Visually Explained.


Machine learning algorithms use gradient descent to optimize a cost function, seeking its global minimum.
This cost function is created by taking the mean squared error between the model's outputs and the true labeled answers, summed over the sample sets (developed below).

Importance:
Machine learning requires a vast amount of training and test data in order to learn and to predict results consistently and correctly on unknown inputs, for example, an AI mapping and categorizing images. But determining how much data to feed the model so that it predicts correct results relies on the gradient descent learning algorithm, so it is imperative that the algorithm is understood.
Let’s start with an example: an algorithm is in charge of sorting waste at a recycling facility. The
model uses images of bottles and cans as returnable containers, plastic bottles as recyclable materials, and
other waste as waste for landfill.
Let each variable below define a set whose elements correspond to one category of waste (bottles, cans, plastics, other waste).
s_1 is a labeled set containing each image and its respective key, and each subset of s_1 and o_1 represents one of these categories.

Let s_1 denote a set of sample images with subsets: s_1 = { b_1 , c_1 , p_1 , w_1 }

Let o_1 denote a set of machine outputs with subsets: o_1 = { B_1 , C_1 , P_1 , W_1 }

Let a_1 denote a set of scalar (subset-less) answers: a_1 = { b_1 , c_1 , p_1 , w_1 }

Let the function M define the machine answer process: s_1 → M[ s_n ] → o_1

Let each letter define a ratio of machine correctness: γ = { β , ς , ρ , ϖ }

β = |B_n| / |b_1| ,   ς = |C_n| / |c_1| ,   ρ = |P_n| / |p_1| ,   ϖ = |W_n| / |w_1| ,   with β, ς, ρ, ϖ ∈ ℝ
The true correct ratios are defined by the set a_1, which has a specific scalar value for each corresponding
subset. For example: of the sample set of images, only 27% of the images are plastics, thus p_1 = 0.27.
Thus, we can define the correctness of this specific set in terms of the mean squared error, below:
MSE = ( b − β )² + ( c − ς )² + ( p − ρ )² + ( w − ϖ )²

Let the sets { s_n , a_n ∈ U_n } define the cost function: f( |U_n| , n ) = Σ_{i=0}^{n} MSE( γ_i , a_i )
Thus, by taking a sum of the mean squared error of each sample set of labeled data, a cost function is
created. However, because this cost function depends on many variables at once, we use gradients to find
its minimum, which tells us after how many data sets (n) of population |U_n| the model is best trained.
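A minimal sketch of this cost in Python (the names mse, cost, machine_ratios, and true_ratios are illustrative assumptions, not notation from the guide):

# Sketch: each labeled sample set contributes one MSE term to the overall cost.
def mse(machine_ratios, true_ratios):
    # Squared error between the machine's output ratios (gamma) and the true ratios (a)
    return sum((a - g) ** 2 for g, a in zip(machine_ratios, true_ratios))

def cost(all_machine_ratios, all_true_ratios):
    # f(|U_n|, n): the sum of the MSE of every labeled sample set
    return sum(mse(g, a) for g, a in zip(all_machine_ratios, all_true_ratios))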

Let θ_1 = |U_n| , θ_2 = n , η = a scalar (the step size)

The gradient is an operator applied to any scalar function; it yields a vector pointing in the
direction of greatest change.

Let’s say there are 300 images of a cow, pig and chicken distributed as: cow = 133, pig = 129, chicken =
38. Thus, the ideal ratios for each are 0.4433, 0.43, 0.1267.
So if the machine gives the output for one test as 0.4, 0.4, 0.2, yielding 120 cows, 120 pigs and 60
chickens, its correctness can be placed into a cost function: f() = (0.4 − 0.4433)² + (0.4 − 0.43)² + (0.2 − 0.1267)², as sketched below.
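Plugging the numbers from this example into that cost term (a quick illustrative check, not from the guide):

# 300 images split as 133 cows, 129 pigs, 38 chickens
true_ratios = [133 / 300, 129 / 300, 38 / 300]   # ~0.4433, 0.43, 0.1267
machine_output = [0.4, 0.4, 0.2]                 # the machine's ratios for one test
one_set_cost = sum((a - g) ** 2 for g, a in zip(machine_output, true_ratios))
print(one_set_cost)                              # ~0.0082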
Now, by summing these cost functions, we create a multidimensional cost function whose global minimum is
hard to visualize or to find via a simple derivative. So we use the gradient operator, which defines a
vector at each point indicating the direction of steepest increase; stepping opposite that vector guides us
toward the minimum. A small step is taken in that direction and the process repeats iteratively. If at a
certain point the gradient (effectively) vanishes and no further descent direction can be found, the
algorithm is at the minimum, and the number of input data required to train the model has been reached.
This is the gradient descent algorithm (sketched below).
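A minimal gradient descent sketch in Python. The two-parameter quadratic cost, the starting point, and the step size η = 0.1 are illustrative assumptions; the loop itself follows the algorithm described above (step opposite the gradient, stop when the gradient is near zero):

def grad(theta):
    # Gradient of the example cost f(t1, t2) = (t1 - 3)^2 + (t2 + 1)^2
    t1, t2 = theta
    return [2 * (t1 - 3), 2 * (t2 + 1)]

theta = [0.0, 0.0]   # starting guess
eta = 0.1            # scalar step size (learning rate)
for _ in range(1000):
    g = grad(theta)
    if all(abs(gi) < 1e-8 for gi in g):   # gradient ~ 0: we are at the minimum
        break
    theta = [t - eta * gi for t, gi in zip(theta, g)]   # small step against the gradient

print(theta)   # converges toward [3, -1], the minimum of the example cost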
Linear Regression:
Linear regression in machine learning is identical to its statistical equivalent: a line of best fit is
defined through the least squares method, where the linear factor α and shift β are found from:

Y = αX + β

α = ( n Σ_i x_i y_i − Σ_i x_i Σ_i y_i ) / ( n Σ_i x_i² − ( Σ_i x_i )² )

β = ( Σ_i y_i − α Σ_i x_i ) / n

Where i represents the index of each data point and Σ_i x_i represents the sum of all data points over the
index i, while n represents the total number of data points given. Once these values are found, the
linear regression is complete. However, Python does not require you to perform all these sums yourself;
instead, the "sklearn" library provides commands that do the hard work for you, as shown below:

from sklearn.linear_model import LinearRegression

# Regression model
reg = LinearRegression()
X = [...]  # training inputs, shaped (n_samples, n_features), e.g. [[x1], [x2], ...]
Y = [...]  # training targets, one value per sample
# Fit the model to the data, then predict at a new point
reg.fit(X, Y)
reg.predict([[5.5]])
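For a self-contained version, here is a hedged sketch with made-up data points that fits the same model and checks the result against the closed-form α and β sums above (the data values are invented for illustration):

from sklearn.linear_model import LinearRegression

# Illustrative data: sklearn expects X as a 2D array (n_samples, n_features)
X = [[1], [2], [3], [4], [5]]
Y = [2.1, 4.2, 5.9, 8.1, 10.0]

reg = LinearRegression()
reg.fit(X, Y)
print(reg.coef_[0], reg.intercept_)   # alpha (slope) and beta (shift)
print(reg.predict([[5.5]]))           # prediction at x = 5.5

# Closed-form least squares using the sums from the formulas above
n = len(Y)
x = [row[0] for row in X]
sx, sy = sum(x), sum(Y)
sxy = sum(xi * yi for xi, yi in zip(x, Y))
sxx = sum(xi * xi for xi in x)
alpha = (n * sxy - sx * sy) / (n * sxx - sx * sx)
beta = (sy - alpha * sx) / n
print(alpha, beta)                    # matches sklearn's coef_ and intercept_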

Through the use of a hyperplane, where the linear model lies in multiple dimensions, a
linear regression can be performed on multiple variables or on data sets sharing a
common output axis. Each new linear term is added to the others and makes no
difference to them provided it does not interfere (the inputs are independent).

Y = α_1 X_1 + β_1 + α_2 X_2 + β_2 + … + α_n X_n + β_n = Σ_{i=1}^{n} ( α_i X_i + β_i )
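A short hedged sketch of the multivariable case (the two feature columns and target values below are invented for illustration); sklearn's LinearRegression accepts several input columns with a common output in the same way:

from sklearn.linear_model import LinearRegression

# Two input features per sample, one common output axis Y
X = [[1.0, 0.5],
     [2.0, 1.5],
     [3.0, 2.0],
     [4.0, 3.5]]
Y = [3.0, 6.5, 8.5, 12.0]

reg = LinearRegression()
reg.fit(X, Y)
print(reg.coef_)                    # one alpha_i per input variable X_i
print(reg.intercept_)               # the combined shift term
print(reg.predict([[2.5, 1.8]]))    # prediction for a new pair of inputs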

Polynomial Regression
A polynomial regression is a linear prediction model with polynomial terms. This means we can treat
it like a multivariable linear regression by substituting X_n = (X_1)^n and then solving for α_n using the same
equations defined above. The order of the polynomial terms (even or odd, higher or lower) dictates the shape
of the function, so adding more terms of greater order increases how closely the regression fits the data.
If the order is too large, overfitting can occur (see figure below at order 17), where small errors can
easily change the shape of the curve, making the model overly dependent on the labeled data. However, if the
order is too small (see figure below at order 2), underfitting occurs and the regression does not reflect the
trend of the overall data. The Bayes Information Criterion (BIC) is a function defined for n data points and
model order k (up to the maximum model order you think is plausible). If a certain polynomial model scores
the lowest out of all other candidate models on the BIC function, it is the most appropriate to use for
prediction.

BIC_k = n log( Σ_{i=0}^{n} ( y_i − p(x_i) )² ) + k log(n)

Figure 1: The BIC value for each model order of the regression; order 3 gives the minimum.
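A hedged sketch of how this BIC comparison could be carried out in Python; the synthetic cubic data and the use of numpy.polyfit for the polynomial fits are assumptions made for illustration, not part of the guide:

import numpy as np

# Synthetic data for illustration: a noisy cubic trend
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = 0.5 * x**3 - x + rng.normal(0, 0.3, size=x.shape)
n = len(x)

def bic(k):
    # BIC_k = n*log(sum of squared residuals) + k*log(n) for an order-k polynomial fit
    coeffs = np.polyfit(x, y, k)
    residuals = y - np.polyval(coeffs, x)
    return n * np.log(np.sum(residuals ** 2)) + k * np.log(n)

scores = {k: bic(k) for k in range(1, 10)}
best_order = min(scores, key=scores.get)
print(best_order)   # the order with the lowest BIC is the one to use for prediction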
