Uploaded by Jerry Yue

Simulation Notes

Machine Learning
Introduction to ML
Machine learning is a data-driven approach in which computers learn from existing data to make
predictions, often for optimisation purposes.

There are 3 types of machine learning:

- Supervised
- Unsupervised
- Reinforcement

In supervised learning, labelled datasets are provided to algorithms; each dataset is often split into
two subsets for training and testing, respectively (usually an 80/20 or 70/30 split). In each subset,
variables are typically labelled as the target variable and the predictor variables. The target variable
is the one whose data points are to be modelled using the predictor variables. An example would be bank
loan repayment prediction: predictor variables such as initial payment, last payment and credit score
are used to predict the bank's decision as the target variable.
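The supervised setup above can be sketched as follows; the synthetic data, the repayment rule and the exact 80/20 split are illustrative assumptions, not taken from a real loan dataset:

```python
import numpy as np

# Toy labelled dataset: rows are loan applications, columns are the
# predictor variables (initial payment, last payment, credit score);
# y is the target variable (1 = loan approved/repaid, 0 = not).
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 1, size=(100, 3))
y = (X[:, 2] > 0.5).astype(int)  # synthetic rule: credit score drives the decision

# 80/20 train/test split, as mentioned above
n_train = int(0.8 * len(X))
indices = rng.permutation(len(X))
train_idx, test_idx = indices[:n_train], indices[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

A model would then be fitted on the training subset only, and its accuracy reported on the held-out test subset.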

In unsupervised learning, datasets are unlabelled; the algorithm must make predictions by
studying the patterns of the existing data without any explicit instructions, so there is no
need to split datasets into training and testing subsets. An example would be weather forecasting
based on past daily temperature, humidity and weather patterns.
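As a minimal sketch of the unsupervised idea, a hand-rolled k-means loop can group made-up temperature/humidity readings with no labels at all; the two-cluster structure of the data is an assumption for illustration:

```python
import numpy as np

# Unlabelled daily weather readings: [temperature, humidity] (made-up values).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal([30, 40], 2, (50, 2)),   # hot, dry days
                  rng.normal([15, 80], 2, (50, 2))])  # cool, humid days

# A bare-bones k-means loop: no labels and no train/test split are needed;
# the algorithm groups the data purely by the patterns it finds.
k = 2
centroids = data[rng.choice(len(data), k, replace=False)]
for _ in range(10):
    # assign each point to its nearest centroid
    labels = np.argmin(np.linalg.norm(data[:, None] - centroids, axis=2), axis=1)
    # move each centroid to the mean of its assigned points (keep it if empty)
    centroids = np.array([data[labels == c].mean(axis=0) if np.any(labels == c)
                          else centroids[c] for c in range(k)])

print(np.round(centroids, 1))
```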

Reinforcement learning is similar to a human trial-and-error process: algorithms are trained to make
optimal decisions in a dynamic environment. Common examples include text prediction, traffic
control and teaching a computer to play games. The performance of a reinforcement learning algorithm is
often monitored by tracking its learning curve.
Artificial Neural Networks
Introduction to ANN
A typical artificial neural network (ANN) begins with an input layer and ends with an output layer
with multiple hidden layers in between.

Each layer contains a number of neurons; how many neurons each layer has depends on the type of
problem involved and how the output(s) should be deduced from the inputs. However, the
number of neurons generally decreases from the first to the last layer.

The process begins with assigning an activation value (often lying between 0 and 1) to each neuron in
the input layer. A set of weight values, which can be positive or negative, corresponds to each
neuron in the following layer.

The magnitudes of the weights reflect the strengths of the inter-neuron connections.
The number of weight values in each set must equal the number of neurons in the first
layer. The weighted sum over all neurons is calculated for each set of weight values, together with a
bias value (one bias for each neuron in the following layer).

The bias used in calculating a neuron's activation value indicates whether that neuron will
be active or inactive in the calculations of the activation values of the neurons in the following
layer (which also emphasises the goal to be achieved across the layers).

The weighted sums are used as inputs to an activation function (e.g. Sigmoid, which produces outputs
between 0 and 1, or ReLU, whose outputs are not bounded above by 1). The outputs then become the
activation values of the corresponding neurons in the following layer. The higher the activation value,
the more relevant the neuron is for determining the neurons of interest in the following layer.

The process repeats itself until the output layer is reached and a final decision is made to produce
the output, i.e. the neuron with the highest activation value in the output layer is selected. Calculations
of this kind can be visualised as matrix multiplication.
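The forward pass described above, visualised as matrix multiplication, might look like the following sketch; the layer sizes (4 inputs, 3 hidden, 2 outputs) and the random weights are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    # squashes the weighted sums into the (0, 1) activation range
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes chosen arbitrarily for illustration: 4 inputs -> 3 hidden -> 2 outputs
rng = np.random.default_rng(42)
sizes = [4, 3, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

a = rng.uniform(0, 1, sizes[0])  # activation values of the input layer
for W, b in zip(weights, biases):
    # one layer of the forward pass: weighted sum plus bias, then activation
    a = sigmoid(W @ a + b)

prediction = int(np.argmax(a))  # the neuron with the highest activation wins
print(a, prediction)
```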

When assigning activation values to neurons in the input layer for initialisation, there should be a
criterion for each input; the magnitude of each neuron's activation value then depends on how well the
input fits that criterion.

However, the initial assignment of weight and bias values is usually totally random. As
expected, the quality of the output in the early stages is most likely unacceptable.

The accuracy of the output vector produced by each trial is measured by the cost of a single
training example. The cost is calculated as the sum of the squared differences between
the actual activation value of each output neuron and its corresponding desired activation value. The
higher the cost, the lower the accuracy.

NOTE: The desired activation value should be 1 only for the correct output and 0 for the rest.
As a result, appropriate neuron weights and biases are defined as the values that result in the lowest
cost for the output, meaning that the task becomes finding the minimum of the function that produces
the cost as its single output and takes all the weights and biases as its inputs.

The process of searching for local minima is known as gradient descent. The cost function uses the
average cost over all training data of interest to compute the gradient (a vector with the same number
of elements as the total number of weights and biases in the neural network).

The computation begins with an initial random set of values for all weights and biases.
The gradient is evaluated at these values, and the weights and biases are then nudged in the
direction opposite to the gradient to reduce the cost. The process is repeated
until a minimum value is reached.

The magnitudes of the gradient components represent how sensitive the cost function is to changes in each
weight and bias: the larger the magnitude, the higher the sensitivity.

Since the average cost is used in this approach, all training examples should benefit from the
adjustments and result in lower training costs.
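The gradient descent loop can be illustrated on a toy one-parameter cost function whose minimum is known exactly; the quadratic here is an assumption standing in for the real averaged cost:

```python
# Gradient descent on a toy cost function C(w) = (w - 3)^2 + 1.
# The real cost averages over all training examples; here a simple
# quadratic stands in so the minimum (w = 3, C = 1) is known exactly.
def cost(w):
    return (w - 3.0) ** 2 + 1.0

def gradient(w):
    return 2.0 * (w - 3.0)

w = 10.0            # stands in for the random initial value
learning_rate = 0.1
for _ in range(200):
    w -= learning_rate * gradient(w)  # step against the gradient

print(round(w, 4), round(cost(w), 4))
```

Each step moves `w` opposite to the gradient, so the cost shrinks toward its minimum value of 1.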

Backpropagation is the core learning algorithm for ANNs. The goal for any training
example is to maximise the activation value of the desired output neuron and minimise those of the
undesired output neurons. Since all output neurons are in the final layer of the network, the algorithm
calculates everything backwards from there.

There are 3 ways to increase the activation value of the desired output neuron:

- Increase the bias
- Increase the weights in proportion to the activation values from the previous layer
- Change the activation values from the previous layer in proportion to the weights

This process is repeated for all neurons in the output layer; opposite changes need to be applied to the
undesired output neurons to minimise their activation values.

The network propagates backwards from the output layer until the input layer is reached, so
that the desired changes for all weights and biases can be calculated. The average desired change for each
weight and bias is computed over all training data; together these form the negative gradient used in the
gradient descent method.

In practice, the computation time would be too long if the weight and bias nudges were to be
calculated using the whole training dataset as the input. The technique for shortening computation
time is known as stochastic gradient descent.

It begins with randomly shuffling the dataset and dividing it into mini-batches. A gradient descent
step is computed for each mini-batch; the steps are noisier, meaning that the descent is
less efficient per step, but it results in a significant computational speedup.
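The shuffle-and-split step can be sketched as follows; the dataset size and batch size are illustrative assumptions:

```python
import numpy as np

# Shuffle-and-split step of stochastic gradient descent: the dataset is
# shuffled, then divided into mini-batches; one (noisier but cheap)
# gradient step would then be taken per batch. Sizes are illustrative.
rng = np.random.default_rng(7)
n_examples, batch_size = 1000, 32

indices = rng.permutation(n_examples)              # random shuffle
batches = [indices[i:i + batch_size]
           for i in range(0, n_examples, batch_size)]

# 1000 examples -> 31 full batches of 32 plus one final batch of 8
print(len(batches), len(batches[0]), len(batches[-1]))
```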

Following the ways proposed to adjust the activation values of output neurons, the computation of the
gradient of the cost function can be shown as the following equations:

$$C=\frac{1}{n}\sum_{i=0}^{n-1} C_i \qquad \nabla C=\begin{pmatrix} \partial C/\partial w_{jk}^{(1)} \\ \partial C/\partial b_j^{(1)} \\ \vdots \\ \partial C/\partial w_{jk}^{(l)} \\ \partial C/\partial b_j^{(l)} \end{pmatrix}$$

Equation 1. Cost function & cost gradient (the superscript (1) refers to the layer pair closest to the input layer; (l) refers to the output layer)

Note that j and k refer to indices of neurons in adjacent layers and are reused from layer pair to layer pair:

$$0 \le j < \text{number of neurons in the outer layer} \qquad 0 \le k < \text{number of neurons in the inner layer}$$

(Inner refers to the layer closer to the input.)
Each element in the cost gradient vector is the average, over all training examples, of the
corresponding partial derivative of the cost of a single training example, which can be computed using
the chain rule; the computation begins from the output layer and works backwards:

$$\frac{\partial C}{\partial w_{jk}^{(l)}}=\frac{1}{n}\sum_{i=0}^{n-1}\frac{\partial C_i}{\partial w_{jk}^{(l)}} \qquad \frac{\partial C_i}{\partial w_{jk}^{(l)}}=\frac{\partial z_j^{(l)}}{\partial w_{jk}^{(l)}}\,\frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}\,\frac{\partial C_i}{\partial a_j^{(l)}}$$

Equation 2. Cost & outer-weight partial derivative (output layer)

$$\frac{\partial C}{\partial b_j^{(l)}}=\frac{1}{n}\sum_{i=0}^{n-1}\frac{\partial C_i}{\partial b_j^{(l)}} \qquad \frac{\partial C_i}{\partial b_j^{(l)}}=\frac{\partial z_j^{(l)}}{\partial b_j^{(l)}}\,\frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}\,\frac{\partial C_i}{\partial a_j^{(l)}}$$

Equation 3. Cost & bias partial derivative (output layer)

Each factor in the chain-rule equations can be calculated using the following equations:

$$C_i=\sum_{j=0}^{n_l-1}\left(a_j^{(l)}-y_j\right)^2 \qquad \frac{\partial C_i}{\partial a_j^{(l)}}=2\left(a_j^{(l)}-y_j\right)$$

Equation 4. Direct calculation of cost

$$n_l=\text{number of outer neurons} \qquad a_j^{(l)}=\text{outer neuron actual activation value} \qquad y_j=\text{outer neuron desired activation value}$$

$$a_j^{(l)}=f_a\!\left(z_j^{(l)}\right) \qquad \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}=f_a'\!\left(z_j^{(l)}\right)$$

Equation 5. Activation value of output neuron

$$z_j^{(l)}=\text{weighted sum of the corresponding neuron} \qquad f_a=\text{activation function}$$


$$z_j^{(l)}=\sum_{k=0}^{n_{l-1}-1}\left(w_{jk}^{(l)}a_k^{(l-1)}\right)+b_j^{(l)} \qquad \frac{\partial z_j^{(l)}}{\partial w_{jk}^{(l)}}=a_k^{(l-1)} \qquad \frac{\partial z_j^{(l)}}{\partial b_j^{(l)}}=1$$

Equation 6. Weighted sum calculation

$$w_{jk}^{(l)}=\text{weight linking inner and outer neurons} \qquad a_k^{(l-1)}=\text{activation value of inner neuron} \qquad b_j^{(l)}=\text{bias value for computation of outer neuron}$$
By substituting the equations above, the partial derivatives of the cost of a single training example
become:

$$\frac{\partial C_i}{\partial w_{jk}^{(l)}}=a_k^{(l-1)}\,f_a'\!\left(z_j^{(l)}\right)\,2\left(a_j^{(l)}-y_j\right)$$

$$\frac{\partial C_i}{\partial b_j^{(l)}}=f_a'\!\left(z_j^{(l)}\right)\,2\left(a_j^{(l)}-y_j\right)$$

Note that these partial derivatives are only for the computation of the elements associated with the
output layer and its adjacent layer in the gradient vector.

The sensitivity of the cost of a single training example to the activation values of neurons one
layer away from the output layer can also be deduced using the chain rule (summing over every outer
neuron j that the inner neuron k feeds into):

$$\frac{\partial C_i}{\partial a_k^{(l-1)}}=\sum_{j=0}^{n_l-1}\frac{\partial z_j^{(l)}}{\partial a_k^{(l-1)}}\,\frac{\partial a_j^{(l)}}{\partial z_j^{(l)}}\,\frac{\partial C_i}{\partial a_j^{(l)}} \qquad \frac{\partial z_j^{(l)}}{\partial a_k^{(l-1)}}=w_{jk}^{(l)}$$

Equation 7. Cost & activation value partial derivative

To compute the weights linked to neurons closer to the input layer, the chain-rule equations need
to be extended (j and k are reused for the outer and inner neuron of the layer pair one step closer to
the input; the last factor in each equation is given by Equation 7):

$$\frac{\partial C_i}{\partial w_{jk}^{(l-1)}}=\frac{\partial z_j^{(l-1)}}{\partial w_{jk}^{(l-1)}}\,\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}\,\frac{\partial C_i}{\partial a_j^{(l-1)}}$$

Equation 8. Cost & inner-weight partial derivative

$$\frac{\partial C_i}{\partial b_j^{(l-1)}}=\frac{\partial z_j^{(l-1)}}{\partial b_j^{(l-1)}}\,\frac{\partial a_j^{(l-1)}}{\partial z_j^{(l-1)}}\,\frac{\partial C_i}{\partial a_j^{(l-1)}}$$

Equation 9. Cost & inner-bias partial derivative

Note that the partial derivatives above need to be calculated for all training examples and the average
is taken for gradient computation.

Eventually, the gradient vector could be fully computed by extending the chain rule equations shown
above in similar sequence for each layer of the network until the input layer is reached.
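Putting the chain-rule steps together, a minimal backpropagation loop for a tiny 2-3-2 network might look like the following sketch; the sigmoid activation, the input values and the target vector are all illustrative assumptions:

```python
import numpy as np

# A 2-3-2 network trained on one made-up example, following the chain-rule
# steps above (sigmoid activation; cost = sum of squared differences).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

x = np.array([0.5, 0.9])   # input activations (illustrative)
y = np.array([1.0, 0.0])   # desired output: 1 for the correct neuron, 0 for the rest

lr = 0.5
for _ in range(500):
    # forward pass, keeping each layer's values for the backward pass
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # backward pass: dC/da at the output, then chain through f_a'(z)
    dC_da2 = 2.0 * (a2 - y)           # derivative of the squared-difference cost
    delta2 = dC_da2 * a2 * (1 - a2)   # sigmoid'(z) = a * (1 - a)
    dC_da1 = W2.T @ delta2            # sensitivity of cost to hidden activations
    delta1 = dC_da1 * a1 * (1 - a1)

    # gradient descent step on every weight and bias
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1

cost = np.sum((a2 - y) ** 2)
print(round(float(cost), 5))
```

With a single training example the cost drops close to zero, since the network only has to memorise one input-output pair.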
Tips for Neural Networks
1. Neurons in the preceding layer can be connected to one or more neurons in the current layer,
depending on the purpose of the current neuron. For example, the areas of the building and land are
critical parameters for property valuation regardless of the circumstances; however, the number of
bedrooms might be more important than the distance to the CBD in a city with a large number of
families with children. In that case, connecting the number of bedrooms (as an input neuron) to a
neuron in the following layer that combines it with the areas of the building and land is more
favourable than connecting the distance to the CBD.

2. Different activation functions could be selected for different neurons in the same layer, depending
on the criteria of "significant impact" of that neuron. For example, the areas of the building and land,
the number of bedrooms along with the age of the building could be combined for property evaluation
and the value depreciates as the building gets older; an appropriate activation function might be
Sigmoid in this case. However, some older buildings with historical significance might be a lot more
expensive than properties with similar attributes in other areas. In that case, a separate neuron
connection should be established just for the age attribute and an appropriate activation function
would be ReLU (i.e. the function is not triggered to boost property value until a threshold age is met).

3. A common problem with gradient descent is that the minimum found for the cost function could be a
local minimum when the function is not convex (i.e. not shaped like an upward-opening parabola), which
is very common for complex optimisation problems. Stochastic gradient descent is therefore employed
when seeking global minima of functions with complicated geometries. This method uses a subset of the
dataset as the input for each step, adjusting the weights and biases after each run until
the algorithm converges.

Data Preprocessing

Categorical Data Encoding


Categorical data encoding is the process of converting categorical (textual) data into numerical data
for the convenience of algorithm operation, as machines work with numbers, not strings. There are 2
types of categorical data: ordinal and nominal.

An ordinal dataset has an inherent order meaning that the data can be ranked from the lowest to the
highest or vice versa; an example would be the level of education attained.

A nominal dataset does not have an inherent order meaning that the data cannot be ranked, e.g.
locations of interest, departments at a tertiary institution.

The choice of encoding method could have a significant impact on the performance of the model
therefore choosing an appropriate method is critical at the beginning stage of the development of a
neural network.

There are several types of categorical data encoding methods:

 One-hot encoding
- The most common method: a binary column is created for each unique category in the variable
- 0 = not present, 1 = present
- Columns are moved to the beginning of the dataset

 Dummy encoding
- Works in the same way as one-hot encoding but uses one fewer column
- N categories require N-1 binary columns

 Label encoding
- Each unique category is assigned a unique integer value
- Assigned integers may be misinterpreted as having an order relationship when they do not

 Ordinal encoding
- Used when the categories have a natural order, i.e. dataset is ordinal
- Each category is assigned a numerical value based on their order

 Binary encoding
- Similar to one-hot encoding, but all categories remain in one column
- Each category is represented by a unique set of binary digits

 Count encoding
- All categories remain in one column
- Each category is represented by the number of times that it appears in the variable

 Target encoding
- The encoded category must have some connections with the target to be predicted
- Target mean of each category is often calculated as the probability of achieving the target
- The corresponding target mean is assigned to each category
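A few of the encoding methods above can be sketched in plain Python; the department names are made-up examples:

```python
# One-hot, label, and count encoding for a small nominal variable
# (the department names are illustrative).
departments = ["Science", "Arts", "Science", "Engineering", "Arts", "Science"]
categories = sorted(set(departments))  # ['Arts', 'Engineering', 'Science']

# One-hot: one binary column per unique category (0 = not present, 1 = present)
one_hot = [[1 if d == c else 0 for c in categories] for d in departments]

# Label encoding: one integer per category (the order carries no real meaning)
label_of = {c: i for i, c in enumerate(categories)}
encoded_labels = [label_of[d] for d in departments]

# Count encoding: each category replaced by how often it appears in the variable
counts = {c: departments.count(c) for c in categories}
count_encoded = [counts[d] for d in departments]

print(one_hot[0], encoded_labels, count_encoded)
```

Dummy encoding would simply drop one of the one-hot columns, since the remaining N-1 columns already determine it.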

Feature Scaling
Feature scaling is the process of transforming a dataset to fit a specific scale, which improves the
performance of machine learning. There are 2 common methods: standardisation and normalisation.

The formula for standardisation is:

$$X'=\frac{X-\mu}{\sigma}$$

The formula for normalisation is:

$$X'=\frac{X-X_{min}}{X_{max}-X_{min}}$$
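Both formulas applied to a made-up feature column:

```python
import numpy as np

# A made-up feature column to scale.
X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardised = (X - X.mean()) / X.std()            # zero mean, unit variance
normalised = (X - X.min()) / (X.max() - X.min())   # rescaled into [0, 1]

print(normalised)
```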

Genetic Algorithm
Introduction to Genetic Algorithms
Genetic algorithms (under the class of evolutionary algorithms) are used for solving constrained and
unconstrained optimisation problems based on natural selection:

- The process begins with creating a random selection of solutions as the initial population,
known as Generation 0. The fitness of each solution is calculated by a fitness function.

- Crossover then occurs to produce offspring by swapping parts of the genetic code of any 2 solutions;
solutions with higher fitness scores are more likely to undergo this process.

- A mechanism known as elitism is triggered before the complete generational transition
occurs. It keeps the n best solutions from the previous generation to prevent the production of
poor generations.

- Mutation is then introduced to randomly alter the solutions for the sake of seeking an optimal
solution. The offspring generated are known as Generation 1, and the fitness score of each
solution in the new population is calculated.

The process repeats itself until an optimal solution is acquired or the maximum computable number of
generations is reached.
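The generation loop above can be sketched on the classic OneMax problem (fitness = number of 1-bits in a solution); all parameter values here are illustrative assumptions:

```python
import random

# A bare-bones genetic algorithm on the OneMax problem, following the
# steps above: fitness, crossover, elitism, mutation, repeat.
random.seed(0)
GENES, POP, ELITE, MUT_RATE, GENERATIONS = 20, 30, 2, 0.02, 60

def fitness(solution):
    return sum(solution)  # number of 1-bits

# Generation 0: a random selection of solutions
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for _ in range(GENERATIONS):
    ranked = sorted(population, key=fitness, reverse=True)
    next_gen = ranked[:ELITE]                      # elitism: keep the n best
    while len(next_gen) < POP:
        # fitter solutions are more likely to be chosen as parents
        p1, p2 = random.choices(ranked, weights=[fitness(s) + 1 for s in ranked], k=2)
        cut = random.randint(1, GENES - 1)         # single-point crossover
        child = p1[:cut] + p2[cut:]
        # mutation: flip each bit with a small probability
        child = [1 - g if random.random() < MUT_RATE else g for g in child]
        next_gen.append(child)
    population = next_gen

best = max(population, key=fitness)
print(fitness(best))
```

The loop would normally also stop early once an optimal solution (here, all 1s) is found.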

Training NN Models with GA

General Notes on Modelling


Measures of Predictive Model Accuracy
The R² statistic is the most common approach for measuring the accuracy of a predictive model. When
predicted values are compared with observed values, R² measures how much of the total variation in
the observed values is explained by the model's predictions.

The part of such variation that is NOT explained can be calculated using the fraction shown in the
equation below; therefore R² is calculated by subtracting the fraction from 1:

$$R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}, \quad \in[0,1]$$

where $\bar{y}$ is the mean of the observed values.

A higher R² value indicates a good fit for the regression model, and such a model is most likely
to accurately predict responses for future observations. Note that R² is also known as the coefficient of
determination; it is the square of the Pearson correlation coefficient for simple linear regression,
but the two are not directly related for complex multivariate regression.

R² can become misleading as the number of independent variables (or types of
observed values) increases, since it will either remain unchanged or increase in most cases, irrespective
of the significance of the variable. Adjusted R² resolves this problem and can be calculated
using the following equation:

$$\text{Adjusted } R^2=1-\frac{\left(1-R^2\right)\left(N-1\right)}{N-k-1}$$

$$N=\text{total sample size (no. of data points)} \qquad k=\text{no. of independent variables}$$
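Both statistics computed directly from the equations above, using made-up observed and predicted values:

```python
import numpy as np

# R^2 and adjusted R^2 from made-up observed and predicted values.
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])      # observed
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.6])  # predicted

ss_res = np.sum((y - y_hat) ** 2)             # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation
r2 = 1.0 - ss_res / ss_tot

N, k = len(y), 1                              # sample size, no. of predictors
adj_r2 = 1.0 - (1.0 - r2) * (N - 1) / (N - k - 1)

print(round(float(r2), 4), round(float(adj_r2), 4))
```

Note that the adjusted value is always slightly lower than R² whenever k > 0, which is the intended penalty for adding predictors.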

Mean squared error (MSE) is used to measure the deviation of predicted values from observed
values; it emphasises larger errors due to squaring and is more sensitive to outliers:

$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2, \quad \in[0,\infty)$$
In practice, the root mean squared error (RMSE) is more commonly used, as shown below:

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}, \quad \in[0,\infty)$$

When the error is small, MSE or RMSE may not be suitable as their values could become too small
for useful comparison due to squaring.

Mean absolute error (MAE) treats all errors equally. Absolute values of the errors are taken so that
they do not cancel each other out, which serves the same purpose as squaring in MSE and RMSE. It is
less sensitive to outliers and provides a more balanced view of the performance of the model:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \quad \in[0,\infty)$$
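The three metrics side by side on the same made-up values:

```python
import numpy as np

# MSE, RMSE and MAE computed on the same made-up observed/predicted values.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y - y_hat                 # [0.5, 0.0, -1.0, -0.5]
mse = np.mean(errors ** 2)         # squaring emphasises the larger errors
rmse = np.sqrt(mse)                # back in the same units as y
mae = np.mean(np.abs(errors))      # treats all errors equally

print(float(mse), round(float(rmse), 4), float(mae))
```

Note how the single largest error (1.0) pulls MSE and RMSE up more than MAE, matching the sensitivity discussion above.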
