ML Lec-6
LECTURE-6
BY
Dr. Ramesh Kumar Thakur
Assistant Professor (II)
School Of Computer Engineering
v In Batch Gradient Descent, all the training data is taken into consideration to take a single step.
v We take the average of the gradients of all the training examples and then use that mean gradient to update
our parameters. So that’s just one step of gradient descent in one epoch.
v In Batch Gradient Descent, since we are using the entire training set, the parameters are updated only
once per epoch.
v Batch Gradient Descent is great for convex or relatively smooth error manifolds.
v In this case, we move somewhat directly towards an optimum solution.
v The graph of cost vs. epochs is also quite smooth, because we are averaging over all the gradients of the
training data for a single step. The cost keeps on decreasing over the epochs.
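v As a concrete illustration, below is a minimal NumPy sketch of Batch Gradient Descent for linear regression with a mean-squared-error cost: one averaged gradient and one parameter update per epoch. The names (X, y, lr, n_epochs) are illustrative and not taken from the lecture.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_epochs=100):
    """One parameter update per epoch, using the gradient averaged over ALL examples."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for epoch in range(n_epochs):
        y_pred = X @ w + b              # predictions for the whole training set
        error = y_pred - y
        grad_w = (X.T @ error) / m      # mean gradient over all m examples
        grad_b = error.mean()
        w -= lr * grad_w                # a single step per epoch
        b -= lr * grad_b
    return w, b
```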
v In Batch Gradient Descent we were considering all the examples for every step of Gradient Descent. But
what if our dataset is very large?
v Suppose our dataset has 5 million examples; then, just to take one step, the model has to calculate the
gradients of all 5 million examples.
v This is not an efficient approach. To tackle this problem, we have Stochastic Gradient Descent.
v In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step.
v Since we are considering just one example at a time, the cost will fluctuate over the training examples and
will not necessarily decrease. But in the long run, you will see the cost decreasing, with fluctuations.
v Because the cost fluctuates so much, it may never reach the minimum exactly but will keep oscillating around it.
v SGD can be used for larger datasets. It converges faster when the dataset is large, as it updates the
parameters more frequently.
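v A matching sketch of SGD, under the same illustrative setup as the batch version above: the parameters are updated after every single example, which is why the cost jumps around between steps.

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10):
    """One parameter update per training example, so the cost fluctuates between steps."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for epoch in range(n_epochs):
        for i in np.random.permutation(m):   # visit the examples in a random order
            xi, yi = X[i], y[i]
            error = xi @ w + b - yi          # gradient from a single example
            w -= lr * error * xi
            b -= lr * error
    return w, b
```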
v Batch Gradient Descent converges directly to the minimum, while SGD converges faster for larger datasets. But
since in SGD we use only one example at a time, we cannot use a vectorized implementation.
v This can slow down the computations. To tackle this problem, a mixture of Batch Gradient Descent and
SGD is used.
v We neither use the entire dataset at once nor a single example at a time.
v Instead, we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a
mini-batch. Doing this gives us the advantages of both of the former variants (Batch GD and SGD).
v Just like in SGD, the cost over the epochs in mini-batch gradient descent fluctuates, because we are
averaging over only a small number of examples at a time.
v When we use mini-batch gradient descent, we update our parameters frequently and can also use a
vectorized implementation for faster computation (see the sketch below).
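v A sketch of mini-batch gradient descent in the same illustrative setup; batch_size is a hypothetical choice (32 here), not a value from the lecture.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, n_epochs=10, batch_size=32):
    """One update per mini-batch: frequent updates AND vectorized gradient computation."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for epoch in range(n_epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = Xb @ w + b - yb
            w -= lr * (Xb.T @ error) / len(batch)   # mean gradient over the mini-batch
            b -= lr * error.mean()
    return w, b
```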
v In practice, we often encounter different types of variables in the same dataset.
v A significant issue is that the range of the variables may differ a lot.
v Using the original scale may put more weight on the variables with a large range.
v In order to deal with this problem, we need to apply feature rescaling to the independent variables
(features) of the data in the data pre-processing step.
v The terms normalisation and standardisation are sometimes used interchangeably, but they usually refer to
different things.
v The goal of applying Feature Scaling is to make sure the features are on almost the same scale, so that each
feature is equally important and easier for most ML algorithms to process.
v Consider a dataset that contains a dependent variable (Purchased) and 3 independent variables (Country, Age,
and Salary). We can easily notice that the variables are not on the same scale, because the range of Age is
from 27 to 50, while the range of Salary goes from 48K to 83K. The range of Salary is much wider than
the range of Age. This will cause issues in our models, since many machine learning models, such
as k-means clustering and nearest-neighbour classification, are based on the Euclidean distance.
v When we calculate the Euclidean distance, the Salary term (x2 - x1)² is much bigger than the Age term
(y2 - y1)², which means the Euclidean distance will be dominated by Salary if we do not
apply feature scaling.
v The difference in Age contributes much less to the overall distance (see the numeric check below).
v Therefore, we should use Feature Scaling to bring all values to the same order of magnitude and, thus, solve this
issue.
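v A quick numeric check, using made-up (Age, Salary) values in the spirit of the dataset above, shows how Salary dominates the unscaled Euclidean distance:

```python
import numpy as np

# Two illustrative people: (Age, Salary). The values are made up for the example.
a = np.array([27, 48000.0])
b = np.array([50, 83000.0])

print(np.linalg.norm(a - b))   # ~35000: the distance is almost entirely the Salary gap
print((a[0] - b[0]) ** 2)      # Age contribution:    529
print((a[1] - b[1]) ** 2)      # Salary contribution: 1.225e+09
```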
v To do this, there are primarily two methods called Standardisation and Normalisation.
v The result of standardisation (or Z-score normalisation) is that the features are rescaled so that their
mean and standard deviation are 0 and 1, respectively. The equation is: x' = (x - μ) / σ, where μ is the
mean and σ is the standard deviation of the feature.
v Rescaling feature values in this way is useful for the optimization algorithms, such as gradient
descent, that are used within machine learning algorithms that weight inputs (e.g., regression and neural
networks).
v Rescaling is also used for algorithms that use distance measurements, for example, K-Nearest-Neighbours
(KNN).
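v A minimal sketch of Z-score standardisation with NumPy (scikit-learn's StandardScaler computes the same transform); the small Age/Salary array is illustrative only.

```python
import numpy as np

# Illustrative Age and Salary columns (one row per person)
X = np.array([[27.0, 48000.0],
              [35.0, 58000.0],
              [50.0, 83000.0]])

# Z-score standardisation: x' = (x - mean) / std, applied column-wise
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~[0, 0]
print(X_std.std(axis=0))    # [1, 1]
```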
v Another common approach is the so-called Max-Min Normalization (Min-Max scaling).
v This technique re-scales the features so that their values lie between 0 and 1.
v For every feature, the minimum value of that feature gets transformed into 0, and the maximum value gets
transformed into 1. The general equation is: x' = (x - min(x)) / (max(x) - min(x)).
v In contrast to standardisation, we obtain smaller standard deviations through Max-Min
Normalisation. Let me illustrate this using the dataset above.
v Max-Min Normalisation typically allows us to transform data with varying scales so that no specific
dimension dominates the statistics, and it does not require making a very strong assumption about the
distribution of the data; this suits algorithms such as k-nearest neighbours and artificial neural networks.
v However, Normalisation does not handle outliers very well.
v In contrast, standardisation handles outliers better and facilitates convergence for some
computational algorithms, such as gradient descent.
v Therefore, we usually prefer standardisation over Min-Max Normalisation.
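v For comparison, a minimal sketch applying both rescalings to the same illustrative Age/Salary array (scikit-learn's MinMaxScaler and StandardScaler give equivalent results):

```python
import numpy as np

X = np.array([[27.0, 48000.0],
              [35.0, 58000.0],
              [50.0, 83000.0]])

# Max-Min normalisation: x' = (x - min) / (max - min), column-wise -> values in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardisation: x' = (x - mean) / std, column-wise -> mean 0, std 1 per column
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)     # every column now lies between 0 and 1
print(X_standard)   # every column has mean 0 and standard deviation 1
```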