ML Intro
In this module we’ll discuss ideas. No code, no Python [only for a while though ;)]. We’ll not get too technical or mathematical; we’ll keep our discussion focused on concepts. Let’s begin.
At the end of the day, whatever we are doing here is about money. To put it more appropriately, we are developing these techniques to solve business problems. Real business problems generally do not come conveniently dressed up as data problems. For example, consider this:
• A bank is making too many losses because of defaulters on retail loans
It’s a genuine business problem, but on the face of it, it isn’t a data problem yet. So what is a data problem then? A data problem is a business problem expressed in terms of the data [potentially] available in the business process pipeline.
It has two main components:
• Response/Goal/Outcome
• A set of factors/features which affect our goal/response/outcome
Let’s look at the loan default problem and find these components. The outcome is loan default, which we would like to predict when considering giving a loan to a prospect. What factors could help us in doing that? Banks collect a lot of information on a loan application, such as financial data and personal information. In addition, they also make queries regarding credit history to various agencies. We could use all these features to predict whether a customer is going to default on their loan or not, and then reconsider our decision depending on the result.
Here are a few more business problems which you can try converting into data problems:
* A marketing campaign is causing spam complaints from existing customers
* Hospitals cannot afford to have additional staff all year round, who are needed only when patient intake rises above a certain level
* An ecommerce company wants to know how it should plan its budget for cloud servers
* How many different election campaign strategies should a party opt for?
If you went through the business problems mentioned above, you’d have realized there are mainly three kinds of problems, which can then be clubbed into two categories:
1. Supervised
2. Unsupervised
Supervised problems are those which have an explicit outcome, such as default on loan, required number of staff, server load, etc. Within these, you can see two separate kinds:
1. Regression
2. Classification
Regression problems are those where the outcome is a continuous numeric value. Classification problems, on the other hand, have their outcome as categories [e.g. good/bad/worse; yes/no; 1/0].
Unsupervised problems are those where there is no explicit, measurable outcome associated with the problem. You just need to find general patterns hidden in the measured factors. Finding different customer segments or electoral segments, or finding latent factors in the data, comes under such problems. Now the categories of problems can be formally grouped like this:
1. Supervised
   1. Regression
   2. Classification
2. Unsupervised
In solving a business problem, our primary goal remains to solve it as well as possible. When it comes to predicting an outcome or achieving a goal, we need to be as accurate as possible with the given information/measured features.
To measure accuracy, we must first define error. To take a simple example, let’s consider a scenario where we have information on today’s temperature and humidity and we are trying to predict how much it will rain tomorrow.
Say we come up with a linear equation for prediction which looks like this:

$$\text{Rain}_{\text{predicted}} = \beta_0 + \beta_1 T + \beta_2 H$$

where $T$ and $H$ are temperature and humidity respectively, and the $\beta$s are constants estimated from the data.
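To make this concrete, here is a minimal Python sketch of the prediction equation; the $\beta$ values and the input readings below are made up purely for illustration, not estimated from any real data.

```python
# A minimal sketch of the linear prediction equation above.
# The beta values are hypothetical; in practice they would be
# estimated from historical data.
b0, b1, b2 = 4.0, 0.3, 0.1

def predict_rain(temperature, humidity):
    """Predicted rainfall from today's temperature and humidity."""
    return b0 + b1 * temperature + b2 * humidity

print(predict_rain(temperature=30.0, humidity=70.0))  # 20.0
```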
Now, there can be many other factors, as well as general randomness, affecting rain. Our predictions with the equation given above will never be spot on; there will always be some error associated with them. To calculate this error, we first write the prediction for day $i$:
$$\widehat{\text{Rain}}_i = \beta_0 + \beta_1 T_i + \beta_2 H_i$$
where $i$ can be taken as the day number and the hat symbol on top of $\text{Rain}_i$ means it is predicted. The error for day $i$ is then the difference $\text{Rain}_i - \widehat{\text{Rain}}_i$. Now, to calculate the overall error we can sum these errors over all days:
$$\text{TotalError} = \sum_i \left(\text{Rain}_i - \widehat{\text{Rain}}_i\right)$$
This is a bad way to calculate the overall error, because many positive and negative errors will cancel each other out, and simple summation will not give a good idea of the true error. To avoid this issue we can either square the errors and then sum them, or take their absolute values and then sum. The former is known as the error sum of squares, or SSE for short:
$$SSE = \sum_i \left(\text{Rain}_i - \widehat{\text{Rain}}_i\right)^2 = \sum_i \left(\text{Rain}_i - (\beta_0 + \beta_1 T_i + \beta_2 H_i)\right)^2$$
Cost Functions
You can see here that if I give different values of the $\beta$s, I will get different values of SSE. In machine learning lingo, SSE here is a cost function which we are trying to minimize, and the $\beta$s are the parameters with respect to which we are trying to minimize it.
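To see this concretely, here is a hedged sketch in Python: the handful of observations below are invented, and we simply evaluate the SSE for two different choices of the $\beta$s to show that different parameters give different costs.

```python
# Evaluating the SSE cost for two different parameter choices.
# The observations below are invented for illustration only.
temps  = [30.0, 25.0, 28.0]   # T_i
humids = [70.0, 80.0, 60.0]   # H_i
rains  = [20.0, 16.0, 17.0]   # observed Rain_i

def sse(b0, b1, b2):
    """Error sum of squares for the linear rain model."""
    total = 0.0
    for t, h, r in zip(temps, humids, rains):
        predicted = b0 + b1 * t + b2 * h
        total += (r - predicted) ** 2
    return total

print(sse(4.0, 0.3, 0.1))    # one choice of betas
print(sse(0.0, 0.5, 0.05))   # a different choice gives a different cost
```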
In many of the coming modules we’ll discuss different algorithms, where we’ll define different cost functions as per the goal of each algorithm.
Now, mathematically optimizing the cost function in order to obtain the best parameters is out of the scope of this course. We’ll instead focus on understanding algorithms which can be used to solve different kinds of problems. The underlying optimization routines for parameter estimation are implemented well in the software packages, and we’ll leave it at that.
However, leaving that totally as a black box will not be good for our curious minds. I will discuss a very simple yet powerful optimization idea known as gradient descent. Many underlying optimization algorithms are nothing but variations of the same idea.
Gradient Descent
Before we begin, here is a quick look back at one idea from basic calculus. Consider a function $y = f(x)$. In the picture below we are looking at a very small segment of this function.

[Figure 1: Gradient and Change in Function]

For the triangle formed in the picture, we can write
$$\tan(\theta) = \frac{\Delta y}{\Delta x} \tag{1}$$

For very small $\Delta x$ and $\Delta y$ we can write $\tan(\theta)$ in terms of the gradient as follows:

$$\tan(\theta) = \frac{\partial y}{\partial x} \tag{2}$$

$$\Delta y = \frac{\partial y}{\partial x} \Delta x \tag{3}$$
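Relation (3) is easy to verify numerically. The sketch below uses $y = x^2$ as an arbitrary example function (so $dy/dx = 2x$) and compares the actual change $\Delta y$ with the gradient-based approximation for a small $\Delta x$.

```python
# Numerically checking relation (3): delta_y ≈ (dy/dx) * delta_x.
# We use y = x**2 as an arbitrary example, so dy/dx = 2x.

def f(x):
    return x ** 2

x = 3.0
dx = 0.001                      # a very small change in x

actual_dy = f(x + dx) - f(x)    # true change in y
approx_dy = (2 * x) * dx        # gradient * delta_x, from relation (3)

print(actual_dy)   # approximately 0.006001
print(approx_dy)   # 0.006
```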
The idea in (3) can be generalized to higher dimensions as well. Now consider a cost function $C$ dependent on parameters $v_1$ and $v_2$:

$$C = f(v_1, v_2)$$
Let’s say we start with some default values of these parameters, say 1 and 2. Now I would like to change these parameters to new values in such a way that the change in my cost function is negative. Let’s write the change in the cost function using the relation between change and gradient seen in (3), extended to higher dimensions:
$$\Delta C = \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \tag{4}$$

$$\nabla C = \left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right) \tag{5}$$
If we write the change in the parameters as a vector,

$$\Delta v = (\Delta v_1, \Delta v_2) \tag{6}$$

then (4) can be written compactly as

$$\Delta C = \nabla C \cdot \Delta v \tag{7}$$
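Relation (7) can be checked numerically just like (3). Below is a small sketch using $C = v_1^2 + v_2^2$ as an arbitrary cost function (so $\nabla C = (2v_1, 2v_2)$), starting from the default values 1 and 2 mentioned above.

```python
# Numerically checking relation (7): delta_C ≈ grad C · delta_v.
# We use C = v1**2 + v2**2 as an arbitrary cost function,
# so grad C = (2*v1, 2*v2).

def cost(v1, v2):
    return v1 ** 2 + v2 ** 2

v1, v2 = 1.0, 2.0             # current parameter values
dv1, dv2 = 0.001, -0.002      # a small change in the parameters

actual_dC = cost(v1 + dv1, v2 + dv2) - cost(v1, v2)
approx_dC = (2 * v1) * dv1 + (2 * v2) * dv2   # grad C · delta_v

print(actual_dC)   # approximately -0.005995
print(approx_dC)   # -0.006
```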
Now we need to figure out how to change our parameters; that is, what should the value of $\Delta v$ be so that the change in the cost function, $\Delta C$, is always negative?
Consider:

$$\Delta v = -\eta \nabla C$$

Substituting this into (7) gives

$$\Delta C = \nabla C \cdot (-\eta \nabla C) = -\eta \|\nabla C\|^2 \tag{8}$$
which is always negative as long as $\eta$ (the learning rate) is positive. The result in (8) simply means that by changing our parameters in the direction of the negative gradient of the cost function, we can always reduce the cost function. At the optimal point the gradient becomes zero, so we can no longer update the parameters; those values of the parameters are the optimal values.
This idea is known as gradient descent: given the cost function, we start with some default values of the parameters and change them in the manner shown above in order to obtain the optimal values which minimise the cost function.
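To tie this together, here is a minimal gradient descent sketch for the SSE cost of our rain model. The tiny dataset, the starting values, the learning rate $\eta$ and the number of steps are all chosen arbitrarily for illustration; this is a bare-bones version of the idea, not how production libraries implement it.

```python
# A bare-bones gradient descent for the rain model's SSE cost.
# Data, starting values, learning rate and step count are illustrative.
temps  = [30.0, 25.0, 28.0]   # T_i
humids = [70.0, 80.0, 60.0]   # H_i
rains  = [20.0, 16.0, 17.0]   # observed Rain_i

eta = 1e-5                    # learning rate (the eta in delta_v = -eta * grad C)
b0, b1, b2 = 0.0, 0.0, 0.0    # arbitrary starting values

for step in range(50000):
    # Gradient of SSE with respect to each beta, summed over observations.
    g0 = g1 = g2 = 0.0
    for t, h, r in zip(temps, humids, rains):
        residual = r - (b0 + b1 * t + b2 * h)
        g0 += -2 * residual        # dSSE/d b0
        g1 += -2 * residual * t    # dSSE/d b1
        g2 += -2 * residual * h    # dSSE/d b2
    # Move the parameters in the direction of the negative gradient.
    b0 -= eta * g0
    b1 -= eta * g1
    b2 -= eta * g2

print(b0, b1, b2)  # parameters that approximately minimise the SSE
```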
In the coming discussions of multiple algorithms we will encounter many kinds of cost functions, starting with SSE in linear models in the next module.
Note: Keep in mind that the underlying parameter estimation implemented in various software packages does not use gradient descent in exactly the form we just discussed, but variations of the same idea.