ML Unit 1

This document provides an introduction to machine learning, including: 1. Definitions of machine learning, its relationship to other fields, and examples of applications. 2. Overviews of the history, need for, advantages, and disadvantages of machine learning. 3. Descriptions of key mathematical foundations for machine learning, including linear algebra, probability and statistics, Bayesian conditional probability, and descriptive statistics.

UNIT I INTRODUCTION AND MATHEMATICAL FOUNDATIONS

What is Machine Learning? Need –History – Definitions – Applications - Advantages,


Disadvantages & Challenges -Types of Machine Learning Problems – Mathematical
Foundations - Linear Algebra & Analytical Geometry -Probability and Statistics- Bayesian
Conditional Probability -Vector Calculus & Optimization - Decision Theory - Information
theory

What is Machine Learning?

• Machine learning is a field of computer science that gives computers the ability to
learn without being explicitly programmed.
• Machine learning is closely related to (and often overlaps with) computational
statistics, which also focuses on prediction-making through the use of computers.
• It has strong ties to mathematical optimization, which delivers methods, theory
and application domains to the field.
• Machine learning is sometimes conflated with data mining; the latter subfield focuses
more on exploratory data analysis, which on the machine-learning side is known as
unsupervised learning.
• Machine learning can also be unsupervised: it can be used to learn and establish
baseline behavioral profiles for various entities, and then to find meaningful
anomalies.

Need

ML tools are artificial-intelligence algorithmic applications that allow systems to learn and
improve without human intervention. These tools are essential because they let us prepare
data and train models.

History

The history of machine learning starts in 1943 with the first mathematical model of neural
networks, presented in the scientific paper "A Logical Calculus of the Ideas Immanent in
Nervous Activity" by Warren McCulloch and Walter Pitts. Then, in 1949, Donald Hebb
published the book The Organization of Behavior.

Definition

Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions.
Applications

Advantages

1. Scope of Improvement
2. Enhanced Experience in Online Shopping and Quality Education
3. Wide Range of Applicability
4. Automation

Disadvantages

1. Data Acquisition
2. Time and Resources
3. Interpretation of Results

Mathematical Foundations

This unit gives an introduction to key mathematical concepts at the heart of machine learning.
The focus is on matrix methods and statistical models, with real-world applications ranging
from classification and clustering to denoising and recommender systems.

Linear Algebra & Analytical Geometry

Linear algebra is the cornerstone of mathematical concepts in machine learning. A solid grasp of
vectors, matrices, and operations like matrix multiplication is essential for understanding
algorithms, developing models, and navigating the intricacies of data transformations.
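As a minimal sketch of these operations (using NumPy, with made-up values, not data from the text), matrix-vector and matrix-matrix multiplication look like this:

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the operations.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # a 2x2 matrix
x = np.array([1.0, -1.0])    # a 2-vector

y = A @ x      # matrix-vector product: each entry is a dot product of a row with x
B = A @ A.T    # matrix-matrix product with the transpose of A

print(y)       # [-1. -1.]
print(B)       # [[ 5. 11.]
               #  [11. 25.]]
```

These two operations (applying a linear map to a vector, and composing linear maps) are the building blocks behind most data transformations in ML models.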

Probability and Statistics


Most people have an intuitive understanding of degrees of probability, which is why we use
words like “probably” and “unlikely” in our daily conversation, but we will talk about how to
make quantitative claims about those degrees [1].

In probability theory, an event is a set of outcomes of an experiment to which a probability is
assigned. If E represents an event, then P(E) represents the probability that E will occur. A
situation where E might happen (success) or might not happen (failure) is called a trial.

This event can be anything like tossing a coin, rolling a die or pulling a colored ball out of a
bag. In these examples the outcome of the event is random, so the variable that represents the
outcome of these events is called a random variable.

Let us consider a basic example of tossing a coin. If the coin is fair, then it is just as likely to come
up heads as it is to come up tails. In other words, if we were to repeatedly toss the coin many
times, we would expect about half of the tosses to be heads and half to be tails. In this case, we
say that the probability of getting a head is 1/2 or 0.5.

The empirical probability of an event is given by the number of times the event occurs divided by
the total number of trials observed. If we run n trials and observe s successes, the empirical
probability of success is s/n. In the above example, any particular sequence of coin tosses may
have more or less than exactly 50% heads.

Theoretical probability on the other hand is given by the number of ways the particular event
can occur divided by the total number of possible outcomes. So a head can occur once and
possible outcomes are two (head, tail). The true (theoretical) probability of a head is 1/2.
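The empirical/theoretical distinction above can be sketched with a short simulation (a hypothetical experiment, not taken from the text):

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# Theoretical probability of a head for a fair coin: 1 favorable outcome / 2 possible.
theoretical = 1 / 2

# Empirical probability: s successes observed over n trials.
n = 10_000
s = sum(random.random() < 0.5 for _ in range(n))  # simulate n fair coin tosses
empirical = s / n

# The empirical value should be close to 0.5, but rarely exactly 0.5.
print(theoretical, empirical)
```

Increasing n makes the empirical probability converge toward the theoretical one (the law of large numbers).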

Joint Probability

Probability of events A and B, denoted by P(A and B) or P(A ∩ B), is the probability that events A
and B both occur: P(A ∩ B) = P(A) · P(B). This only applies if A and B are independent, which
means that if A occurred, that doesn't change the probability of B, and vice versa.
Conditional Probability

Now suppose A and B are not independent: if B occurred, the probability of A changes. In that
case it is often useful to compute the conditional probability of A given that B occurred:

P(A|B) = P(A ∩ B) / P(B)

Similarly, P(B|A) = P(A ∩ B) / P(A). We can therefore write the joint probability of A and B as
P(A ∩ B) = P(A) · P(B|A), which means: "the chance of both things happening is the chance that
the first one happens, and then the second one given that the first happened."
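The definitions above can be checked by brute-force enumeration. The sketch below uses a standard two-dice example (not from the text) in which the two events are dependent:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

A = {(d1, d2) for d1, d2 in outcomes if d1 + d2 == 8}  # event A: the sum is 8
B = {(d1, d2) for d1, d2 in outcomes if d1 % 2 == 0}   # event B: first die is even

P = lambda E: Fraction(len(E), len(outcomes))  # probability by counting outcomes

# Conditional probability via the definition P(A|B) = P(A ∩ B) / P(B).
p_joint = P(A & B)           # 3 outcomes: (2,6), (4,4), (6,2) -> 1/12
p_A_given_B = p_joint / P(B) # (1/12) / (1/2) = 1/6

# Note P(A) = 5/36 != P(A|B) = 1/6, so A and B are dependent:
# knowing the first die is even raises the chance that the sum is 8.
print(p_joint, p_A_given_B)
```

The same enumeration also verifies the chain rule P(A ∩ B) = P(B) · P(A|B).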

Bayes’ Theorem

Bayes' theorem is a relationship between the conditional probabilities of two events. For
example, if we want to find the probability of selling ice cream on a hot and sunny day, Bayes'
theorem gives us the tools to use prior knowledge about the likelihood of selling ice cream on any
other type of day (rainy, windy, snowy, etc.):

P(H|E) = P(E|H) · P(H) / P(E)

where H and E are events, and P(H|E) is the conditional probability that event H occurs given that
event E has already occurred. The probability P(H) in the equation is basically frequency analysis:
given our prior data, what is the probability of the event occurring. The P(E|H) in the equation is
called the likelihood, and is essentially the probability of observing the evidence E given the
hypothesis H. P(E) is the probability of observing the evidence at all.

Let H represent the event that we sell ice cream and E be the event of the weather. Then we might
ask: what is the probability of selling ice cream on any given day, given the type of
weather? Mathematically this is written as P(H = ice cream sale | E = type of weather), which is
the left-hand side of the equation. P(H) on the right-hand side is the expression that
is known as the prior, because we might already know the marginal probability of the sale of ice
cream. In our example this is P(H = ice cream sale), i.e. the probability of selling ice cream
regardless of the type of weather outside. For example, I could look at data that said 30 people out
of a potential 100 actually bought ice cream at some shop somewhere. So my P(H = ice cream
sale) = 30/100 = 0.3, prior to knowing anything about the weather. This is how Bayes'
theorem allows us to incorporate prior information [2].

A classic use of Bayes' theorem is in the interpretation of clinical tests. Suppose that during a
routine medical examination, your doctor informs you that you have tested positive for a rare
disease. You are also aware that there is some uncertainty in the results of these tests. Assume the
test has a sensitivity (also called the true positive rate) of 95%, meaning it is positive for 95% of
the patients with the disease, and a specificity (also called the true negative rate) of 95%, meaning
it is negative for 95% of the healthy patients.

If we let "+" and "−" denote a positive and negative test result, respectively, then the test
accuracies are the conditional probabilities: P(+|disease) = 0.95 and P(−|healthy) = 0.95.

In Bayesian terms, we want to compute the probability of disease given a positive
test, P(disease|+):

P(disease|+) = P(+|disease) · P(disease) / P(+)

How do we evaluate P(+), the probability of any positive result? We have to consider two
possibilities, P(+|disease) and P(+|healthy). The probability of a false positive, P(+|healthy), is the
complement of P(−|healthy). Thus P(+|healthy) = 0.05.

Importantly, Bayes' theorem reveals that in order to compute the conditional probability that you
have the disease given that the test was positive, you need to know the "prior" probability that you
have the disease, P(disease), given no information at all. That is, you need to know the overall
incidence of the disease in the population to which you belong. Assuming these tests are applied
to a population where the disease incidence is 0.5%, P(disease) = 0.005, which
means P(healthy) = 0.995.

So, P(disease|+) = 0.95 × 0.005 / (0.95 × 0.005 + 0.05 × 0.995) ≈ 0.087

In other words, despite the apparent reliability of the test, the probability that you actually have
the disease is still less than 9%. Getting a positive result increases the probability that you have
the disease, but it is incorrect to interpret the 95% test accuracy as the probability that you have
the disease.
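The clinical-test calculation above can be reproduced in a few lines; the values below simply restate the numbers given in the text:

```python
# Bayes' theorem applied to the clinical-test example.
sensitivity = 0.95      # P(+ | disease), the true positive rate
specificity = 0.95      # P(- | healthy), the true negative rate
prevalence  = 0.005     # P(disease), the prior: overall incidence in the population

p_false_positive = 1 - specificity               # P(+ | healthy) = 0.05

# Total probability of a positive test, over both possibilities:
p_positive = (sensitivity * prevalence
              + p_false_positive * (1 - prevalence))

# Posterior probability of disease given a positive test.
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # 0.087
```

Changing `prevalence` shows how strongly the posterior depends on the prior: for a common disease the same test is far more convincing.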

Descriptive Statistics

Descriptive statistics refers to methods for summarizing and organizing the information in a data
set. We will use an example data set of 10 loan applicants to illustrate some of the statistical
concepts [4].

Elements: The entities for which information is collected are called the elements. In our example,
the elements are the 10 applicants. Elements are also called cases or subjects.

Variables: A characteristic of an element is called a variable. It can take different values for
different elements, e.g., marital status, mortgage, income, rank, year, and risk. Variables are also
called attributes.

Variables can be either qualitative or quantitative.

Qualitative: A qualitative variable enables the elements to be classified or categorized according
to some characteristic. In our example, the qualitative variables are marital status, mortgage, rank,
and risk. Qualitative variables are also called categorical variables.

Quantitative: A quantitative variable takes numeric values and allows arithmetic to be
meaningfully performed on it. In our example, the quantitative variables are income and year.
Quantitative variables are also called numerical variables.

Discrete Variable: A numerical variable that can take either a finite or a countable number of
values is a discrete variable, for which each value can be graphed as a separate point, with space
between the points. 'year' is an example of a discrete variable.

Continuous Variable: A numerical variable that can take infinitely many values is a continuous
variable, whose possible values form an interval on the number line, with no space between the
points. ‘income’ is an example of a continuous variable.

Population: A population is the set of all elements of interest for a particular problem. A
parameter is a characteristic of a population.

Sample: A sample consists of a subset of the population. A characteristic of a sample is called a
statistic.

Random sample: A sample taken so that each element of the population has an equal chance of
being selected.
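A minimal sketch of the parameter-vs-statistic distinction, using hypothetical income values (the applicant table itself is not reproduced in this text):

```python
import random

random.seed(1)  # fixed seed so the sample is reproducible

# Hypothetical incomes (in $1000s) for a small population of 10 applicants.
population = [25, 32, 41, 48, 55, 60, 72, 80, 95, 120]

# A parameter is a characteristic of the population...
parameter_mean = sum(population) / len(population)   # 62.8

# ...while a statistic is a characteristic of a sample.
# random.sample gives each element an equal chance of selection (a random sample).
sample = random.sample(population, 4)
statistic_mean = sum(sample) / len(sample)

print(parameter_mean, statistic_mean)
```

Different random samples give different statistics; all of them estimate the same population parameter.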

Bayesian Conditional Probability

The probability of an event A based on the occurrence of another event B is termed conditional
probability. It is denoted as P(A|B) and represents the probability of A when event B has
already happened. Bayes' theorem relates the two directions of conditioning:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B (with P(B) never equal to zero),
P(A|B) is the probability of event A when event B happens, and
P(B|A) is the probability of event B when A happens.

Vector Calculus & Optimization

Consider a function of two variables whose partial derivatives are 2x (with respect to x) and
−2y (with respect to y); for example, f(x, y) = x² − y². To form the vector derivative, simply
compute the partial derivative with respect to each variable and write the results down in vector
notation: [2x, −2y]. This trivial example illustrates the main idea behind vector derivatives:
compute the partial derivative of the function with respect to each of its variables, then collect
the results in vector form.

(Source: notes from the Marist College Machine Learning class by Dr. Lauria.)

Gradient 💰

If you are reading this article, I presume that you have explored the topic of Machine Learning
before. If so, you may have heard or read about the concept of a gradient, in particular Gradient
Descent, which is the de facto approach for optimizing Machine Learning algorithms. Here is
where the money is! But what is a gradient anyway? A gradient is a vector that packages
all the partial-derivative information about a certain function, and it is denoted with an
upside-down triangle, ∇. In the example above, the vector of partial derivatives [2x, −2y] is the
gradient ∇f of the function we were interested in. It will be beneficial to visualize this concept
before exploring it further.

Optimization is all about minimizing a specific function, which means finding its global minima.
We achieve this by computing the derivative of the function; this is where everything we just
learned comes into play. For a simple continuous function of one variable, just compute the
derivative and set it equal to zero. This is called the first-order condition. For a
higher-dimensional function, we compute the partial derivatives instead. Imagine starting at a
random point on a 3D surface. We compute the partial derivatives, which gives us a vector. Since
vectors give us magnitude and direction, we can select a new point in the direction in which the
function is decreasing (because we are trying to minimize it). Keep doing this until the gradient is
zero or very close to zero. Done! You have just optimized your first function.
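The procedure just described can be sketched in a few lines. This is a minimal gradient-descent loop for f(x, y) = x² + y², whose gradient is ∇f = [2x, 2y] (the starting point and learning rate are arbitrary illustrative choices):

```python
# Gradient of f(x, y) = x^2 + y^2, computed analytically.
def grad(x, y):
    return 2 * x, 2 * y

x, y = 3.0, -4.0   # an arbitrary starting point on the surface
lr = 0.1           # learning rate: how far to step each iteration

for _ in range(200):
    gx, gy = grad(x, y)
    if (gx**2 + gy**2) ** 0.5 < 1e-8:   # stop once the gradient is nearly zero
        break
    # Step in the negative gradient direction, since we are minimizing.
    x, y = x - lr * gx, y - lr * gy

print(x, y)   # both values end up very close to the global minimum at (0, 0)
```

Each step shrinks the coordinates by a constant factor here, so the iterates converge geometrically to the minimum; for less well-behaved functions the learning rate has to be chosen more carefully.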

The gradient vector can be interpreted as the "direction and rate of fastest increase". Note that,
naturally, following the gradient maximizes a function, since it points in the direction of fastest
increase; in practice we are usually concerned with finding local or global minima, so we follow
the negative gradient to make sure we are minimizing instead. If the gradient of a function is
non-zero at a given point, the direction of the gradient is the direction in which the function
increases most quickly from that point, and the magnitude of the gradient is the rate of increase
in that direction. Khan Academy's videos on partial derivatives visualize this idea neatly, and I
recommend watching one now that the concept of a gradient vector has been introduced. This is
exactly the calculus way of thinking from the first section, applied again and again.

As a visual example, consider the 3D function f(x, y) = x² + y², which takes x and y as inputs and
sums their squares. One partial derivative gives us the slope of a tangent to the surface at a given
point along one axis; the collection of all partial derivatives gives us the gradient vector, which
we use to optimize our ML algorithms. Before looking at how these methods and ideas are
implemented, here is the definition of matrix calculus, since it is a term you are likely to
encounter along your journey in Machine/Deep Learning and it often causes confusion:

Matrix calculus is just a specialized notation for doing multivariable calculus such that the
collection of partial derivatives with respect to multiple variables of collections of multivariate
functions are converted into matrices and vectors that can be treated as single entities, thereby
simplifying optimization operations.

Decision Theory

Decision theory is the study of an agent's rational choices, and it supports many kinds of progress
in technology, such as work on machine learning and artificial intelligence. Decision theory looks
at how decisions are made, how multiple decisions influence one another, and how
decision-making parties deal with uncertainty.

Information Theory

Information theory represents data in the form of codewords. These codewords are
representations of the actual data elements in the form of sequences of binary digits, or bits.
There are various techniques to map each symbol (data element) to a corresponding codeword.
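As a small illustration (a hypothetical prefix code, not from the text), here is one way to map symbols to binary codewords, along with the Shannon entropy of the source, which lower-bounds the average codeword length of any uniquely decodable code:

```python
import math

# A toy prefix code: no codeword is a prefix of another,
# so an encoded message can be decoded unambiguously.
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # symbol probabilities

def encode(message):
    """Map each symbol to its codeword and concatenate the bits."""
    return "".join(code[s] for s in message)

# Expected number of bits per symbol under this code.
avg_length = sum(probs[s] * len(cw) for s, cw in code.items())

# Shannon entropy of the source, in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs.values())

print(encode("abad"))       # "0100111"
print(avg_length, entropy)  # 1.75 1.75 -- this code is optimal for these probabilities
```

Here the average codeword length equals the entropy exactly because every probability is a power of 1/2; in general the entropy is only a lower bound.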
