
MODULE I

THE INGREDIENTS OF MACHINE LEARNING

Dr. Shivani Gupta


OUTLINE
 Introduction to Probability and Distributions,
 Introduction to Linear algebra,

 Eigenvalues and Eigenvectors,

 Matrix Decomposition.

 Various Learning Paradigms,

 Perspectives and Issues,

 Version Spaces,

 Finite and Infinite Hypothesis Spaces,

 PAC Learning,

 VC Dimension.
PROBABILITY
 Probability is the branch of mathematics concerning
numerical descriptions of how likely an event is to occur,
or how likely it is that a proposition is true.
 The probability of an event is a number between 0 and 1,
where, roughly speaking, 0 indicates impossibility of the
event and 1 indicates certainty.
PROBABILITY OF AN EVENT

 The probability of an event can be calculated directly by counting the occurrences of the event and dividing by the total number of possible outcomes:
Probability of an event: P(E) = Number of favourable outcomes / Total number of outcomes
The assigned probability is a fractional value and always lies in the range between 0 and 1, where 0 indicates no probability and 1 represents full probability (a short simulation of this counting rule appears at the end of this section).
 For example, the probability of flipping a coin and it
being heads is ½, because there is 1 way of getting a
head and the total number of possible outcomes is 2 (a
head or tail). We write P(heads) = ½ .
 Probability theory has three important concepts:
 Event (A). An outcome to which a probability is assigned.
 Sample Space (S). The set of possible outcomes or events.
 Probability Function (P). The function used to assign a
probability to an event.
 The likelihood of an event (A) being drawn from the
sample space (S) is determined by the probability
function (P).
 The shape or distribution of all events in the sample
space is called the probability distribution. 
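
As a quick illustration of the counting rule above, here is a minimal sketch (assuming NumPy is installed) that estimates P(heads) by simulating a large number of fair coin flips:

import numpy as np

rng = np.random.default_rng(0)

# simulate 100,000 fair coin flips: 1 = heads, 0 = tails
flips = rng.integers(0, 2, size=100_000)

# P(E) = number of favourable outcomes / total number of outcomes
p_heads = flips.sum() / flips.size
print(p_heads)   # close to the exact probability 0.5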
COMMON DATA TYPES

 Discrete data, as the name suggests, can take only specified values. For example, when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.
 Continuous data can take any value within a given range; the range may be finite or infinite. For example, a person’s weight or height, or the length of a road: a weight can be any value such as 54 kg, 54.5 kg, or 54.5436 kg.
TYPES OF DISTRIBUTIONS

 Bernoulli Distribution
 Uniform Distribution

 Normal Distribution
BERNOULLI DISTRIBUTION

 A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial.
 So the random variable X which has a Bernoulli
distribution can take value 1 with the probability of
success, say p, and the value 0 with the probability of
failure, say q or 1-p.
 Here, the occurrence of a head denotes success, and the
occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of
getting a tail since there are only two possible outcomes.
 The probability mass function is given by: P(X = x) = p^x (1-p)^(1-x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1-p.

o The probabilities of success and failure need not be equally likely, like the result of a fight between me and Mary Kom. She is pretty much certain to win. So in this case the probability of my success is 0.15 while that of my failure is 0.85.
 The expected value of a random variable X from a
Bernoulli distribution is found as follows:
E(X) = 1*p + 0*(1-p) = p

 The variance of a random variable X from a Bernoulli distribution is:
V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
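
A minimal NumPy sketch of these formulas: draw samples from a Bernoulli distribution with success probability p and check that the sample mean and variance come out close to p and p(1-p):

import numpy as np

p = 0.15                       # probability of success
rng = np.random.default_rng(1)

# Bernoulli(p) samples: 1 with probability p, 0 with probability 1 - p
x = (rng.random(100_000) < p).astype(int)

print(x.mean())                # close to E(X) = p
print(x.var())                 # close to V(X) = p * (1 - p)
print(p * (1 - p))             # theoretical variance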
UNIFORM DISTRIBUTION

 When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely, and that is the basis of a uniform distribution.
 Unlike Bernoulli Distribution, all the n number of
possible outcomes of a uniform distribution are equally
likely.
 A variable X is said to be uniformly distributed if its density function is:
f(x) = 1/(b-a) for a ≤ x ≤ b, and 0 otherwise
 The shape of the uniform distribution curve is rectangular, which is the reason the uniform distribution is also called the rectangular distribution.
For a uniform distribution, a and b are the parameters.
EXAMPLE
 The number of bouquets sold daily at a flower shop is
uniformly distributed with a maximum of 40 and a
minimum of 10.
 Let’s try calculating the probability that the daily sales
will fall between 15 and 30.
 The probability that daily sales will fall between 15 and
30 is (30-15)*(1/(40-10)) = 0.5
 The mean and variance of X following a uniform distribution are:
Mean -> E(X) = (a+b)/2
Variance -> V(X) = (b-a)²/12

 The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform density is given by:
f(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise
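
The bouquet example and the formulas above can be checked with a few lines of arithmetic; this sketch simply evaluates the uniform probability, mean and variance for a = 10, b = 40:

# Uniform(a, b) with a = 10, b = 40 (daily bouquet sales)
a, b = 10, 40

# P(15 <= X <= 30) = (30 - 15) * 1/(b - a)
prob = (30 - 15) * (1.0 / (b - a))
print(prob)                 # 0.5

print((a + b) / 2)          # mean  E(X) = (a + b)/2 = 25
print((b - a) ** 2 / 12)    # variance V(X) = (b - a)^2 / 12 = 75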
NORMAL DISTRIBUTION

 A distribution is known as a normal distribution if it has the following characteristics:
 The mean, median and mode of the distribution coincide.
 The curve of the distribution is bell-shaped and symmetrical about the line x = μ.
 The total area under the curve is 1.
 Exactly half of the values are to the left of the center and the other half to the right.


 A normal distribution is quite different from a binomial distribution. However, as the number of trials approaches infinity, the shapes become quite similar.
 The PDF of a random variable X following a normal distribution is given by:
f(x) = (1/(σ√(2π))) · e^(-(x-µ)²/(2σ²))
The mean and variance of a random variable X that is normally distributed are given by:
Mean -> E(X) = µ
Variance -> Var(X) = σ^2
Here, µ (mean) and σ (standard deviation) are the parameters.


GRAPH
 The graph of a random variable X ~ N (µ, σ) is shown
below.
 A standard normal distribution is defined as the
distribution with mean 0 and standard deviation 1.
  For such a case, the PDF becomes f(x) = (1/√(2π)) · e^(-x²/2).
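
A minimal NumPy sketch of the normal PDF, written directly from the formula above; setting µ = 0 and σ = 1 gives the standard normal density:

import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # PDF of N(mu, sigma^2) evaluated at x
    coef = 1.0 / (sigma * np.sqrt(2 * np.pi))
    return coef * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(normal_pdf(0.0))                    # peak of the standard normal, about 0.3989
print(normal_pdf(1.0, mu=0.0, sigma=2.0))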
INTRODUCTION TO LINEAR ALGEBRA
 Linear algebra is a branch of mathematics, but the truth
of it is that linear algebra is the mathematics of data.
 Matrices and vectors are the language of data. Linear
algebra is about linear combinations. That is, using
arithmetic on columns of numbers called vectors and
arrays of numbers called matrices, to create new
columns and arrays of numbers.
 Linear algebra is the study of lines and planes, vector spaces, and the mappings that are required for linear transforms.
 A linear equation is just a series of terms and mathematical operations where some
terms are unknown; for example:
y = 4 × x + 1
 Equations like this are linear in that they describe a line on a two-dimensional graph.

 The line comes from plugging in different values into the unknown x to find out what
the equation or model does to the value of y.
 We can line up a system of equations with the same form with two or more
unknowns; for example:
y = 0.1 × x1 + 0.4 × x2
y = 0.3 × x1 + 0.9 × x2
y = 0.2 × x1 + 0.3 × x2
· · ·
 The column of y values can be taken as a column vector of outputs from the equation.

 The two columns of values are the data columns, say a1 and a2, and can be taken as a matrix A.
 The two unknown values x1 and x2 can be taken as the coefficients of the equation
and together form a vector of unknowns b to be solved.
 We can write this compactly using linear algebra notation as:

y = A · b
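
A minimal NumPy sketch of this notation, using a hypothetical 3 × 2 data matrix A and coefficient vector b: the forward product gives y = A · b, and given A and y the coefficients can be recovered with a least-squares solve:

import numpy as np

# hypothetical data matrix A (3 observations, 2 inputs) and coefficients b
A = np.array([[0.1, 0.4],
              [0.3, 0.9],
              [0.2, 0.3]])
b = np.array([2.0, -1.0])

# forward: y = A . b
y = A.dot(b)
print(y)

# inverse: recover b from A and y with least squares
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b_hat)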
NUMERICAL LINEAR ALGEBRA
 The application of linear algebra in computers is often called
numerical linear algebra.
“Numerical” linear algebra is really applied linear algebra.
 It is more than just the implementation of linear algebra
operations in code libraries; it also includes the careful
handling of the problems of applied mathematics, such as
working with the limited floating point precision of digital
computers.
 Computers are good at performing linear algebra
calculations, and much of the dependence on Graphical
Processing Units (GPUs) by modern machine learning
methods such as deep learning is because of their ability to
compute linear algebra operations fast.
LINEAR ALGEBRA AND STATISTICS
 Linear algebra is a valuable tool in other branches of
mathematics, especially statistics.
 Use of vector and matrix notation, especially with multivariate statistics.
 Solutions to least squares and weighted least squares, such as for linear regression.
 Estimates of mean and variance of data matrices.
 The covariance matrix that plays a key role in multivariate Gaussian distributions.
 Principal component analysis for data reduction that draws many of these elements together.
EXAMPLES OF LINEAR ALGEBRA IN
MACHINE LEARNING
Linear algebra is a sub-field of mathematics concerned
with vectors, matrices and linear transforms.
 It is a key foundation to the field of machine learning
from notations used to describe the operation of
algorithms, to the implementation of algorithms in code.
CONCRETE EXAMPLES OF LINEAR
ALGEBRA IN MACHINE LEARNING.
They are:
1. Dataset and Data Files
2. Images and Photographs
3. One Hot Encoding
4. Linear Regression
5. Principal Component Analysis
6. Singular-Value Decomposition
7. Recommender Systems
8. Deep Learning
DATASET AND DATA FILES
 In machine learning, you fit a model on a dataset.
 This is the table-like set of numbers where each row represents an observation and each column represents a feature of the observation.
 For example, consider the Iris flowers dataset, a table of flower measurements.
 This data is in fact a matrix, a key data structure in linear
algebra.
 Further, when you split the data into inputs and outputs
to fit a supervised machine learning model, such as the
measurements and the flower species, you have a matrix
(X) and a vector (y).
 The vector is another key data structure in linear algebra.
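
A short sketch, assuming scikit-learn is installed, showing the Iris data as a matrix X of measurements and a vector y of species labels:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # matrix: 150 rows (observations) x 4 columns (features)
y = iris.target    # vector: 150 class labels

print(X.shape)     # (150, 4)
print(y.shape)     # (150,)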
IMAGES AND PHOTOGRAPHS
 Perhaps you are more used to working with images or
photographs in computer vision applications.
 Each image that you work with is itself a table structure
with a width and height and one pixel value in each cell
for black and white images or 3 pixel values in each cell
for a color image.
 A photo is yet another example of a matrix from linear
algebra.
 Operations on the image, such as cropping, scaling,
shearing and so on are all described using the notation
and operations of linear algebra.
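
A minimal sketch using a small synthetic grayscale image: the image is just a NumPy matrix, and operations such as cropping and intensity scaling are ordinary array operations:

import numpy as np

rng = np.random.default_rng(2)

# a synthetic 8 x 8 grayscale image: one pixel intensity per cell
image = rng.integers(0, 256, size=(8, 8))

crop = image[2:6, 2:6]      # cropping is just matrix slicing
bright = image * 1.5        # scaling pixel intensities is elementwise arithmetic

print(image.shape, crop.shape, bright.dtype)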
ONE HOT ENCODING
 Sometimes you work with categorical data in machine learning.
 Perhaps the class labels for classification problems, or perhaps
categorical input variables.
 It is common to encode categorical variables to make them easier to work with and learn from for some techniques.
 A popular encoding for categorical variables is the one hot
encoding.
 A one hot encoding is where a table is created to represent the
variable with one column for each category and a row for each
example in the dataset.
 A check or one-value is added in the column for the categorical
value for a given row, and a zero-value is added to all other
columns.
 For example, consider a color variable with 3 rows.

Each row is encoded as a binary vector, a vector with zero or one values, and this is an example of a sparse representation, a whole sub-field of linear algebra.
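
A minimal sketch of a one hot encoding, assuming a hypothetical color variable with the three rows red, green and blue, built by hand with NumPy:

import numpy as np

colors = ["red", "green", "blue"]          # the 3 rows of the variable
categories = sorted(set(colors))           # one column per category

# one row per example, one column per category
encoded = np.zeros((len(colors), len(categories)), dtype=int)
for i, value in enumerate(colors):
    encoded[i, categories.index(value)] = 1

print(categories)
print(encoded)    # each row is a sparse binary vector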
LINEAR REGRESSION
 Linear regression is an old method from statistics for
describing the relationships between variables.
 It is often used in machine learning for predicting numerical
values in simpler regression problems.
 There are many ways to describe and solve the linear
regression problem, i.e. finding a set of coefficients that
when multiplied by each of the input variables and added
together results in the best prediction of the output variable.
 Even the common way of summarizing the linear regression
equation uses linear algebra notation:
 y = A · b
where y is the output variable, A is the dataset, and b is the vector of model coefficients.
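
A minimal least-squares sketch of y = A · b, using hypothetical inputs and outputs: the data matrix A gets a column of ones for the intercept, and the coefficient vector b is found with NumPy's lstsq:

import numpy as np

# hypothetical inputs and outputs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# data matrix A with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# solve y = A . b for the coefficients b
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)            # [intercept, slope]
print(A.dot(b))     # predictions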
PRINCIPAL COMPONENT ANALYSIS
 Often a dataset has many columns, perhaps tens, hundreds,
thousands or more.
 Modeling data with many features is challenging, and models
built from data that include irrelevant features are often less
skillful than models trained from the most relevant data.
 Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction, and perhaps the most popular method is principal component analysis, or PCA for short.
 This method is used in machine learning to create projections of
high-dimensional data for both visualization and for training
models.
 The core of the PCA method is a matrix factorization from linear algebra; the eigendecomposition can be used, as sketched below.
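
A minimal sketch of that linear-algebra core, using hypothetical random data: center the data, form the covariance matrix, take its eigendecomposition and project onto the top components:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))                 # hypothetical data: 20 rows, 3 features

X_centered = X - X.mean(axis=0)              # center each column
C = np.cov(X_centered, rowvar=False)         # 3 x 3 covariance matrix

values, vectors = np.linalg.eigh(C)          # eigendecomposition (symmetric matrix)
order = np.argsort(values)[::-1]             # sort components by variance explained
components = vectors[:, order[:2]]           # keep the top 2 principal directions

projected = X_centered.dot(components)       # reduced 20 x 2 representation
print(projected.shape)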
SINGULAR-VALUE DECOMPOSITION
 Another popular dimensionality reduction method is the
singular-value decomposition method or SVD for short.
 As mentioned and as the name of the method suggests, it
is a matrix factorization method from the field of linear
algebra.
 It has wide use in linear algebra and can be used directly
in applications such as feature selection, visualization,
noise reduction and more.
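
A short NumPy sketch of the SVD: factorize a matrix into U, the singular values and V^T, then rebuild a low-rank approximation of the kind used for data and noise reduction:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

U, s, Vt = np.linalg.svd(A)

# rank-1 approximation: keep only the largest singular value
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(s)     # singular values in decreasing order
print(A1)    # low-rank reconstruction of A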
RECOMMENDER SYSTEMS
 Predictive modeling problems that involve the
recommendation of products are called recommender
systems, a sub-field of machine learning.
 Examples include the recommendation of books based on
previous purchases and purchases by customers like you on
Amazon, and the recommendation of movies and TV shows
to watch based on your viewing history and viewing history
of subscribers like you on Netflix.
 The development of recommender systems is primarily
concerned with linear algebra methods.
 A simple example is in the calculation of the similarity
between sparse customer behavior vectors using distance
measures such as Euclidean distance or dot products.
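
A minimal sketch of that similarity calculation, using two hypothetical customer behaviour vectors compared with a dot product and a Euclidean distance:

import numpy as np

# hypothetical sparse customer behaviour vectors (1 = purchased item)
customer_a = np.array([1, 0, 0, 1, 0, 1, 0, 0])
customer_b = np.array([1, 0, 1, 1, 0, 0, 0, 0])

dot_similarity = customer_a.dot(customer_b)            # shared purchases
euclidean = np.linalg.norm(customer_a - customer_b)    # distance: smaller = more similar

print(dot_similarity, euclidean)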
DEEP LEARNING
 Deep learning is the recently resurgent use of artificial neural networks with newer methods and faster hardware that allow the development and training of larger and deeper (more layers) networks on very large datasets.
 At their core, the execution of neural networks involves linear
algebra data structures multiplied and added together.
 Scaled up to multiple dimensions, deep learning methods work with vectors, matrices and even tensors of inputs and coefficients, where a tensor is a generalization of a matrix to more than two dimensions.
 Linear algebra is central to deep learning, from the description of methods via matrix notation to their implementation in libraries such as Google’s TensorFlow.
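
A minimal sketch of that "multiplied and added together" core: a single dense layer is just a matrix-vector product plus a bias vector, followed by a nonlinearity (ReLU here):

import numpy as np

rng = np.random.default_rng(4)

x = rng.normal(size=4)            # input vector (4 features)
W = rng.normal(size=(3, 4))       # weight matrix: 3 units, 4 inputs
b = np.zeros(3)                   # bias vector

z = W.dot(x) + b                  # the linear algebra at the core of the layer
a = np.maximum(z, 0.0)            # ReLU activation
print(a)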
EIGENVALUES AND EIGENVECTORS
 A vector is an eigenvector of a matrix if it satisfies the following
equation.
A · v = λ · v
 This is called the eigenvalue equation, where A is the parent square
matrix that we are decomposing, v is the eigenvector of the matrix, and
λ is the lowercase Greek letter lambda and represents the eigenvalue
scalar.
 A matrix could have one eigenvector and eigenvalue for each dimension
of the parent matrix.
 Not all square matrices can be decomposed into eigenvectors and
eigenvalues, and some can only be decomposed in a way that requires
complex numbers.
 The parent matrix can be shown to be a product of its eigenvectors and eigenvalues:
A = Q · Λ · Q^T
where Q is the matrix whose columns are the eigenvectors and Λ is a diagonal matrix of the eigenvalues. (The transpose applies when A is symmetric; for a general square matrix the inverse Q^-1 is used instead.)
CALCULATION OF
EIGENDECOMPOSITION
 An eigendecomposition is calculated on a square matrix
using an efficient iterative algorithm, of which we will
not go into the details.
 Often an eigenvalue is found first, then an eigenvector is
found to solve the equation as a set of coefficients.
 The eigendecomposition can be calculated in NumPy
using the eig() function.
 The example below first defines a 3 × 3 square matrix.
The eigendecomposition is calculated on the matrix
returning the eigenvalues and eigenvectors.
CODE:
# eigendecomposition
from numpy import array
from numpy.linalg import eig
# define matrix
A = array([ [1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
# factorize
values, vectors = eig(A)
print(values)
print(vectors)
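
As a follow-up sketch, the original matrix can be rebuilt from the factors returned by eig(); because this particular A is not symmetric, the reconstruction uses the inverse of the eigenvector matrix rather than its transpose:

# reconstruct A from its eigendecomposition: A = Q . diag(values) . Q^-1
from numpy import array, diag
from numpy.linalg import eig, inv

A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
values, vectors = eig(A)

B = vectors.dot(diag(values)).dot(inv(vectors))
print(B)   # matches the original matrix A up to floating point error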
CONFIRM AN EIGENVECTOR AND
EIGENVALUE
 We can confirm that a vector is indeed an eigenvector of
a matrix.
 We do this by multiplying the matrix by the candidate eigenvector and comparing the result with the same eigenvector scaled by its eigenvalue.
 First, we will define a matrix, then calculate the eigenvalues and eigenvectors. We will then test whether the first vector and value are in fact an eigenvector and eigenvalue for the matrix.
CODE
# confirm eigenvector
from numpy import array
from numpy.linalg import eig
# define matrix
A = array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])
# factorize
values, vectors = eig(A)
# confirm first eigenvector
B = A.dot(vectors[:, 0])
print(B)
C = vectors[:, 0] * values[0]
print(C)
• The example multiplies the original matrix with the first eigenvector and compares it to the first eigenvector multiplied by the first eigenvalue.
• Running the example prints the results of these two multiplications, which show the same resulting vector, as we would expect.