MachineLearningNotes PDF
Source: https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
Available Modules: 2, 3, 4, 5, 7, 8
Module 1 is just Python introduction, Module 6 and 9 have case studies and live
sessions; Case studies and live sessions are not covered in this notebook;
Module 1: Fundamentals of Programming
Link: https://www.youtube.com/watch?v=hbUJ6nd-9lA
https://repo.continuum.io/archive/
We can’t use a keyword as variable name, function name or any other identifier;
The keyword module lists all keywords: import keyword; print(keyword.kwlist)
Example: False, None, True, class, if, else, return, def, try, while, for, etc
Identifiers:
Can be a combination of letters, digits and underscores, cannot start with a digit
Indentations are used (4 spaces preferred) to make blocks of code, a for loop
Rather than writing code in a single line try to write in multiple lines (can use \) to make
code readable
Variable is a location in memory used to store some data; Variable declaration is not
needed
a, b = 10, 'Hi'
Data types:
List: An ordered sequence of items, like an array, can have multiple data type
elements, defined with square brackets; Lists are mutable
Tuple: Defined with parentheses, can have elements of multiple data types; a tuple is
immutable but indexable
Set: Defined with Curly braces, Set is an unordered collection of unique items;
behaves as a set in mathematics; does not support indexing
Data types can be converted provided the value is valid in both data types;
list('Hello') = ['H', 'e', 'l', 'l', 'o']
2.6 Standard input and output
Output: print()
print('{} {}'.format(a, b))
Input: input()
2.7 Operators
Operators are special symbols in python that allow arithmetic or logic computation.
-15//2 = -8
Bitwise:
a= 10, b = 4:
a | b: 1110 = 14
a >> b: 0000 = 0 (1010 shifted right by 4 bits)
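A minimal runnable sketch (values taken from the example above) to verify these bitwise results:
a, b = 10, 4
print(bin(a), bin(b))   # 0b1010 0b100
print(a | b)            # bitwise OR: 1110 -> 14
print(a & b)            # bitwise AND: 0000 -> 0
print(a >> b)           # shift 1010 right by 4 bits -> 0
print(a << 2)           # shift left by 2 bits -> 40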
Assignment operator:
a += 10: a = a + 10
Identity operators:
is, is not
Membership operators:
In, not in
if test_expression:
    statement(s)
Example:
num = 10
if num > 0:
    print("number is positive")
elif num == 0:
    print("zero")
else:
    print("number is negative")
print("always printed")   # executed regardless of which branch ran
index = 0
product = 1
while index < len(lst):
    product *= lst[index]
    index += 1
The same with a for loop:
for ele in lst:
    product *= ele
A loop can be exited early with break:
if condition:
    break
Chapter 3: Python for Data Science: Data Structures
3.1 Lists
List: Sequence data structures, these are indexable, mutable, defined by square brackets
and elements are comma separated
Operations on list:
3.4 Sets
S = {1, 2, 3}
set([1, 2, 3, 1]) = {1, 2, 3}
3.5 Dictionary
Dictionary is mutable;
Dictionary Comprehension:
Strings: s = "kl" or s = str(1)
Palindrome:
my_str = "MaDam"
my_str = my_str.lower()
rev_str = reversed(my_str)
if list(my_str) == list(rev_str):
    print("Palindrome")
else:
    print("Not palindrome")
Alphabetic sort:
words = sorted(my_str.split())   # list.sort() returns None, so use sorted() when chaining
Chapter 4: Python for Data Science: Functions
4.1 Introduction
def function():
    '''
    Doc string
    '''
    statements
    return
Scope and Life Time of Variables: Portion of the code where the variable is
recognized and Lifetime is the period throughout
which the variable exists in memory
Variable inside a function are local variables which are destroyed once the function
finishes execution; Global variables are not destroyed unless deleted;
Example (HCF of two numbers):
def compute_hcf(x, y):
    """
    Compute the highest common factor of x and y
    """
    hcf = 1
    for i in range(1, min(x, y) + 1):
        if x % i == 0 and y % i == 0:
            hcf = i
    return hcf
Built-in: abs(), all(), any(), dir(), divmod(), enumerate(), filter(), map(), reduce(),
isinstance(),
def PowerOfTwo(num):
    return num**2
map(PowerOfTwo, lst)
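A small sketch of the map/filter/reduce built-ins mentioned above (the helper name power_of_two and the list nums are assumptions for illustration):
from functools import reduce   # reduce lives in functools in Python 3

def power_of_two(num):
    return num ** 2

nums = [1, 2, 3, 4]
print(list(map(power_of_two, nums)))             # [1, 4, 9, 16]
print(list(filter(lambda x: x % 2 == 0, nums)))  # [2, 4]
print(reduce(lambda a, b: a + b, nums))          # 10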
Default arguments: to give default values to a function; have default arguments at the
end
Arbitrary arguments: Used when number of arguments are unknown given as input to
the function;
Factorial(n) = n*Factorial(n-1)
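A minimal sketch of the recursion above, with the base case added:
def factorial(n):
    if n <= 1:        # base case: 0! = 1! = 1
        return 1
    return n * factorial(n - 1)

print(factorial(5))   # 120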
def Double(x):
    return x * 2
4.6 Modules
Module: example.py
import example
- import math
- math.pi
import math as m
import datetime
4.7 Packages
file = open('example.txt')
'a' mode: appending
Closing a file: file.close()
import os
os.mkdir('test')
import shutil
shutil.rmtree('test')  # removes non-empty folders
Use:
Raising exceptions:
The finally block always runs, whether or not an exception occurred; it is typically used
for clean-up such as closing a file, so that data is saved even if the write fails.
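A minimal sketch of a finally block used for clean-up (the file name example.txt is assumed):
f = open('example.txt', 'a')     # 'a' opens the file for appending
try:
    f.write('new line of text\n')
finally:
    f.close()                    # runs even if write() raises, so the file handle is released

with open('example.txt', 'a') as f:   # the with statement does the same clean-up automatically
    f.write('another line\n')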
EDA: Simple analysis to understand data; tools include statistics, linear algebra and
plotting tools
IRIS dataset: Hello World of Data Science; Collected in 1936; Classify a flower into 3
classes;
Use plotly;
We can write straight forward rules (if else) to separate one of the class labels;
10.4 Limitations of Pair plots
Uni-Variate analysis:
Histograms:
Differentiate the CDF to get the PDF and integrate the PDF to get the CDF; convert data into bins
(histogram) to get the PDF, and use np.cumsum to get the CDF;
We can get intersections of CDF levels on x axis above to know how many points of each
class intersect
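A minimal sketch of getting a PDF and CDF from binned data with np.histogram and np.cumsum (the sample array x is assumed):
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(loc=0.0, scale=1.0, size=1000)   # assumed sample data

counts, bin_edges = np.histogram(x, bins=20, density=True)
pdf = counts / counts.sum()     # normalized bin heights approximate the PDF
cdf = np.cumsum(pdf)            # running sum of the PDF gives the CDF

plt.plot(bin_edges[1:], pdf, label='pdf')
plt.plot(bin_edges[1:], cdf, label='cdf')
plt.legend()
plt.show()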
10.9 Median
Median does not get corrupted due to presence of outliers; better statistic for central
tendency;
Sort the list in an increasing order and pick the middle value;
50th percentile: the value at which 50% of values are less than this value
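A minimal sketch of median and percentiles with numpy (the heights array is assumed):
import numpy as np

heights = np.random.normal(165, 5, size=101)    # assumed sample of heights

print(np.median(heights))                 # 50th percentile
print(np.percentile(heights, 50))         # same value
print(np.percentile(heights, [25, 75]))   # quartiles; their difference is the IQR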
10.11 IQR (Inter Quartile Range) and MAD (Median Absolute Deviation)
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the
distribution of data based on the five number summaries: minimum, first quartile,
median, third quartile, and maximum. (Google Search)
A violin plot is a method of plotting numeric data. It is similar to a box plot, with the
addition of a rotated kernel density plot on each side. Violin plots are similar to box
plots, except that they also show the probability density of the data at different values,
usually smoothed by a kernel density estimator.
Write conclusion at end of each step or plots during data analysis, data analysis should
align with project objective;
Univariate: Analysis considering only one variable (PDF, CDF, Box-plot, Violin plots)
Combining probability densities of two variables; dense regions are darker as if a hill coming out
Chapter 11: Linear Algebra
We will apply it to solve specific type of problems in ML; we will learn things in 2d and
3d and extend it to nd
11.2 Introduction to Vectors (2-D, 3-D, n-D), Row Vector and Column Vector
Point:
Dot product: a.b = a1b1 + a2b2 + ….. anbn = aT * b (by default every vector is a column
vector)
If the dot product of two vectors is zero then the two vectors are orthogonal.
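A minimal numpy sketch of the dot product and the orthogonality check (vectors a and b are assumed):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([-2.0, 1.0, 0.0])

dot = np.dot(a, b)        # a1*b1 + a2*b2 + a3*b3 = 0.0
print(dot, dot == 0.0)    # zero dot product => a and b are orthogonal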
11.5 Equation of a line (2-D), Plane (3-D) and Hyperplane (n-D), Plane passing through
origin, Normal to a Plane
Plane 3D: ax + by + cz + d = 0
Hyperplane: w0 + wᵀx = 0, with w = [w1, w2, ..., wn] and x = [x1, x2, ..., xn]; w0 determines
the intercept of the hyperplane (for the hyperplane to pass through the origin this intercept must be 0)
π: wᵀx = 0 is a hyperplane passing through the origin
Distance of a point P from the plane π: d = |wᵀP| / ||w||
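A minimal sketch of the distance formula above for a plane through the origin (w and P are assumed values):
import numpy as np

w = np.array([1.0, 2.0, 2.0])    # normal vector of the plane w^T x = 0
p = np.array([3.0, 0.0, 3.0])    # a point P

d = np.abs(np.dot(w, p)) / np.linalg.norm(w)
print(d)                         # |w^T p| / ||w|| = 9 / 3 = 3.0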
A hyper plane divides the whole space into two half spaces;
Given a point: if its distance from the center is less than the radius of a circle
then the point lies inside the circle;
If x1² + x2² < r², the point p(x1, x2) lies inside the circle centred at the origin;
If x1² + x2² > r², the point p(x1, x2) lies outside the circle;
If a1 < x1 < a2 and b1 < y1 < b2, the point P(x1, y1) lies inside the axis-parallel rectangle;
if a2 - a1 = b2 - b1 we get a square.
This can be extended to 3D:
If a1 < x < a2, b1 < y < b2 and c1 < z < c2, the point P(x, y, z) lies inside the axis-parallel
cuboid; if a2 - a1 = b2 - b1 = c2 - c1 we get a cube;
Datasets with well separable classes do not require ML methods to be applied, but
classes that are well mixed cannot be separated easily;
The data points that lie in the mixed region cannot be clearly stated to belong to a
certain class; instead we can have probability values of the data point belonging to each
class;
Histograms, PDF, CDF, mean, variance, etc come under probability and statistics
Ex: Dice with 6 sides: Roll a fair dice: the 6 outcomes of the dice are equally likely
r.v. X = {1, 2, 3, 4, 5, 6}
Outliers: 92.6 and 12.26: may have occurred due to input error or data collection
error or may be a genuine value but does not indicate general trend of the
population
Collecting all the population height data is not possible; we will take a random sample
(subset of the population)
Real world variables mostly follow Gaussian distribution; Height, Weight, Length
N(μ, σ²);
If the PDF is a mirror image on either side of the mean, then the distribution is symmetric
about the mean, as in the std-dev plot above;
Sample Skewness:
Kurtosis:
This will give us an idea of outliers in the distribution; Smaller the better
Mean centering and scaling: (X - mean) / std dev; this helps in understanding the distribution
12.7 Kernel density estimation
At every point in the range of the Gaussian Kernels we will add all PDF values to get a
combined PDF; bandwidth selection is done by experts;
Pick m random Samples independently of size n each S1, S2, S3, …., Sm
If the plot is approximately a straight line then the distributions are similar;
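A minimal Q-Q plot sketch with scipy (the sample x is assumed):
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.random.normal(0, 1, 500)            # assumed sample
stats.probplot(x, dist='norm', plot=plt)   # quantiles of x vs quantiles of N(0, 1)
plt.show()                                 # an approximately straight line => approximately Gaussian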
Ex: Imagine your task is to order T shirts for all employees at your company. Let sizes be
S, M, L and XL. Say we have 100k employees who have different size requirements;
Let us have a relationship between heights and T shirt size; Domain knowledge;
Collect heights from 500 random employees; Compute mean and std dev
Q) Salaries: If salaries are Gaussian distributed, we can estimate how many employees
make a salary > 100k $;
If we don’t know the distribution and mean is finite and standard deviation is non-zero
and finite;
Salaries of individuals (millions of values); distribution is unknown, but mean and std
dev are known;
Parameters (a,b,n=b-a+1)
PMF = 1/n
Variance = ((b - a + 1)² - 1) / 12
Skewness = 0
Continuous Uniform:
Parameter: a,b
PDF: 1/(b-a)
Variance = (b - a)² / 12
When a distribution follows a power law then the distribution is called Pareto
distribution;
Power transform:
Box-Cox(x) finds the power λ with which x is transformed towards a Gaussian;
Box-cox does not work always. Use QQ-plot to check its results.
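A minimal Box-Cox sketch with scipy (the positive-valued sample x is assumed):
import numpy as np
from scipy import stats

x = np.random.pareto(a=2.0, size=1000) + 1.0   # assumed power-law-like positive data
xt, fitted_lambda = stats.boxcox(x)            # transformed data and the fitted lambda
print(fitted_lambda)
# check xt with a Q-Q plot against the normal distribution, as noted above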
A well studied distribution gives a theoretical model for the behavior of a random
variable
Weibull distributions:
The upstream rainfall determines the height of a dam which stands for 100s of
years without repairs; Probability of rainfall > a value is required;
This distribution is applied to extreme events such as annual maximum one day
rainfalls and river discharges.
12.19 Co-variance
Cov(x, x) = var(x)
Spearman rank corr coeff: this is Pearson correlation coefficient of ranks of the
random variables; Spearman correlation coefficient is
much more robust to outliers than Pearson Corr coef
This allows us to understand whether two random variables increase at the same time;
[Figure: Pearson vs Spearman correlation coefficients on example data]
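A minimal scipy sketch contrasting the two coefficients (x and y are assumed toy variables):
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(100, dtype=float)
y = x ** 2 + np.random.normal(0, 50, size=100)   # monotonic but non-linear relationship

print(pearsonr(x, y)[0])    # below 1: captures only the linear part of the relationship
print(spearmanr(x, y)[0])   # close to 1: rank-based, captures the monotonic relationship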
12.22 Correlation vs Causation
As x increased y increased;
Causations are studied with causal models which is a separate field of study;
Useful for education ministry to encourage people to get more years of study;
Q. Is time spent on web page in last 24 hours correlated with money spent in the next
24 hours?
Useful for ecommerce to encourage people to spend more time on the website;
Q. Is number of unique visitors to the website correlated with the $ sales in a day?
The company will then take measures to increase number of unique user in a day;
Distribution of X is unknown;
The sample mean x̄, calculated over a randomly selected sample of the population, is a point
estimate of the population mean μ;
Say we have 170 cm as the sample mean; we can say that the sample mean lies between 165
and 175 cm for 95% of the sampling experiments when we repeat the sampling multiple
times; for 5% of the experiments the mean will not fall in this interval; this does not
mean that the population mean lies in the interval with 95% probability;
We can get 95% CI from Gaussian distribution characteristics which range from μ - 2σ to
μ + 2σ, which is 158 to 178 cm;
What is the 95% CI for population mean; the distribution can be anything;
CLT: sample mean x_bar follows Gaussian distribution with mean = population
mean and standard deviation = population standard deviation/sqrt(sample size);
We can say that μ ∈ [x_bar - 2σ/sqrt(n), x_bar + 2σ/sqrt(n)] with 95% confidence;
x_bar = 168.5
n = 10
Population Mean lies between 165.34 and 171.66 with 95% confidence;
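A minimal sketch reproducing this interval under the 2-sigma rule (x_bar and n are from the example above; σ = 5 is assumed):
import numpy as np

x_bar, sigma, n = 168.5, 5.0, 10        # sample mean, population std-dev, sample size
margin = 2 * sigma / np.sqrt(n)         # ~95% interval via the CLT
print(x_bar - margin, x_bar + margin)   # approximately 165.34 and 171.66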
ν = sample size - 1 (degrees of freedom)
We have a CI for the mean; what if we would like a CI for the standard deviation, median,
90th percentile or other statistics?
12.27 Confidence interval using bootstrapping
Generate k samples;
This is a non parametric technique to estimate CI for a statistic; and this process is called
as bootstrap;
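A minimal bootstrap sketch for a 95% CI of the median (the observed sample and k are assumed):
import numpy as np

heights = np.random.normal(168, 5, size=50)   # assumed observed sample
k = 1000                                      # number of bootstrap samples

medians = []
for _ in range(k):
    resample = np.random.choice(heights, size=len(heights), replace=True)
    medians.append(np.median(resample))

medians = np.sort(medians)
print(medians[int(0.025 * k)], medians[int(0.975 * k)])   # empirical 95% CI for the median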
12.28 Hypothesis testing methodology, Null-hypothesis, p-value
Let us have 2 classes; and let the number of students be 50; let us have height values;
Say we plot histograms and saw that mean of class 2 is greater than mean of class 1;
Hypothesis testing:
Say we have p value = 0.9 for probability of observing a mean difference of 10cm
when H0 is true;
If the p-value is higher than a threshold level (generally 5%) we accept (fail to reject)
the null hypothesis, else we reject the null hypothesis; the p-value does not tell us the
probability that the alternate hypothesis is true, nor the probability that the null
hypothesis is true;
The p-value is the probability of observing a test statistic at least as extreme as the
one observed, assuming the null hypothesis is true;
Example: Given a coin determine if the coin is biased towards heads or not;
Design an experiment: flip a coin 5 times and count # of heads = X random variable;
Perform experiment: say we got 5 heads; the test statistic is the observed value X = 5;
P(X = 5 | fair coin) = (1/2)^5 ≈ 3%, which is below 5%, so we reject that the coin is unbiased and accept that the coin is biased;
We have sample size as a choice; we have 5 coin flips, we can have 3 flips, 10 flips or 100
flips; and can perform the experiment;
P-value is the probability of observation given H0; we cannot say anything about
probability of H0 being true;
We have 2 classes with 50 height observations each; Take means and compute the
difference between the means, let this difference be D;
We then mix all height values into a 100 value vector and randomly sample 50 points
from the 100 points to form X vector and rest 50 into Y vector;
We have mean of X and mean of Y: the difference of these means is the mean difference
of Sample 1;
Similarly we will generate n samples and compute mean difference; we will have n
mean differences; (Note to make random sampling for picking 50 points)
Let n = 10 000;
The mean differences are sorted; We will now have D1, D2, …., D10000 in sorted order;
Now place the original mean difference before sampling in the sorted means list;
Computing the percentage of Di values that are greater D will generate a p-value;
Say D ~ D9500; then we have 500 D values (from D9501 to D10000) that are greater
than D; thus we have 5% of values; hence p-value = 5%; we can compute p-value with
this process;
Why is this percentage considered as p-value; Initially while jumbling we assumed that
there is no difference in the means of the 2 classes or 2 height lists; this is the
assumption of null hypothesis; then we have experimented with sampling for 10000
iterations;
If the percentage of sorted sample differences that are greater than the original mean
difference D is less than the threshold value of 5%, then such a large difference is unlikely
to arise from random mixing; this implies that the null hypothesis should be rejected;
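A minimal sketch of this resampling (permutation) test (the two height samples are assumed):
import numpy as np

class1 = np.random.normal(150, 10, 50)    # assumed heights of class 1
class2 = np.random.normal(160, 10, 50)    # assumed heights of class 2

D = np.mean(class2) - np.mean(class1)     # observed test statistic

pooled = np.concatenate([class1, class2])
n_sim, count = 10000, 0
for _ in range(n_sim):
    np.random.shuffle(pooled)                            # jumbling assumes H0: no difference
    d_sim = np.mean(pooled[50:]) - np.mean(pooled[:50])
    if d_sim >= D:
        count += 1

p_value = count / n_sim    # fraction of simulated differences >= observed D
print(D, p_value)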
Kolmogorov-Smirnov test:
Null hypothesis: the two random variables come from same distribution;
Normal distribution:
Uniform distribution:
P-value is low thus X does not follow Normal distribution (we already know that X
follows Uniform distribution)
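A minimal K-S test sketch with scipy (X is assumed to be Uniform(0, 1) samples):
import numpy as np
from scipy.stats import kstest

x = np.random.uniform(0, 1, 1000)

print(kstest(x, 'norm'))      # low p-value: reject H0 that X comes from N(0, 1)
print(kstest(x, 'uniform'))   # high p-value: cannot reject H0 that X is Uniform(0, 1)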
12.33 Hypothesis testing: another example
Difference of means:
Given heights list of two cities, determine if the population means of heights in these
two cities is same or not;
Experiment: We cannot collect the whole population data; we will take a sample; say we
collect height of 50 random people from both the cities;
Compute sample means from both cities μA (let = 162 cm) and μB (let = 167 cm); these
are sample means as population data collection is infeasible;
Alternative hypothesis: μB – μA != 0
As the p-value of 20% is greater than the 5% significance level, we fail to reject (accept)
the null hypothesis;
We have:
Step 2: From set S, randomly select 50 heights for S1 and 50 heights for S2
Compute means of these S1 and S2 heights: μ1 and μ2; we have re-sampled the
data set; with this we are making an assumption that there is no difference in μB
and μA, which is the null hypothesis;
Compute: μ2 - μ1
Result: we will have 1000 (μ2 - μ1); sort these; these are simulated differences
under null hypothesis
Compute the percentage of simulated means that are greater than test statistic; this is
the p-value which can be checked for significance at α level;
P(observed difference|H0) = percentage of simulated means that are greater than test
statistic = p-value; if p-value > 5% accept null hypothesis else reject null hypothesis;
Observed difference can never be incorrect as it is ground truth generated from the
data; Acceptance or rejection happens with null hypothesis;
Drug testing: effectiveness of a new drug over an old drug; the claim is that the new drug
gives faster recovery from fever than the old drug;
Collect 100 patients: randomly split into two groups of 50 people each;
Administer old drug to group A and new drug to group B; record time taken for
all the 100 people to recover;
Let mean time for old drug people be 4 hours and for new drug people be 2 hrs
Mean tells that the new drug is performing well; note that the sample size is 50;
thus hypothesis is applied;
H0: Old drug and new drug take same time for recovery;
If there is no difference between the old drug and the new drug, then the probability of
observing μ_old - μ_new = 2 hours is 1%; this implies that the null hypothesis and the
observation do not agree with each other, thus the null hypothesis is rejected;
Let d: [d1, d2, d3, d4, d5] = [2.0, 6.0, 1.2, 5.8, 20.0]
Task: pick an element from the list such that the probability of picking the element is
proportional to the value;
For random selection the probability of picking any value is equal to all other values;
a. Compute the sum of the list: 35.0
b. Divide the list by the sum: [0.0571, 0.1714, 0.0343, 0.1657, 0.5714]
c. Compute the cumulative sums C_list = [0.0571, 0.2286, 0.2629, 0.4286, 1.0]
In C_list the gap between values is proportional to the original values; as random
number r is generated from a uniform distribution; the probability of picking any
value is equal to the gap between the elements in C_list in turn proportional to
the original list values;
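A minimal sketch of this sampling scheme (the list d is taken from the example above):
import numpy as np

d = [2.0, 6.0, 1.2, 5.8, 20.0]
probs = np.array(d) / np.sum(d)      # normalize so the values sum to 1
c_list = np.cumsum(probs)            # cumulative sums: gaps are proportional to the original values

r = np.random.uniform(0, 1)                  # uniform random number in [0, 1)
picked_index = np.searchsorted(c_list, r)    # first cumulative value >= r
print(d[picked_index])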
1. What is PDF?
2. What is CDF?
3. Explain about 1-std-dev, 2-std-dev, 3-std-dev range?
4. What is Symmetric distribution, Skewness and Kurtosis?
5. How to do Standard normal variate (z) and standardization?
6. What is Kernel density estimation?
7. Importance of Sampling distribution & Central Limit theorem
8. Importance of Q-Q Plot: Is a given random variable Gaussian distributed?
9. What is Uniform Distribution and random number generators
10. What are Discrete and Continuous Uniform distributions?
11. How to randomly sample data points?
12. Explain about Bernoulli and Binomial distribution?
13. What is Log-normal and power law distribution?
14. What is Power-law & Pareto distributions: PDF, examples
15. Explain about Box-Cox/Power transform?
16. What is Co-variance?
17. Importance of Pearson Correlation Coefficient?
18. Importance Spearman Rank Correlation Coefficient?
19. Correlation vs Causation?
20. What is Confidence Intervals?
21. Confidence Interval vs Point estimate?
22. Explain about Hypothesis testing?
23. Define Hypothesis Testing methodology, Null-hypothesis, test-statistic, p-value?
24. How to do K-S Test for similarity of two distributions?
Chapter 13: Interview Questions on Probability and statistics
We can visualize 2D and 3D using Scatter Plots; up to 6D we can leverage Pair Plots. For
nD > 6D we should reduce the dimensionality to make it an understandable or a
visualizable dataset.
By reducing dimensionality, we transform the features into a new set of features which
are less in number. The aim lies in preserving the variance which ensures least possible
loss of information.
x_i ∈ R^d: x_i is a d-dimensional column vector; several data points are stacked row-wise
to form the dataset.
In a matrix: each row is a data point and each column vector is a feature (preferred form
of representation, this is one type of representation scheme)
Y is a column matrix in which each row corresponds to the class that the data point x_i
belongs to.
Geometrically: Project all data points onto each feature axis. The central vector of all
these projections is the mean on each feature axis. The set of the means is the
mean vector of the data matrix.
Column Normalization: values are squished to a unit hyper-cube [0, 1], this gets rid of
scales of each feature
Column Standardization: In a data matrix, with rows as data points and columns as
features
a1, a2, a3, a4, …., an for each feature: are transformed to a1’, a2’, a3’, ….., an’ such that
ai’ = (ai – a_mean) / a_std
Geometrically: Project all data points on feature axes; we can get mean and standard
deviation of the dataset. After applying column standardization the data points
will have mean at origin and the data points are transformed to a unit standard
deviation.
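A minimal column-standardization sketch (the data matrix X is assumed):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.uniform(0, 100, size=(500, 3))   # rows = data points, columns = features

X_std = StandardScaler().fit_transform(X)      # (x - column mean) / column std-dev
print(X_std.mean(axis=0))                      # ~0 for every feature
print(X_std.std(axis=0))                       # ~1 for every feature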
S_ij = element in the ith row and jth column of the co-variance matrix, which is a square (d x d) matrix
i: 1…d, j: 1…d
cov(x, x) = var(x) -- (1)
In co-variance matrix, we have diagonal elements as variance values and the matrix is
symmetric.
Each data point: x_i = Image (28 x28), y_i = {0, 1…., 9}
(28 x 28) is converted into (784 x 1) shape; by stacking all elements from each row into a
column vector
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

d0 = pd.read_csv('train.csv')   # assumed: MNIST csv with a 'label' column and 784 pixel columns
print(d0.head())
l = d0['label']
d = d0.drop("label", axis=1)
idx = 100
grid_data = d.iloc[idx].values.reshape(28, 28)   # 784-vector back to a 28 x 28 image
plt.imshow(grid_data, interpolation="none", cmap="gray")
plt.show()
print(l[idx])
Chapter 15: PCA (principal component analysis)
1) For visualization
2) d’ < d for model training
If we want to reduce dimensionality from 2-D to 1-D, we can skip feature f1 when the spread
across f1 is much less than the spread across f2.
Another case: Let y = x be the underlying relationship between two features and let X be
column standardized, and the plot is such that there is sufficient variance on both
features. Say, we rotate the axes and reach an orientation where spread on f 2’ << f1’
(perpendicular to f2’) then we can drop f2’. So data transformation can also find
maximum variance features. This will help us reduce dimensionality.
We want to find a direction f1’ such that the variance of the data points projected onto
f1’ is maximum and skip features with minimum spread.
15.3 Mathematical objective function of PCA
u1 : Unit vector: in the direction where maximum variance of projection of x_i exists
||u1|| = 1
Projection of x_i on u1 = (u1 · x_i) / ||u1|| = u1ᵀ x_i (since ||u1|| = 1)
x_i' = u1ᵀ x_i
Variance maximization:
Distance minimization:
Co-variance matrix of X = S
Eigen value of S: λ1, λ2, λ3, λ4, …. (λ1 > λ2 > λ3 > λ4 …. )
Eigen vectors are perpendicular to each other. This implies that the dot product of any pair
of distinct Eigen vectors is zero: v_iᵀ v_j = 0 for i ≠ j.
Steps:
1. X: Column Standardized
2. S = XTX
3. λ1, v1 = eigen(S)
4. u1 = v1, the direction with maximum variance
Transform x_i = [f1_i, f2_i] to x_i' = [x_iᵀv1, x_iᵀv2]; we can drop x_iᵀv2 as there is not much
variance along v2: the 2-D data point is transformed to 1-D
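A minimal numpy sketch of these steps (the data matrix X is assumed; eigh returns eigenvalues in ascending order, so the last columns are the top directions):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)                # assumed data matrix
X = StandardScaler().fit_transform(X)      # step 1: column standardization

S = np.matmul(X.T, X)                      # step 2: co-variance matrix (up to a 1/n factor)
eig_vals, eig_vecs = np.linalg.eigh(S)     # step 3: eigen decomposition of the symmetric S

top2 = eig_vecs[:, -2:]                    # eigenvectors of the 2 largest eigenvalues
X_reduced = np.matmul(X, top2)             # step 4: project every x_i onto u1, u2
print(X_reduced.shape)                     # (1000, 2)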
https://colah.github.io/posts/2014-10-Visualizing-MNIST/
MNIST dataset:
- vectors = vectors.T
- X_new = np.matmul(vectors, X.T)   # project the data onto the selected eigenvectors
Using PCA:
- from sklearn import decomposition
- pca = decomposition.PCA()
- pca.n_components = 2
- pca_data = pca.fit_transform(sample_data)
d' = 2 or 3 is for visualization, but if d’= 200 or any other value, then we are doing data
reduction for non-visualization tasks.
Stack Eigen vectors horizontally to get Eigen matrix V. Multiply X with V to get X’ which
is dimensionally reduced. d' can be determined by checking explained variance. The task
is to maximize variance.
pca_data = pca.fit_transform(sample_data)
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
cum_var_explained = np.cumsum(percentage_var_explained)
Chapter 16: (t-SNE) T-distributed Stochastic Neighborhood Embedding
PCA – basic, old: did not perform well on MNIST dataset visualization
tSNE vs PCA:
PCA preserves the global structure of the dataset and discards local structure. tSNE also
preserves local structure of the dataset.
Neighborhood points of a data point are the points that are geometrically close
to the data point.
Embedding: For every point in the original dataset, a new point is created in
lower dimension corresponding to the original data point. Embedding means
projecting each and every input data point into another more convenient
representation space (Picking a point in high dimensional space and placing it in
a low dimensional space).
Let us assume: x1, x2, x3, x4 and x5 are some data points, and let x1, x2 and x3 be in a
neighborhood region, x4 and x5 be another neighborhood region and both
neighborhood regions are very far from each other. When we reduce the dimensionality
we follow the objective of preserving the distance between data points.
For points which are not in neighborhood the distances are not preserved.
Link: https://www.youtube.com/watch?v=NEaUSP4YerM
Link: https://www.youtube.com/watch?v=ohQXphVSEQM
KL divergence is asymmetric: gives large cost for representing nearby data points in high
dimensional space by widely separated points in low dimensional space.
Non-neighbor data points can also be seen to crowd with neighborhood data points.
Link: https://distill.pub/2016/misread-tsne/
Processes all the data in multiple iterations and tries to separate different
neighborhoods.
Parameters:
Perplexity: Run tSNE with multiple perplexity values. As perplexity increases from 1,
the neighborhood tries to get good clusters and then with further
increments the clustered profile becomes a mess (more number of data
points are considered to belong to same neighborhood).
Stochastic part of tSNE induces probabilistic embedding, results change every time tSNE
is run on the same data points with same parameter values.
tSNE also expands dense clusters and shrinks sparse clusters, cluster sizes cannot be
compared. tSNE does not preserve distances between clusters.
Never rely on the results of tSNE, experiment it with changing parameter values. You
may need to plot multiple plots to visualize.
16.6 t-SNE on MNIST
model = TSNE(n_components=2, random_state=0)   # sklearn.manifold.TSNE
tsne_data = model.fit_transform(data_1000)
1. You are given a train data set having 1000 columns and 1 million rows. The data set is
based on a classification problem. Your manager has asked you to reduce the dimension
of this data so that model computation time can be reduced. Your machine has memory
constraints. What would you do? (You are free to make practical
assumptions.)(https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-
asked-at-startups-in-machine-learning-data-science/)
Ans. Close all other applications, random sample dataset, reduce dimensionality
(remove correlated features, Use correlation for Numerical and chi-square test
for categorical, Can use PCA)
If we don’t rotate the components, the effect of PCA will diminish and we’ll have
to select more number of components to explain variance in the data set.
3. You are given a data set. The data set contains many variables, some of which are highly
correlated and you know about it. Your manager has asked you to run PCA. Would you
remove correlated variables first? Why?(https://www.linkedin.com/pulse/questions-
machine-learning-statistics-can-you-answer-saraswat/)
Ans. Since, the data is spread across median, let’s assume it’s a normal distribution.
We know, in a normal distribution, ~68% of the data lies in 1 standard deviation
from mean (or mode, median), which leaves ~32% of the data unaffected.
Therefore, ~32% of the data would remain unaffected by missing values.
Module 3: Foundations of Natural Language Processing and Machine Learning
Chapter 18: REAL WORLD PROBLEM: PREDICT RATING GIVEN PRODUCT REVIEWS ON
AMAZON
Contains: Review ID, ProductId, UserId, ProfileName, Text, Summary, Score, etc.
Task: Given a review determine whether the review is positive; this helps whether we
need to do any improvements in the product;
A single person giving the same review to multiple products at the same time stamp; these
can be treated as duplicate data;
pd.DataFrame.drop_duplicates(subset=[list of features])
Other such errors in data needs to be found out; such as values that need to be
less than 1, data types, etc.
Linear algebra can be applied when the input is in numerical data type;
If we plot the data points of n dimensional vector (including vectorized) text data, we
can utilize linear algebra in terms of distance measurements and the side to which the
data points exist with respect to a plane that separates the positive data points from
negative data points;
Example:
Construct a vector of size d; each element in the vector belongs to words such
as ‘a’, ‘an’, ‘the’, ‘pasta’, …, ‘tasty’, ….
Most of the elements in the vector are zero; thus we will have a sparse matrix;
BOW: The reviews R1 and R2 are completely opposite but the data points are closer to
each other; BOW does not perform well when there is a small change in the data;
BOW depend on count of words in each review, this discards the semantic meaning of
the documents and documents that are completely opposite in meaning can lie closer to
each other when there is very small change in the document (with respect to the words)
The underlined words above are stop words which do not add any value to the semantic
meaning of the document;
Now with this we will have sparsity reduced; (nltk tool kit has stop words)
Stemming: Tastes, tasty and tasteful: indicate the same meaning which relates to taste;
With stemming we can combine words which have same root words;
Lemmatization takes lemma of a word based on its intended meaning depending on the
context of surrounding words;
18.6 UNI-GRAM, BI-GRAM, N-GRAMS.
Uni grams consider each single word for counting; Bi grams considers two consecutive
words at a time; We can have n grams similarly;
Uni grams based BOW discards sequence of information; with n grams we can retain
some of the sequence information;
BOW: one way to vectorize text data: using n grams, preprocessing and there were
variations of bag of words;
TF(W_i, r_j) = (# of times W_i occurs in r_j) / (total number of words in r_j); can also be
thought of as the probability of finding the word W_i in the document r_j;
r_j: jth review, D_c: document corpus, W_i: ith word, N: # of documents, n_i: # of documents
that contain W_i; IDF(W_i, D_c) = log(N / n_i);
Rare words in documents that occur more frequently in a review will have high TF-IDF
value; TFIDF = tf(Wi, rj) * idf(Wi, Dc);
More importance to rarer words in corpus and give more importance to words that are
frequent in a review;
IDF without a log will have large numbers that will dominate in the ML model;
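A minimal TF-IDF sketch with sklearn (the two toy reviews are assumed):
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["the pasta was tasty", "the pasta was not tasty at all"]

tfidf = TfidfVectorizer(ngram_range=(1, 2))   # uni-grams and bi-grams
X = tfidf.fit_transform(reviews)              # sparse matrix: rows = reviews, columns = n-grams
print(X.shape)
print(tfidf.get_feature_names_out())          # requires sklearn >= 1.0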
18.9 WORD2VEC.
Word is transformed into a d dimension vector; This is not a sparse vector and generally
has dimensionality << BOW and TFIDF
If two words are semantically similar, then vectors of these words are closer
geometrically;
Word2vec learns these properties automatically; Why part of the Word2vec can
be learnt in Matrix Factorization;
Working: Takes a text corpus (large size): for every word its builds a vector: dimension of
the vectors is given as an input; this uses words in the neighborhood: if the
neighborhood is similar then the words have similar vectors;
Google took data corpus from Google News to train its Word2Vec method;
18.10 AVG-WORD2VEC, TF-IDF WEIGHTED WORD2VEC
Vector V1 of review r1: avg w2v(r1) = (1/# words in r1) (w2v(w1) + w2v(w2) + ….)
Sklearn’s CountVectorizer()
With sparse matrices, we can store sparse matrices with a memory efficient method by
storing the row and column indices with corresponding values;
Removing html tags, punctuations, stop words, and alpha numeric words, ensuring
number of letters >2, lowercases, using stemming and lemmatization;
Python module re: regular expressions; Can be used extensively for text preprocessing;
Using the gensim module: stemmed words may not have w2v vectors in a pre-trained model
Working of classification:
We are using text which is most informative; text is converted into a vector and based
on vector we have either positive or negative reviews;
Task of classification is to find a function that takes input text vector and gives output as
positive or negative for any new review text;
Y = F(X)
The classification model gets training data and the model learns the function F observing
examples; then this learnt function is used to predict the output of new unseen input
data; Unseen data implies that the data is not used during training;
Given input matrix X; each row Xi represents a review text vector; for each review we
will have a class label which is denoted as Yi;
In Machine Learning we will leverage Linear Algebra for computations; We cannot input
text into Linear Algebra; thus even y value are vectorized; here we have two classes thus
we can have a binary label 0 or 1; 1 representing positive review, 0 for negative review;
For MNIST: D_n = {(x_i, y_i) | x_i ∈ R^(28x28), y_i ∈ {0, 1, 2, ..., 9}}: multi-class classification
Blue points: + ve data points; Orange points: - ve data points; Yellow: query point (xq)
Task: Predict the class of query point: It can be classified into a class based on the
classes of the neighbors of the query data point;
Get nearest neighbors and apply majority vote; If K is even then majority votes can
result in ties, thus avoid even numbers for k
1. Outliers (nearest neighbors are far to the outlier query point, assigning a class
based on kNN is not good)
x1 ∈ R^d and x2 ∈ R^d:
L2 norm of a vector: ||x1||_2 = (Σ_i x1i²)^(1/2)
L1 norm of a vector: ||x1||_1 = Σ_i |x1i|
P>2
X1 = [0, 1, 1, 0, 1, 0, 0]
X2 = [1, 0, 1, 0, 1, 0, 1]
X1 = ‘abcadefghik’
X2 = ‘acbadegfhik’
Similarity vs Distance: As distance increases the two points are less similar;
cos_dist = 1 – cos_sim
cos_sim(x1, x3) = 1
cos_dist(x1, x3) = 1 – 1 = 0
Even though dist(x1, x3) > dist(x1, x2): cos_dist(x1, x3) < cos_dist(x1, x2);
Break the dataset into two sets train and test and there is no intersection
between train and test data points (each data point either goes to train set or to
test set)
Splitting can be done randomly (one of the methods) into train set = 70% of the
total dataset and test set = rest 30% of the total dataset;
For each point in test dataset; make data point as a query point, then use train
set and kNN model to determine yq; then if class label is correct increment a
counter by 1; for metrics we can use this counter which is the number correct
classifications in the test set; accuracy can be computed by (counter/test size);
Output: yq
Knnpts = []
Space: At evaluation:
Dtrain ~O(nd)
k = 1;
Curves that separate +ve points from –ve points are called decision surfaces;
For k = 1:
For k = 5:
For k = n:
Underfit: Not generating any decision surface and classifying all query points with
majority class;
Neither overfit nor underfit: Generating smooth surfaces that are less prone to noise;
We train the model using training set and compute accuracy using test set;
For every k we determine test accuracy and select k that gives best accuracy on test set;
Using test dataset to determine best k or best hyper parameters is not right;
Thus we split the data set into train, cross validate and test datasets; we want the model
to have well generalization ability on future unseen data;
While test set should not be touched, cross validation data becomes untouched while
training which will lead to loss of information;
We can use k-fold cross validation to incorporate cross validation data during training;
We are trying to get the information from cross validation set for ML model training;
With k’ fold cross validation time taken for finding optimal k is multiplied by k’ times;
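A minimal sketch of choosing k with k-fold cross validation (the iris dataset is only a stand-in for the review vectors):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in range(1, 11, 2):                        # odd k values to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10)   # 10-fold cross validation
    print(k, scores.mean())
# pick the k with the best mean CV accuracy, then evaluate once on the held-out test set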
Error: 1 – accuracy
If train error is low and validation error is high then we are overfitting;
For Amazon fine food reviews time based splitting is better as we also have timestamp
feature in the dataset;
If time stamps are available we need to check which of random split or time based split
is better;
In random splitting: we will get reviews that were written before will occur in test set
and the reviews that were written after will occur in train set;
In time based splitting, we will get reviews in train, cross validation and test set based
on time stamps; the reviews that were written first occur in train set, then in cross
validation and then in test set; the argument here is we should avoid predicting past
data based on models trained on future datasets;
Reviews or products change over time; new products get added or some products get
be terminated;
As time passes we require retraining the model with new data;
Regression: D_n = {(x_i, y_i) | x_i ∈ R^d, y_i ∈ R}; we take the mean or median of the y-values of the k neighbours
We can multiply the labels of nearest neighbors with reciprocal of the distance rather
using simple majority vote;
This divides the whole data space into regions based on nearest neighbor;
kNN: Time complexity: O(n) if d is small and k is small and Space complexity: O(n)
Binary search tree: Given a sorted array, we can find presence of a number in the
array in O(log n) time
BST: Given a list of sorted numbers we can build a tree that divides the list in two at
every stage
Kd – Tree:
1. Pick first axis and project all data points on to the axis; pick the middle point; draw a
plane through the middle point this divides the space into two equal halves;
2. Change axis to y;
3. Change to x axis again and repeat steps; the data space is broken using axis parallel
lines or hyper planes into hyper cuboids
19.23 FIND NEAREST NEIGHBOURS USING KD-TREE
Draw a hypersphere with centre x_q and radius Dist(c, q); check whether another point can lie
inside this hypersphere; this can be checked by whether the hypersphere intersects the
y = y1 line;
We track back to y <= y1 condition in the tree and search for another nearest data point;
now point e can also be a nearest neighbor;
If Dist(q, e) < Dist(q, c), then c is discarded as the 1-nearest neighbour and e is taken as the
nearest neighbour; we repeat the steps done with c now with e; we may end up with the result
that point e is the nearest neighbour of the query point q;
Best case Time complexity is O(log n) and worst case t.c. is O(n);
For k Nearest Neighbors Time complexities: Best case O(k log n) and Worst case: O(kn)
O(log n) is valid for uniform distribution and as distribution moves towards real world
clusters, time complexity moves towards O(n);
19.25 EXTENSIONS
When we need to find an element in the array we can use the hash table to look
up for key value that is equal to query element and the value in the hash table
will give us indices;
Locality sensitive hashing: it computes a hash function such that nearest data points
pairs are stored in the same bucket; for a new query point relevant bucket is searched
for and k nearest neighbors are searched through data points in this bucket; this will
reduce the need for searching throughout the data space;
LSH is a randomized algorithm; it will not give same answer every time; gives answers
based on probability;
If two points are close in terms of angular distance, then the points are similar; these
points should go to same bucket in hash table;
We break feature space using random hyper planes; for every plane we will have a unit
normal vector ‘wi’
For plane 1: Say we have x1 above and x3 below the hyper planes; we will have:
w1ᵀ x1 >= 0
w1ᵀ x3 <= 0
Random hyper planes are generated by generating a random W vector; each
value in W vector is randomly generated from Normal distribution N(0,1) (0
mean and 1 variance);
The key will be the hash function value, and all the data points that hash to that key are stored in the corresponding bucket (the value);
Time to construct the hash table: O(mdn) : m hyper planes; d dimensionality, n data
points
Space O(nd)
Time at test time: O(md +n’d); n’ is number of data points in hash table bucket
LSH for cosine similarity can miss nearest neighbors when the nearest neighbor
falls in opposite side to any hyper plane;
To reduce such misses, LSH is repeated multiple times with different random hyperplanes
(multiple hash tables); for a query, the data points retrieved from the matching bucket of
each table are combined by taking their union;
Simple extension:
Every point is projected on each axis of the data space; hash function will
vectorize the data point into same dimensionality vector where each cell
represents the part of the axis the data point projection lies on that axis;
On x axis x1 and x2 fall in same bucket, on y axis these are very far;
19.29 PROBABILISTIC CLASS LABEL
Say, with 3 NN we get a query data point prediction as positive and with 5 data points
we can get the prediction as negative;
Also, say we go for 7-NN; for one query data point it might happen that its nearest
neighbours are 4 +ve and 3 -ve, while another query data point has 7 +ve and 0 -ve
neighbours; both cases give the prediction +ve based on majority vote, but the two query
points do not have the same number of +ve neighbours: query point 1 is less strongly +ve
than query point 2; thus giving a probabilistic quantification of the prediction (the
fraction of neighbours in each class) gives us more confidence;
sklearn.neighbors.KNeighborsClassifier()
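A minimal usage sketch (the breast-cancer dataset is only a stand-in binary problem):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))          # accuracy
print(knn.predict_proba(X_te[:5]))    # fraction of the 7 neighbours in each class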
1. In k-means or kNN, we use euclidean distance to calculate the distance between nearest
neighbours. Why not manhattan distance
?(https://www.analyticsvidhya.com/blog/2017/09/30-questions-test-k-nearest-
neighbors-algorithm/)
2. How to test and know whether or not we have overfitting problem?
3. How is kNN different from k-means
clustering?(https://stats.stackexchange.com/questions/56500/what-are-the-main-
differences-between-k-means-and-k-nearest-neighbours)
4. Can you explain the difference between a Test Set and a Validation
Set?(https://stackoverflow.com/questions/2976452/whats-is-the-difference-between-
train-validation-and-test-set-in-neural-netwo)
5. How can you avoid overfitting in KNN?
21.1 INTRODUCTION
Rather than knowing multiple algorithms; it is important to know type of problems that
arise in real world;
Undersampling: create a new dataset with the 100 n1 points and 100 randomly selected n2
points; result: 100 n1 and 100 n2 points;
Oversampling: create a new dataset with 900 n1 points, obtained by repeating each n1 point
9 times, and the 900 n2 points; repeating points from the minority class makes the dataset
balanced;
We are not losing any data; we can also give weights to classes; more
weight to minority class;
When directly using the original imbalanced dataset; we can get high accuracy with a
dumb model that predicts every query point to belong to majority class;
21.3 MULTI-CLASS CLASSIFICATION
MNIST: y ε {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
kNN uses a majority vote, so it can be extended to multi-class classification very easily;
The dataset is broken into parts such that for the first classifier we will have
dataset with class labels class1 present and class1 absent;
We will build c binary classifiers for multi class classification problem with c
labels;
We will have rows and column with indices as data points and cell values will be
similarity between row data point and column data point; we can convert
similarity matrix into distance matrix using reciprocal of each similarity values;
We can use kNN for this distance or similarity matrices as kNN works on
distances;
We can have new products getting added after some time or old products removed;
We need to check the distribution of train and test data sets for any change over time;
Train a binary classifier on Dn’; if the accuracy is high then Dtrain and Dtest are dis-
similar; we will have small accuracy if Dtrain and Dtest are similar; to get Dtrain and
Dtest follow same distribution (stable data) we will need to get small accuracy;
kNN is highly impacted by outliers when k is small; decision surfaces change due to
outliers;
For every data point, compute its k nearest neighbours and the average of the k distances;
sort the data points by this average distance; points with the largest average distance can be removed, which handles global outliers;
21.8 K DISTANCE
If LOF(xi) is large, then xi is an outlier; it is large when lrd(xi) is small compared to its
neighbors, that is the density near the point is small compared to its nearest neighbors;
sklearn.neighbors.LocalOutlierFactor()
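A minimal LOF sketch (the data and the appended outlier are assumed):
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.normal(0, 1, size=(100, 2))
X = np.vstack([X, [[8.0, 8.0]]])               # one obvious outlier appended

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 for outliers, +1 for inliers
print(labels[-1])                              # -1: the appended point is flagged
print(-lof.negative_outlier_factor_[-1])       # its LOF score (large => outlier)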
Features in a dataset can have different scales; the distance measures will not be
proper; the features that have large scales dominate the distance measurements;
Example:
X1 = [23, 0.2]
X2 = [28, 0.2]
X3 = [23, 1.0]
Dist(X1, X2) = 5;
% difference in X1 and X2 = 0.05 and X1 and X3 = 0.8, though X1 and X3 are far
compared to X1 and X2, Euclidean distance(X1, X2) < E. d. (X1, X3)
Since Euclidean distance can be impacted by scale; column standardization should be
applied before computing distances;
21.13 INTERPRETABILITY
A model that does not give reasoning or does not provide easy access to its
decisions or predictions computation is called a black box model;
kNN is an interpretable model; as it can show the nearest data points based on
which it has made prediction; the doctor can read similar patient records and
come to a conclusion whether a patient has cancer or not;
The data point vector can be comprised of results of medical tests; such as
weight, blood group, etc.
Important features: features that are useful for the machine learning model in making
predictions; identifying them improves model interpretability; feature importance allows us
to understand a model better;
Forward feature selection: Given a model f through forward feature selection we can
use the model itself to get feature importance; Given a high dimensionality dataset we
want to reduce dimensionality to make things easier for computations (curse of
dimensionality); One way is to use PCA/ tSNE; but PCA and tSNE care about distances
and do not care about classification task; but for classification task using forward feature
selection we can discard less important features;
1. Given a dataset of d features; use each feature at a time to train an ML model,
the performance of the model is noted with respect to each feature; the feature
that gave highest accuracy is selected say this is fs1 (feature selected at stage1)
2. Retrain the model with remaining features in concatenation with fs1 one at a
time; we will get fs2; here fs2 + fs1 will give highest accuracy;
3. Repeat these steps up to fsd; these stage wise concatenation of features to gain
high performance from the model is called forward feature selection;
Note: the feature that is second best on its own may not, in combination with the first
selected feature, give the best performance; so at each stage, given the features already
selected, we search among the remaining features for the one that adds the most value to
model performance;
We can also do backward selection, where at each step we remove the feature whose removal
causes the smallest drop in the performance of the model;
At any iteration we are training the ML model; time complexity is very high;
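A minimal forward-selection loop mirroring the steps above (dataset, model and the number of selected features are assumptions):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(3):                        # select the top-3 features, stage by stage
    best_f, best_score = None, -1.0
    for f in remaining:
        cols = selected + [f]
        score = cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_f, best_score = f, score
    selected.append(best_f)
    remaining.remove(best_f)
    print(selected, round(best_score, 3))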
For a height prediction regression problem, we have weight, country, hair color, etc as
features; Weight is a numerical feature which can be given an input to the model; Hair
color and country are categorical text input; ML models take numerical as input; we
require to convert text and categorical features into equivalent numerical feature;
1. Say for hair color we have black, red, brown, Gray as values; we can assign
numbers to each color as black = 1; red = 2, etc; but numbers are said to be
ordinal; 2 is greater than 1 but red greater than black is absurd; with this
numerical conversion we are inducing an artificial order in the categories;
2. Thus one hot encoding is a better option; each category in the color feature is
made as a new binary feature; disadvantage: dimensionality of the dataset
increases and the dataset will be a sparse matrix as in each row there will only
one cell which is non zero;
3. Mean replacement category wise: Replace country column by the average height
of the people from that country;
4. Using domain knowledge: example for country we can replace the value with
distance from some reference country; or coordinate location on map; say we
have some fact that person near equator are tall and person away from equator
is short which is stated by a domain expert;
Ordinal features such as ratings can be converted into numbers that preserve the order
(v. good, good, avg, bad, v. bad) = (5, 4, 3, 2, 1) or (10, 6, 4, 2, 1); the exact numbers
chosen are somewhat arbitrary;
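A minimal pandas sketch of one-hot encoding a categorical feature and mapping an ordinal one (the toy data frame and the rating map are assumptions):
import pandas as pd

df = pd.DataFrame({'hair_color': ['black', 'red', 'brown', 'black'],
                   'rating': ['good', 'bad', 'v. good', 'avg']})

one_hot = pd.get_dummies(df['hair_color'])               # one binary column per category
rating_map = {'v. bad': 1, 'bad': 2, 'avg': 3, 'good': 4, 'v. good': 5}
df['rating_num'] = df['rating'].map(rating_map)          # ordinal feature -> ordered numbers
print(pd.concat([one_hot, df['rating_num']], axis=1))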
Given a dataset Dn, we could have missing values due to many reasonable reasons;
We can impute missing values with mean, median or mode of the feature;
We can have Mean of all points or mean of all positive points or negative points;
When d is high:
In machine learning, let all features be binary (taking values 0 and 1, the minimum possible
number of values per feature); for d features there are 2^d possible value combinations; so
as dimensionality increases, the number of data points required for the model to perform
well increases exponentially;
Hughes phenomenon: for fixed number of training samples the predictive power
reduces as the dimensionality increases;
As dimensionality increases, the number of data points required for the model to perform
well increases exponentially;
If a large number of random data points are selected in a high dimensionality space,
the minimum of distance between every pair of points is roughly similar to the
maximum of distance between every pair of points; this makes comparison of
Euclidean distances for data points similarity impossible as dimensionality increases;
For high dimensional data, the impact of curse of dimensionality is high for dense
data compared to the impact on sparse data;
kNN on text data: cosine similarity and sparse representations as these are less
impacted;
Additionally:
You can also consider the length of the diagonal of a unit hyper cube: in n dimensions it is sqrt(n);
Thus even though two data points lie in the same unit hyper cube they can be very far apart;
Similarly, the average distance between random data points in an n-D unit hyper cube also
grows with sqrt(n);
Thus the average distance you can get from a 6D dataset is very large; and real
world problems come with dimensionality in range of 100s and 1000s, which
worsens the distance measurements and the intuition of similarity among data
points;
Variance: how much a model changes as training data changes; if the model does
not change much with changes in training data we will have a low variance
model ;
High bias leads to underfitting and high variance leads to overfitting thus we
need a balance between bias and variance; as variance increases bias decreases;
Given dataset D split into Dtrain and Dtest; a model is built on train data; We will have
train error and test error;
If train error is high we will have high bias which means the model is underfitting; if train
error is low then bias will be low
If train error is low and test error is high we will have model overfitting on train data;
this model will have high variance; we can also observe for changes in the model
predictions due to changes in training data;
kNN:
22.1 ACCURACY
Case 1: if we have an imbalanced dataset, with a dumb model we can get high
accuracies; Accuracy cannot be used for imbalanced datasets;
Case 2: If we have models which return a probability score; an inferior model predicts
probability values far from true labels (near 0.5 predictions) while a powerful model
predicts probability values nearer to true labels (far from 0.5 predictions); accuracy can
say that an inferior model is working similar to a powerful model; thus accuracy cannot
give an idea whether a model is inferior or powerful;
In a binary classification problem we have two classes; a confusion matrix can be built
with column vectors as predicted class labels and row vectors as actual class labels;
Confusion matrix cannot process probability values;
To construct a confusion matrix we need to have true class labels and predicted class
labels;
For Multi class classification: we will have a matrix of size cxc where c is the number of
classes;
If the model is sensible, then the principal diagonal elements will be high and off
diagonal elements will be low;
a = True Negatives, b = False Negatives, c = False Positives, d = True Positives
False Negatives: Predicted Negatives that are false, these points are actually positives
False Positives: Predicted Positives are wrong, these are actually negatives;
FPR = FP / N
FNR = FN / P
Model is good if TPR, TNR are high and FPR, FNR are low;
The condition for which of TPR, TNR, FPR and FNR should be considered depends on
domain;
For medical purposes: we don’t want to miss a patient who has a disease; rather a
patient with no disease can further be sent to tests; but a patient cannot be left
untreated due to False Negatives;
Precision = TP/ (TP + FP); of all the predicted positives how many are actually positive
Recall = TP / (TP + FN); of all actual positives how many are predicted positive
F1 score = 2 * Pr * Re / (Pr + Re); the F1 score is high only when both precision and recall
are high; the F1 score is the harmonic mean of precision and recall;
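A minimal sklearn sketch of these quantities (the small label arrays are assumed):
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])   # predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), tn / (tn + fp))   # TPR, TNR
print(fp / (fp + tn), fn / (fn + tp))   # FPR, FNR
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))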
Thresholding:
We can have n thresholds with n TPR and FPR;
These TPR and FPR can be plotted which generates a curve called Receiver Operating
Characteristic Curve;
AUC ranges from 0 to 1; 1 being ideal; AUC can be high even for a dumb model when
the data set is imbalanced;
AUC is not dependent on the predicted values, rather it considers the ordering; if two
models give same order of predicted values then AUC will be same for both the models;
Preferred AUC is a value > 0.5; AUC = 0.5 for a random model and AUC between 0 and
0.5 imply that the predictions are reversed; if the predictions are again reversed then
the new AUC value will be 1 – old AUC;
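A minimal AUC sketch (labels and scores are assumed); rescaling the scores keeps their ordering, so the AUC does not change:
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.3])   # predicted probabilities

print(roc_auc_score(y_true, y_score))                 # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
print(roc_auc_score(y_true, y_score * 0.5))           # same ordering => same AUC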
22.5 LOG-LOSS
Log loss takes care of mis-classifications; this metric penalizes even for small deviations
from actual class label
A simplest model can output for every query point a mean of the whole dataset;
R2 = 1 – (SSres/ SStot);
Case 4: SSres > SStot; R2 < 0: this model is worse than a simple mean model
SS_res = Σ e_i²; if one of the errors is very large, then R² is highly impacted: it is not
robust to outliers;
Very few errors are large; ideally we require 0 errors; from CDF we can get the
percentage of data points that have errors;
1. What is Accuracy?
2. Explain about Confusion matrix, TPR, FPR, FNR, and TNR?
3. What do you understand about Precision & recall, F1-score? How would you use it?
4. What is the ROC Curve and what is AUC (a.k.a. AUROC)?
5. What is Log-loss and how it helps to improve performance?
6. Explain about R-Squared/ Coefficient of determination
7. Explain about Median absolute deviation (MAD)? Importance of MAD?
8. Define Distribution of errors?
Chapter 23: INTERVIEW QUESTIONS ON PERFORMANCE MEASUREMENT MODELS
P(D1 = 2) = 1/6
Mutually exclusive events: P(A|B) = P(B|A) = 0
Independent events: P(A|B) = P(A) and P(B|A) = P(B)
Posterior = likelihood*prior/evidence
P(A, B) = P(B, A)
Example:
P(A3|B) = ?
In the training phase of Naïve Bayes we calculate all likelihood probabilities and
evidence probability;
P(class = No | x') > P(class = Yes | x'); thus the prediction for the data point x' will be
class = No;
As compared to kNN Naïve Bayes is space efficient at run time; we can have low latency
applications;
Text is vectorized;
Naïve Bayes is often used as baseline benchmark model for text classification and
suitable problems; all other algorithms are compared with Naïve Bayes performance;
24.8 LAPLACE/ADDITIVE SMOOTHING
At the end of training all the likelihoods and priors are computed;
At test time:
We will use Laplace smoothing (not Laplacian smoothing) as the word is not
present in training data ideas such as making its likelihood as 0 or 1 or 0.5 is not
right; thus we will add a smoothing value to the numerator and denominator for
likelihood probability of the new word;
Laplace smoothing is applied to all words in the training data and also to new words that
occur in the test data;
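A minimal sketch of the smoothed likelihoods (vocabulary, counts and alpha are assumed toy values):
import numpy as np

alpha = 1.0                                    # Laplace / additive smoothing parameter
vocab = ['pasta', 'tasty', 'bad', 'slow']      # assumed training vocabulary
pos_counts = np.array([40, 35, 2, 1])          # word counts inside positive reviews (assumed)

d = len(vocab)
likelihood = (pos_counts + alpha) / (pos_counts.sum() + alpha * d)   # P(w | positive)
unseen = alpha / (pos_counts.sum() + alpha * d)                      # a word never seen in training
print(likelihood, unseen)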
When many small probabilities are multiplied, the result becomes extremely small, which
affects model performance and result interpretation; we also run into rounding/underflow
errors for very small decimal values;
When α = 0 small change in Dtrain results in large change in the model; high variance,
overfitting;
When α = Very large; all likelihoods will be equal to 0.5; this results in underfitting, a
high bias model; this will result in predicting majority class as class labels for all test data
points;
Sort the probability values in descending order; for each class we will get an order of
features which are important;
Features which have high likelihoods are most important features in classifying a data
point;
Interpretability: Based on likelihood probability of features we can get why a data point
is classified to a certain class;
The class prior favours the dominating class when comparing the probabilities of a data
point belonging to each class;
When Laplace smoothing is applied, alpha has a larger impact on the minority class;
24.13 OUTLIERS
Outliers at test time are taken care of by Laplace smoothing; if a word occurs very
infrequently in the training data, it can be discarded;
24.14 MISSING VALUES
Numerical: Imputation
Let us assume that the numerical feature follows a Gaussian distribution with some
mean and standard deviation; first we restrict attention to the data points that
belong to the class under consideration; from only those points we estimate the mean and
standard deviation, and the likelihood of a query value is then computed from the PDF of
that Gaussian distribution;
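A minimal GaussianNB-style sketch of this class-conditional likelihood (the sample values are made up):
import numpy as np
from scipy.stats import norm
feature_given_class = np.array([4.9, 5.1, 5.3, 5.0, 5.2])   # feature values of points in this class
mu, sigma = feature_given_class.mean(), feature_given_class.std()
x_query = 5.05
likelihood = norm.pdf(x_query, loc=mu, scale=sigma)          # P(x | class) read from the Gaussian PDF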
Naïve Bayes has a fundamental assumption that all features are conditionally
independent;
Naïve Bayes cannot be directly applied on distance matrix; we need probability values;
Naïve Bayes can be extensively used for text classification; as dimensionality increases
we need to consider using log of probabilities;
24.19 BEST AND WORST CASES
Even when the features are not strictly conditionally independent, NB can still work reasonably
well in practice, as opposed to the theoretical requirement that NB is justified only for
conditionally independent features;
NB is extensively used for text classification and categorical features; NB is not much
used for real-valued features, as real-world distributions come in varied forms other than
Gaussian;
NB is interpretable, we have feature importance, low runtime complexity, low train time
complexity, low run time space complexity; NB is basically performing counting;
If data is linearly separable (a hyper plane can separate the data points into two classes);
Assumptions:
||W|| = 1
Then, di = WTxi;
Now we will have a classifier where if the data point is in the same direction of the
normal vector then it belongs to positive class else negative;
For positive data points: yi WTxi > 0 the data point is correctly classified
For negative data points: yi WTxi > 0 the data point is correctly classified
For all data points: yi WTxi > 0 the data point is correctly classified
W* = argmax over W of Σᵢ (yᵢ WᵀXᵢ)
This formulation is largely impacted by outliers or any changes in the training data;
Idea of squashing: if the signed distance is small use it as it is and if the signed distance
is large make it small;
We have sigmoid function which has this property; we can have other functions
Sigmoid: σ(z) = 1/(1 + e^(−z)), applied here to the signed quantity z = yᵢ WᵀXᵢ
So if a point lies on the hyperplane we have WᵀX = 0; the probability of that point belonging to
the positive (or negative) class is then 0.5; this can be seen from sigmoid(0) = 0.5;
Sigmoid is easily differentiable and has probabilistic interpretation which helps in solving
the optimization problem;
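A minimal sketch of the squashing idea (NumPy, made-up signed distances):
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
print(sigmoid(0.0))     # 0.5 -> point lies on the hyperplane
print(sigmoid(0.5))     # small signed distance stays roughly proportional
print(sigmoid(100.0))   # very large signed distance is squashed to ~1, limiting outlier impact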
yi = {-1, +1}
The formulation is same but the sigmoid version will be less impacted by outliers
Probabilistic interpretation:
yi = {0, 1}
yi = {-1, +1}
Geometric intuition: The W vector is the optimal weight vector normal to a hyper plane
that separates data points into classes where positive data points are in the direction of
W;
For Logistic Regression: If WTxq > 0 then yq = +1; If WTxq < 0 then yq = -1, if the
point is on the hyper plane we cannot determine the class of the query point;
Case 1: if Wᵢ is +ve, then as x_qi increases, Wᵢ·x_qi increases, sigmoid(WᵀX_q) increases, and
P(y_q = +1) increases;
Case 2: if Wᵢ is -ve, then as x_qi increases, Wᵢ·x_qi decreases, sigmoid(WᵀX_q) decreases, and
P(y_q = +1) decreases while P(y_q = -1) increases;
Let Zi = yi WT xi/||W||
If the selected W results in correctly classifying all training points and if Zi tends to
infinity, then W is the best W on training data; this is a case of overfitting on training
data as this does not guarantee good performance on test data; the train data may
contain outliers to which the model has been perfectly fitted;
If λ is large then the loss term is diminished; the training data barely
participates in the optimization and we are essentially optimizing only the regularization
term; this leads to an underfitting, high-bias model;
L1 Regularization: creates sparsity; the weights of less important features become exactly zero;
If L2 regularization is used, the weights of less important features become small but
remain non-zero;
No single book encountered covers all three interpretations of logistic regression: geometric,
probabilistic and loss minimization;
Naive Bayes: real valued features are Gaussian distributed and class label is a Bernoulli
random variable;
P(Y=1|X) = 1/(1+exp(WTx))
Logistic Regression: Gaussian Naïve Bayes + Bernoulli;
Link: https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
25.8 LOSS MINIMIZATION INTERPRETATION
Compute cross validation error and select the best hyper parameter value from
the plot;
Random Search: avoids brute force and reduces the time spent on hyper parameter
tuning by looking at a smaller set of hyper parameter choices; the technique samples
random hyper parameter values, computes the cross validation error for each, and in turn
provides a good hyper parameter choice that optimizes the algorithm;
If there is co-linearity then we cannot interpret feature importance from weight vector;
Two features are collinear if one feature can be expressed as a (linear) function of the other;
To use the weight vector for interpreting feature importance, we need to remove
multicollinear features;
Multicollinear features can be detected by adding noise to the features: if we add a small
perturbation to a feature's values and the weight vector (after retraining) varies a lot,
then the features are multicollinear; we then cannot use the W vector for feature
importance;
Training Logistic Regression: Solving the optimization problem using Stochastic Gradient
Descent;
Train time:
Run time:
Space: O(d)
Time: O(d)
Logistic Regression is applicable for low latency applications; popular algorithm for
internet companies, memory efficient at run time;
As λ increases, bias increases, sparsity increases and hence run-time latency decreases; but the
resulting model is just a working model, not the best-performing one, since regularization and
sparsity have been traded for speed;
Imbalanced data: upsampling or downsampling; with imbalanced data the optimizer can find a
degenerate hyperplane that places all data points on one side, optimizing the loss only for the
majority class;
Outliers: we can compute W from Dtrain through training and generate distances WᵀXᵢ;
the points that have very large distances are outliers and can be removed to make
Dtrain outlier free; the model is retrained on the outlier-free dataset to get the
final weight vector
Missing values: mean, median or model-based imputation;
Input similarity matrix: Extension of LR, Kernel LR (SVM) can be trained on similarity
matrix;
Best case:
Data is linearly separable, low latency, faster to train; high dimensionality (higher
chance that data is linearly separable and we can use L1 regularization)
Worst case:
A circularly separable data set can be transformed into linearly separable data set by
transforming features into a quadratic space;
F1² + F2² = R² is not linear in F1 and F2, but it is linear in F1² and F2²;
sklearn.linear_model.LogisticRegression; sklearn.model_selection.GridSearchCV
GridSearchCV will train one model per hyper parameter value in the supplied grid (e.g., 5 values of C);
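A minimal sketch (synthetic data) of tuning C for Logistic Regression with GridSearchCV:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}          # C = 1/lambda; 5 hyperparameter values
grid = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000),
                    param_grid, cv=5, scoring="neg_log_loss")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)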
Link: http://cs229.stanford.edu/notes/cs229-notes1.pdf
Chapter 26: LINEAR REGRESSION
Linear Regression: Find a hyper plane that best fits the training data (continuous
variable data)
yi = WTxi + b
π: best fit plane; the points that are not on the hyper plane have errors due to incorrect
prediction;
As there are positive and negative valued errors we need to make the values free from
signs; we can use squared error;
Formulation:
Regularization:
Loss minimization:
For classification: Zi = yi f(xi) for logistic regression we have sigmoid function, we can also
have step function or hinge loss which form other ML algorithms;
For regression: Zi = yi – f(xi); f(xi) = WTxi + W0; and the loss function is squared loss;
Terminology:
Feature Importance and Interpretability: same as for Logistic Regression; we can use
weight vector elements if the data does not have multi collinear features;
Outliers: In logistic regression we have sigmoid function that is squashing and limiting
the impact of outliers;
In Linear Regression we have squared loss; to remove outliers we can compute
distances from a hyperplane fitted on the training set; the data points that are
very far from the hyperplane are removed and the hyperplane is refitted on the
outlier-free dataset; this can be iterated until satisfied;
Outliers impact the model heavily; this technique of iterated removal of outliers from
model training is called RANSAC;
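A minimal sketch using sklearn's RANSACRegressor, which implements this iterative outlier-robust fitting (synthetic data with injected outliers):
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, 100)
y[:5] += 50                                           # a few outlier targets
ransac = RANSACRegressor(LinearRegression()).fit(X, y)
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)   # close to 3 and 2 despite outliers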
Boston housing dataset; load data and split data; do EDA and Feature engineering;
27.1 DIFFERENTIATION
y = f(x)
dy/dx = change in y due to change in x = (y2 – y1)/ (x2 – x1) = slope of the tangent to f(x)
Link: https://www.derivative-calculator.net
Most optimization problems cannot be solved in closed form; we will use (stochastic) gradient
descent to solve such optimization problems;
Logistic loss:
Solving the logistic loss is hard and thus we can use gradient descent technique
Iterative algorithm; initially we make a guess on the solution and we move towards the
solution iteratively through solution correction;
Gradient Descent:
If learning rate does not reduce, gradient descent can jump over the optimum and this
can be an iterative jump over where the algorithm does not reach the optimum; we are
having oscillations without convergence;
We should reduce step size that is the learning rate is reduced at every iteration such
that the convergence is guaranteed;
With Gradient descent we need to compute the updates over all data points which is
expensive;
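A minimal sketch of gradient descent with a shrinking step size for simple linear regression (synthetic data, squared loss):
import numpy as np
rng = np.random.RandomState(1)
x = rng.uniform(-1, 1, 200)
y = 4 * x + 1 + rng.normal(0, 0.1, 200)
w, b, eta = 0.0, 0.0, 0.5
for t in range(1, 201):
    y_hat = w * x + b
    grad_w = -2 * np.mean((y - y_hat) * x)   # d(MSE)/dw over the whole dataset
    grad_b = -2 * np.mean(y - y_hat)         # d(MSE)/db
    lr = eta / np.sqrt(t)                    # reduce step size each iteration for convergence
    w, b = w - lr * grad_w, b - lr * grad_b
print(w, b)                                  # approaches the true values 4 and 1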
Linear regression:
We can get solution for optimization by solving the partial derivatives of Lagrangian
function with respect to x, λ and μ.
Sparsity implies that most of the elements of the weight vector are zero;
In the optimization formulation, for the purpose of comparing the regularizers, the loss term and
λ can be ignored since they are the same for both L1 and L2 regularization;
The L2 formulation has: min over W of (W1² + W2² + … + Wd²); in one dimension this is a parabola
(a paraboloid in general);
1. After analysing the model, your manager has informed you that your regression model is
suffering from multicollinearity. How would you check whether this is true? Without losing any
information, can you still build a better model?(https://google-interview-hacks.blogspot.in/2017/04/after-analyzing-model-your-manager-has.html)
2. What are the basic assumptions to be made for linear
regression?(https://www.appliedaicourse.com/course/applied-ai-course-
online/lessons/geometric-intuition-1-2-copy-8/)
3. What is the difference between stochastic gradient descent (SGD) and gradient descent
(GD)?(https://stats.stackexchange.com/questions/317675/gradient-descent-gd-vs-
stochastic-gradient-descent-sgd)
4. When would you use GD over SGD, and vice-versa?(https://elitedatascience.com/machine-learning-interview-questions-answers)
5. How do you decide whether your linear regression model fits the
data?(https://www.researchgate.net/post/What_statistical_test_is_required_to_assess
_goodness_of_fit_of_a_linear_or_nonlinear_regression_equation)
6. Is it possible to perform logistic regression with Microsoft
Excel?(https://www.youtube.com/watch?v=EKRjDurXau0)
7. When will you use classification over regression?(https://www.quora.com/When-will-
you-use-classification-over-regression)
8. Why isn't Logistic Regression called Logistic Classification?(Refer
:https://stats.stackexchange.com/questions/127042/why-isnt-logistic-regression-called-
logistic-classification/127044)
External Resources:
1. https://www.analyticsvidhya.com/blog/2017/08/skilltest-logistic-regression/
2. https://www.listendata.com/2017/03/predictive-modeling-interview-questions.html
3. https://www.analyticsvidhya.com/blog/2017/07/30-questions-to-test-a-data-scientist-on-
linear-regression/
4. https://www.analyticsvidhya.com/blog/2016/12/45-questions-to-test-a-data-scientist-on-
regression-skill-test-regression-solution/
5. https://www.listendata.com/2018/03/regression-analysis.html
Module 4: Machine Learning – II (Supervised Learning Models)
The data points can be separated with many hyperplanes, say π1 and π2.
In logistic regression, if a data point is very close to the hyper plane, then the probability
of the point belonging to a class is very close to 0.5. For points away from hyper plane
the probability will be close to 1.
The objective will be to separate the data points with a hyper plane which is as far as
possible from the data points.
π2 is better than π1 as the hyper plane is as far as possible from all data points. Thus, π2
can be taken as a margin maximizing hyper plane.
Take π+ and π- parallel to π2 such that it touches first data point of the respective group.
We get a margin and with SVM we maximize the width of this margin.
Support Vector: Say we have the margin maximizing hyper plane π, and we have π+ and
π-, the data points through which these margin planes pass through are called support
vectors. We have two support vectors in the following plot.
Build smallest convex hulls for each set of data points, find the shortest line connecting
these hulls, and draw a perpendicular bisector to this line to get the margin
maximization hyper plane.
29.2 Mathematical derivation
Find the hyperplane that does margin maximization; let this hyperplane be WᵀX + b = 0, where
W is perpendicular to the hyperplane.
Margin = 2/||W||
If the data points are mixed to some extent, some data points lie inside the margin, or
some positive data points lie on the negative side of the hyperplane, we will never find a
solution under the above constraint. The constraint imposes a hard margin on the SVM
model; thus the margin needs to be relaxed.
We introduce a new variable ξᵢ which tells how far a data point lies on the wrong side of its
margin plane: ξᵢ = 0 if the data point is on the correct side, and otherwise it equals the
amount by which the point violates the margin.
(W*, b*) = argmin over (W, b) of ( ||W||/2 + C·(1/n)·Σᵢ ξᵢ ) subject to yᵢ(WᵀXᵢ + b) ≥ 1 − ξᵢ
and ξᵢ ≥ 0 for all i
This is an objective function which says to maximize the margin and minimize errors.
It does not matter, we can take +k and –k and also want both the support vector planes
be equidistant from the margin maximizing hyper plane. +1 and -1 are chosen for
mathematical convenience.
Zi = yi f(xi) = yi (WTXi + b)
Hinge Loss:
C increases - Overfitting
λ increases – Underfitting
In the dual formulation of SVM the solution depends only on the support vectors (αᵢ > 0 only for the support vectors; all other αᵢ are zero).
SVM can include similarity between data points using the Dual form.
K, the kernel function, measures the similarity between two data points.
Link: http://cs229.stanford.edu/notes/cs229-notes3.pdf
We have:
Without kernel trick the formulation is called as Linear SVM; Task: find margin
maximizing hyper plane; Results look similar for Linear SVM and Logistic Regression;
Quadratic kernel:
Mercer’s theorem tells which similarity functions are valid kernels (the kernel matrix must be symmetric positive semi-definite); the kernel trick implicitly converts the d-dim dataset into a d’-dim dataset s.t. d’ > d;
Qs: How RBF Kernel is related to KNN? How is SVM related to Logistic Regression? How
is Polynomial Kernel related to feature transformations?
We have seen Polynomial and RBF kernels: RBF-SVM ~ KNN and RBF kernel is a general
purpose kernel; Kernel-trick ~ Feature Transformation – Domain specific
Use Sequential Minimal Optimization (SMO) for specialized algorithm (dual formulation)
Ex: libSVM
C-SVM: C>=0
ν (nu) is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors
RBF-Kernel SVR: similar to KNN Regression where nearest data points are calculated and
a mean of the values of data points is calculated;
Links: https://alex.smola.org/papers/2004/SmoSch04.pdf,
https://youtu.be/kgu53eRFERc
29.13 Cases
- RBF with a small sigma can be impacted with outliers as in case of kNN with small k
https://scikit-learn.org/stable/modules/svm.html
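A minimal sketch of an RBF-kernel SVM with sklearn, on synthetic circular data where a linear model would fail:
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # large C ~ harder margin, small gamma ~ smoother boundary
clf.fit(X, y)
print(clf.n_support_)                            # number of support vectors per class
print(clf.score(X, y))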
1. Give some situations where you will use an SVM over a RandomForest Machine
Learning algorithm and vice-versa.
(https://datascience.stackexchange.com/questions/6838/when-to-use-random-forest-
over-svm-and-vice-versa)
2. What is convex hull?(https://en.wikipedia.org/wiki/Convex_hull)
3. What is a large margin classifier?
4. Why SVM is an example of a large margin classifier?
5. SVM being a large margin classifier, is it influenced by outliers? (Yes, if C is large,
otherwise not)
6. What is the role of C in SVM?
7. In SVM, what is the angle between the decision boundary and theta?
8. What is the mathematical intuition of a large margin classifier?
9. What is a kernel in SVM? Why do we use kernels in SVM?
10. What is a similarity function in SVM? Why it is named so?
11. How are the landmarks initially chosen in an SVM? How many and where?
12. Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
13. What is the difference between logistic regression and SVM without a kernel? (Only in
implementation – one is much more efficient and has good optimization packages)
14. How does the SVM parameter C affect the bias/variance trade off? (Remember C =
1/lambda; lambda increases means variance decreases)
15. How does the SVM kernel parameter sigma^2 affect the bias/variance trade off?
16. Can any similarity function be used for SVM? (No, have to satisfy Mercer’s theorem)
17. Logistic regression vs. SVMs: When to use which one? (Let n and m are the number of
features and training samples respectively. If n is large relative to m use log. Reg. or SVM
with linear kernel, if n is small and m is intermediate, SVM with Gaussian kernel, if n is
small and m is massive, create or add more features then use log. Reg. or SVM without a
kernel)
18. What is the difference between supervised and unsupervised machine learning?
Ex:
Given any query point predicting its class label is straight forward.
Case 1: 99% y+ and 1% y-: Entropy = −0.99 log₂ 0.99 − 0.01 log₂ 0.01 ≈ 0.08
Case 2: 50% y+ and 50% y-: Entropy = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1
Entropy has maximum value when all classes are equally probable;
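A minimal sketch of the entropy calculation above (NumPy):
import numpy as np
def entropy(class_probs):
    p = np.array(class_probs, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))
print(entropy([0.99, 0.01]))          # ~0.08, nearly pure node
print(entropy([0.5, 0.5]))            # 1.0, maximum for two equally probable classes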
31.4 KL Divergence
Dist(P, Q): is small when two distributions are similar and closer;
KS statistic: the maximum gap between the CDFs P’ and Q’; the KS statistic is not differentiable and
thus cannot be used as part of a loss function;
Summation or integration depends on the type of random variable (discrete or continuous);
KL divergence is differentiable
IG(Y) != IG(Y)
Result:
Stopping condition for growth of decision tree: Pure nodes or depth, if pure node not
reached then use majority vote or mean;
Task: Split node: using Entropy or Gini Impurity and Information Gain
For Numerical features, sorting of data points is done based on feature values: and
numerical comparisons are done to split nodes;
Threshold depends on order and does not depend on actual values in Decision Trees, No
need for feature standardization
31.10 Building a decision tree: Categorical Features with many possible values
Ex: Pin code/ Zip code: Numerical but not comparable, thus these are taken as
categorical features and there could be a lot of categories in each feature; which will
result in data sparsity or tree sparsity
As depth increases, there is a possibility of having very few data points at a node; the
possibility of overfitting increases and the interpretability of the model decreases; if
depth is too small, underfitting happens.
Decision Trees are suitable: Large data, small dimensionality, low latency requirement
Take weighted errors for splitting, whichever feature gives lowest error is selected;
31.14 Cases
Multi-Class Classification: One versus Rest is not required, as entropy takes all classes
into account while being calculated
Feature Interactions: across depths, rules such as F1 < 2 AND F2 > 5 arise naturally, i.e.
logical feature interactions are built into Decision Trees: advantageous
Feature Importance: For every feature we get Information Gain or can sum up
reductions in Entropy due to this feature. The more reduction, the more important;
Link: http://scikit-learn.org/stable/modules/tree.html
For instance, in the example below, decision trees learn from data to approximate a sine
curve with a set of if-then-else decision rules. The deeper the tree, the more complex
the decision rules and the fitter the model;
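A minimal sketch in the spirit of that sklearn example, fitting a noisy sine curve; max_depth controls the complexity of the if-then-else rules:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)   # few rules, underfits
deep = DecisionTreeRegressor(max_depth=10).fit(X, y)     # many rules, starts fitting the noise
print(shallow.score(X, y), deep.score(X, y))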
1. You are working on a time series data set. You manager has asked you to build a high
accuracy model. You start with the decision tree algorithm, since you know it works
fairly well on all kinds of data. Later, you tried a time series regression model and got
higher accuracy than decision tree model. Can this happen? Why?(Refer
:https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-
startups-in-machine-learning-data-science/)
2. Running a binary classification tree algorithm is the easy part. Do you know how does a
tree splitting takes place i.e. how does the tree decide which variable to split at the root
node and succeeding nodes?(Refer:https://www.analyticsvidhya.com/blog/2016/09/40-
interview-questions-asked-at-startups-in-machine-learning-data-science/)
https://vitalflux.com/decision-tree-algorithm-concepts-interview-questions-set-1/
Chapter 33: Ensemble Models
In Machine Learning, Multiple models are brought together to build a powerful model.
The multiple models may individually perform poorly but when combined they become
more powerful, this combination to improve performance of several baseline models is
called ensemble.
Key aspect: the more different the base line models are, the better they can be
combined
At run time, the query point is passed to all the models and a majority vote is applied
Bagging: Take Base models having low bias and high variance and aggregate them to get
a low bias and reduced variance model
Decision Trees with good depth are low bias and high variance models: Random Forest
X of shape n×d is sampled down to matrices of shape m×d’ where m < n and d’ < d; m and d’ can also differ between samples;
We will have k Decision Trees trained on these k samples and Aggregation (majority
weight or mean/median) is applied on the output of the k base learners (DTs);
The data points that are removed during selection for each sample (out of bag points)
can be used for cross validation for the corresponding model (out of bag error is
evaluated to understand performance of each base learner)
Random Forest: Decision Tree (reasonable depth) base learner + row sampling with
replacement + column sampling + aggregation (Majority vote or Mean/Median)
Tip: Do not have a favorite algorithm
These ratios are fixed initially and the number of base learners is
determined through hyper parameter tuning (cross validation)
K base learners;
Train time complexity: O(n * lg n * d * k); Trivially parallelizable, each base learner can
be trained on different core of the computer;
Code: sklearn.ensemble.RandomForestClassifier()
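A minimal sketch (synthetic data) of a Random Forest with out-of-bag evaluation:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=100,      # k base learners
                            max_features="sqrt",   # column sampling per split
                            oob_score=True,        # out-of-bag points act as built-in validation
                            n_jobs=-1,             # trivially parallel training
                            random_state=0)
rf.fit(X, y)
print(rf.oob_score_, rf.feature_importances_[:5])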
For numerical features we apply thresholds (using sorting) to get Information Gain;
Randomization is used to reduce variance; bias does not increase due to randomness;
33.8 Random Forest: Cases
Random Forest: Decision Tree + (Row samp. + Col samp + agg) (used to reduce variance)
3. Feature Importance:
In DT, the overall reduction in entropy (i.e. the IG) due to this feature at various levels
in the DT (single DT)
In RF, overall reduction is checked across all base learners for feature importance
Bagging: High variance and low bias models: with randomization and aggregation to
reduce variance
Boosting: take low variance and high bias models: use additive combining to reduce
bias;
Given a dataset:
Stage 0: Train a model using the whole dataset: This model should have high bias
and low variance: DT with shallow depth; Large Train error;
Let the problem be a regression problem: The loss function is say squared loss
Link: https://youtu.be/qEZvOS2caCg
For non squared loss functions: Is pseudo residual equal to negative gradient?
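Yes: the pseudo residual is defined as the negative gradient of the loss with respect to the current prediction, and for squared loss it equals the ordinary residual. A minimal stage-wise sketch with shallow trees on synthetic data:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.2, 200)
pred = np.full_like(y, y.mean())                    # stage 0: high-bias base model (just the mean)
for _ in range(20):                                  # stages 1..M: fit shallow trees to the residuals
    residual = y - pred                              # pseudo residual (negative gradient of ½·squared loss)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += 0.1 * tree.predict(X)                    # 0.1 = learning rate / shrinkage
print(np.mean((y - pred) ** 2))                      # training MSE decreases stage by stage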
Adaptive boosting: At every stage you are adapting to errors that were made before;
more weight is given to misclassified points
Individual classification models are trained based on the complete training set; then a
meta-classifier is trained on the outputs – Meta features of the individual classification
models in the ensemble. The individual models should be as different as possible.
Train all models independent of each other; given a data point query it is passed
through all models and all the outputs are passed through a meta-classifier;
Real World: Stacking is least used on real world problems due to its poor latency
performance
33.17 Cascading classifiers
Used when the cost of making a mistake is high; models are defined sequentially, each model
working to classify data points one after the other; whichever data points are perfectly
classified with high probability by a model are not shown to the next model; generally at
the end of all the models a human is placed to make predictions on the data points the
cascaded models are unsure about;
Because of this, complex ensembles are generally only experimented with; these models are
not suitable when low latency, short training time or interpretability is required;
34.1 Introduction
Data can be in the form of Text, categorical, numerical, time series, image,
database tables, Graph-data, etc.
Example: ECG
We can have daily shopping patterns or weekly shopping patterns or annual patterns
Given a composite wave: It can be decomposed into multiple sine waves, different sine
waves have different time period, and different peaks: this is transformed onto
frequency domain; This process is called as Fourier transformation;
For each problem: a specific set of features are designed prior to Deep Learning; A set of
features designed for a problem did not work for another problem
Color histogram: Each pixel has 3 color values; for each color plot a histogram (0 to 255)
and vectorize it; Different objects have distinct color attributes, sky and water come in
blue, human faces have cream, wood brown, metal grey, etc.
Edge histogram: At the interface of color separation; region based edge orientation;
Detects key points or corners of object and creates a vector for each key point;
Features: Number of paths different nodes of the Graph, Number of mutual friends
Height: If H<120 return 1 elif H<150 return 2 elif H<180 return 3 else return 4
Task: Predict gender given, height, weight, hair length, eye color
Using Decision Trees: Create new features using all leaf nodes;
X is a feature: log(X), sin(X), X2, etc. can be used to transform the features;
Say f1 is power law distributed: when using Logistic Regression: transforming with log is
beneficial;
3 features and a regression target: say we have f1 – f2 + 2 f3: Linear models are better;
Decision Trees may not work here
With Bag of Words: Linear models tend to perform very well (Linear SVM and Logistic
Regression);
The more different the features (Orthogonal relationship) are, the better the models will
be;
If there are correlations (less orthogonality) among the features, the overall prediction
performance of the model on the target will be poor;
Split data into feature values and build different models on the splits separately;
Split criteria: feature values are different and sufficient data points are available;
If:
then class 1
These may not be exact probabilities; the sigmoid only guarantees values between
0 and 1;
Calibration is required for computing the loss functions; these loss functions might
depend on exact probabilities;
Model f trained on a training dataset; For each data point this model will give an output:
In each chunk compute average of y_pred and y_true in the chunk; avg_y_pred
and avg_y_true
Task of calibration: make avg_y_pred close to avg_y_true; an ideal model
will result in a 45° line between avg_y_pred and avg_y_true;
Platt scaling works only if the Calibration dataset distribution is similar to sigmoid
function;
Works even if calibration plot does not look like sigmoid function; if we fit a sigmoid
function there will be a large error
Break the model into multiple linear regression models; piece wise linear models
Solve an optimization where minimize the gap between the gaps of the model and the
actual value;
Large data → large cross validation dataset → large calibration dataset → use Isotonic Regression
Links: http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration.html#sphx-glr-auto-examples-calibration-plot-calibration-py
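A minimal sketch of calibrating a classifier with sklearn (Platt scaling via method="sigmoid"; use method="isotonic" for large data); the data is synthetic:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
base = LinearSVC()                                   # outputs scores, not probabilities
cal = CalibratedClassifierCV(base, method="sigmoid", cv=3)
cal.fit(X_tr, y_tr)
print(cal.predict_proba(X_te)[:3])                   # calibrated probability estimates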
Determine outliers based on loss function and remove some outliers; repeat the
process until satisfied; Reduction of dataset at each stage with removal of
outliers based on random sampling, model training and loss computation;
Link: http://scikit-learn.org/stable/modules/model_persistence.html
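A minimal sketch of persisting a trained sklearn model with joblib (the file name is arbitrary):
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")        # serialize the trained model to disk
restored = joblib.load("model.joblib")    # reload it later for predictions
print(restored.predict(X[:3]))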
2. Custom implementation:
For low latency applications; Use C/ C++ code for faster predictions;
Store weights using data structures, Using CPU cache; RF and DTs with if
else statements
Models need to be retrained periodically to capture current trends and also to handle new
entities (new companies, new items, etc.)
Example: With respect to curing cold: If a new drug is found and need to be
experimented for its effectiveness: A part of group of patients is administered with the
new drug and another part of the group is administered with the old drug. The group given the
old drug is called the control group and the group given the new drug is called the treatment
group. The control and treatment groups are called the A and B groups.
7. Deployment
11. Optimization: improve models, more data, more features, optimize code
35.11 Productionization and deployment of Machine Learning Models
35.13 Hands on Live Session: Deploy and ML model using APIs on AWS
35.14 VC dimension
VC dimension is mostly used for research work not generally found in applied work;
VC dim (linear model) = maximum number of points that can be shattered by a linear
model over all possible configurations = 3 in two dimensions (for points that are not collinear);
in general it is d + 1 for a linear model in d dimensions
Theoretically, RBF SVM is more powerful than all of these models as its VC dimension is infinite;
Example:
Similarity: Points are close to each other in the same cluster and very far in different
clusters
For classification and regression we had metrics such as precision, MSE, etc.
Classification and Regression: Supervised Learning: We had target variable to train the
models
Semi-supervised Learning: the dataset has a small number of labeled (supervised) data points and
a large number of unlabeled data points;
43.3 Applications
Data Mining;
Ecommerce (group similar customers based on their location, income levels, purchasing
behavior, product history)
Review analysis (text): manual labeling is time consuming and expensive; clustering can
be applied: group the reviews into, say, 10k clusters based on word similarities (syntactic and
semantic similarity), then review and label each cluster (pick some points and check their
labels); pick points that are closer to the cluster centre and avoid outliers
Intra cluster should be small and inter cluster should be large: This is how we can
effectively measure the performance of the clustering algorithm
Dunn index: ratio of the minimum inter-cluster distance to the maximum intra-cluster distance; a larger value indicates better clustering;
For every cluster K Means assigns a centroid and groups the data points into clusters
around the centroid, no point belongs to two clusters and every data point belongs to at
least 1 cluster;
Lloyd’s Algorithm:
1. Initialization: Randomly pick k data points from the Dataset and call them centroids
2. Assignment: For each point in the Dataset, select the nearest centroid through the
distance and add this data point to the centroid corresponding cluster
3. Re-compute centroids: update each centroid to the mean of its cluster; repeat steps 2 and 3 until the centroids stop changing (convergence).
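A minimal NumPy sketch of Lloyd's algorithm (k and the data below are assumptions for illustration; empty clusters are not handled in this sketch):
import numpy as np
def lloyds_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]            # 1. random initialization
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                               # 2. assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # stop once centroids stabilize
            break
        centroids = new_centroids                                   # 3. re-compute centroids
    return centroids, labels
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])
centroids, labels = lloyds_kmeans(X, k=2)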
K-Means have initialization sensitivity: the final result changes when the initialization is
changed
Link: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
Repeat k means with different initializations: pick best clustering based on smaller intra
distances and larger inter cluster distances;
Clusters with different sizes, densities and non-globular shapes and presence of outliers
Different densities:
Use larger k to get many clusters and put together different clusters and avoid above
problems;
43.10 K-Medoids
Centroids may not be interpretable, for example with bag-of-words vectorization of text;
instead of giving centroids computed as means, outputting an actual data point (an actual
review) is more interpretable; this representative point is called a medoid;
sklearn.cluster.KMeans
Agglomerative clustering: Takes each data point as a cluster and groups two nearest
clusters into one cluster until the number comes to k; Stage wise Agglomeration of
clusters;
Divisive clustering starts in the reverse order: it takes all data points as 1 cluster and divides
clusters stage by stage; how to divide is the hard part; Agglomerative is more popular;
Dendrogram
Link: https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
Due to similarity presence in the calculations: Min, Max, Avg can be kernelized
Ward’s Method: squared distances are taken; everything else is similar to Group Avg
44.4 Time and Space Complexity
Time: O(n3)
sklearn.cluster.AgglomerativeClustering
Hierarchical: Agglomerative
DBSCAN: Dense regions: Clusters, Sparse: Noise: Separate dataset into dense
regions from sparse regions
Dense region: a region which has at least MinPts points within radius eps
Border point: a point having fewer than MinPts points in its eps radius but lying in the
neighborhood of a core point
Density edge: an edge between two core points which are within a radius <= eps of each other
Density connected points: two points are density connected if there exists a path of density
edges between core points connecting them
1. For each data point: label the point as a core point, border point or noise point
(given MinPts and Eps); this is done using a Range Query (using a kd-tree or kNN)
2. For each core point p not yet assigned to a cluster: create a new cluster and add all
points that are density connected to p into this new cluster
Important operation: RangeQuery: returns all data points that are in eps radius
MinPts: >= d + 1, typically 2*d; for d dimensionality, larger Min pts takes care of noise
Eps: For each data point calculate distance from kth nearest point where k = Min pts
Sort these distances and plot to get an elbow point
DBSCAN: resistant to noise, can handle different sizes of cluster, does not require
n_clusters to be specified;
DBSCAN has trouble with: Varying densities and high dimensionality data, sensitive to
hyper parameters: depends on distance measure which causes curse of dimensionality
Space: O(n)
45.9 Code Samples
sklearn.cluster.DBSCAN()
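A minimal sketch of DBSCAN on synthetic non-globular data (eps and min_samples correspond to Eps and MinPts above):
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))          # cluster ids; -1 marks noise points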
Recommend relevant items to a user based on historical data of the item and the user;
Given dataset A:
Cell values can be ratings or the usage of the item by the user
Cell values which do not have values are left as nan rather than replacing with 0
Matrix A is very sparse as each user can use a small set of items;
Given a user and his history of item usage, recommend a new item that he will
most likely use
Task: predict/fill the empty cells in the sparse matrix using the known (non-empty) cell values;
Collaborative Filtering:
U3:- M1
Content based: uses rating information or the usage matrix values as target variable;
Uses representation of the item and the user (features), such as his
preference, gender, location, item type, movie type, movie title, movie
cast, etc
User-User similarity:
Given matrix with every row of the matrix is a user vector and every column is an
item;
Build user-user similarity matrix; Say we have U1, U2, and U3, who are most
similar to U4, we can recommend items that U4 has not used yet and that U1, U2
and U3 have used already and recommend these items to U4
Item-item similarity:
Each item is a vector; a similarity matrix is built using the similarity between
items;
Ratings on a given item do not change significantly after the initial period;
If we have more users than items and when item ratings do not change much
over time after initial period, item – item similarity matrix is preferred
46.4 Matrix Factorization: PCA, SVD
A = B*C*D = P*Q
2. Compute B and C
3. Matrix completion: Fill the empty cells with B and C
Row vector of B matrix above can be used as useri vectorization and from C we
can have itemj vectorization;
The d-dim representations arrived at using Matrix Factorization: if two users are
similar then the distance between their vectors will be small; similarly for items;
Word vectors
Find k- cluster centroids and corresponding sets of data points; Such that every
data point belongs to only one set and the distance from the data points to the
cluster centroid is minimum;
Define a matrix Z such that Zij = 1 if xj belongs to Set Si else 0; The Matrix Z is sparse and
can be said to be an assignment problem;
If X is decomposed into C and Z through Matrix Factorization:
d-dimensionality is a Hyperparameter;
1. Problem specific
2. Systematic way: optimization: min ‖A − BC‖²
An error-vs-d plot is generated and an elbow point is selected
Link: https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
If a new user joins the system or a new item is added, then there is no ratings data for it
(the cold-start problem);
Recommend top items based on meta-data such as geo location, browser, and device;
Word2Vec: inspired by Neural Networks; models such as LSA, PLSA, LDA and GloVe can be
interpreted as Matrix Factorizations
1. Co-occurrence matrix:
D = {text documents}
Compute X matrix:
Using co-occurrence matrix and applying truncated SVD over the matrix we will get U
matrix. From this U matrix which is of (nxk) shape, we have each row as vector
representation of each word of k dimensionality
Instead of taking all words we can have top words that have good importance;
46.13 Eigen-Faces
Image data: Eigen faces for face recognition (PCA) to get feature vectors; (CNN replaced
all techniques for image tasks)
Link: https://bugra.github.io/work/notes/2014-11-16/an-introduction-to-unsupervised-
learning-scikit-learn/
Link: http://scikit-
learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-
glr-auto-examples-decomposition-plot-faces-decomposition-py
Matrix construction with stacking images row wise where each row in the matrix
contains image data which is flattened into a single vector;
From this matrix: Co-variance matrix is computed; dimensionality reduction is applied
on this co-variance matrix through Matrix Factorization;
Multiply row wise images stacked matrix with the top k left singular vectors or column
wise stacked Eigen vectors of top k eigen values of the co-variance matrix; This
multiplication result is the Eigen Faces
Through Eigen values we can compute % of explained variance (ratio of Eigen values) to
get good k;
sklearn.decomposition.TruncatedSVD()
sklearn.decomposition.NMF()
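A minimal sketch of TruncatedSVD on a sparse term matrix (the tiny corpus is made up; in the word-vector setting the input would be the co-occurrence matrix instead):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
docs = ["the cat sat on the wall", "the dog sat on the mat", "cats and dogs"]
X = CountVectorizer().fit_transform(docs)       # sparse matrix; no dense conversion needed
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)                # n x k dense representation
print(svd.explained_variance_ratio_)            # used to pick a good k (elbow)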
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-
how-to-use-svd-to-perform-pca
Chapter 49: Case Study 9: Netflix Movie Recommendation System (Collaborative based
recommendation)
____________________________________________________________________________
A biological Neuron has Nucleus, Dendrites and a cell body. When a Neuron gets electrical
signals as inputs, the neuron does computations inside and it sends the processed electrical
signals as output maybe to other neurons.
An Artificial Neuron has some inputs that are important, thus weights are included on edges of
the connections.
Perceptron is this single neuron which is loosely inspired from the biological neuron and is not
an exact replica but is powerful enough to solve many interesting problems.
In biology, a neuron does not exist on itself. It is connected to other neurons. A structure of
neurons can be considered for imagination (network). First successful attempt was made in
1986 by a group of mathematicians (Hinton and others). They came up with backpropagation
(chain rule around differentiation). A lot of hype has been generated.
Unfortunately, around 1996, due to insufficient computational power, insufficient data and a lack of
algorithmic advancements, AI experienced a long winter, known as the AI Winter. Funding
for AI got exhausted because the hype was not met. Neural Networks couldn’t take off in the 90s. People shifted to
SVM, RandomForest and other GBDTs between 1995 to 2009, which were giving solutions to
many problems.
Hinton in 2006 released a paper on how to train a Deep Neural Network. Before this NN was
limited to a small number of layers. As the number of hidden layers increased backpropagation
failed.
DNN took developments with a competition on ImageNet Dataset. The task was to identify
objects in images. DNN has performed very well by a large margin on this dataset compared to
other classical ML algorithms.
Noticeable Applications: Siri, Cortana, Alexa, Google Assistant, Self Driving Cars, Health care
purposes
The development is also driven by the availability of data, computational power and new
algorithms.
50.2 HOW DO BIOLOGICAL NEURONS WORK?
- Biological Neuron:
Biological Neuron
Lots of internal chemical activity takes place in a biological neuron, but we can understand its
working with a simple structure (as above). We have biochemistry generated electrical signals.
Each neuron gets connected to a bunch of other neurons. Some dendrites are thicker. This
leads to more weight for that input. A neuron is activated or fired if there is enough input.
The main purpose of the activation function is to add non-linearity in the network so that the
model can learn difficult objective functions.
In ensembles we are training different models and we are combining them on specific
conditions. In NN, all neurons are learning at the same time based on the loss function we have.
Generally we use a nonlinear activation function. For regression problems we use a linear
activation function in the last layer.
Let us consider the growth of a neural network in the human brain. At birth, there are far
fewer connections between neurons. Missing connections can be thought of as having weights equal
to 0. At age 6, the network gets dense, i.e. weights get trained. At age 60, weights or
connections of some edges disappear or become thin, this process is termed as Neural
degeneration.
By age 6, humans learn: language, object recognition, sentence formation, speech, etc.
(massive amounts of learning). Biological learning is basically connecting neurons with edges.
New connections are formed based on data (not random).
y_i_hat = sigmoid(W.T*x_i + b)
A perceptron can also be understood as a linear function trying to find a line that separates two
classes of datapoint.
Bunch of single neurons stacked to form a layer and layers are stacked to form a
network of neurons.
This can be thought of as a function of functions. Thus with MLP we can have complex
functions to act on x to get y. Having MLPs we can easily overfit, thus regularizers are applied to
avoid overfit. MLP is a graphical way of representing functional compositions.
50.6 NOTATION
Let D = {x_i, y_i}; x_i belongs to R(4) and the problem is a regression problem
f(z) = z
Gradient Descent → compute ∇wL using all x_i and y_i; Stochastic Gradient Descent → compute ∇wL using one randomly picked (x_i, y_i) (or a small mini-batch)
D = {x_i, y_i}
b. (η = Learning rate)
c. Perform Updates (step 2) upto convergence
(When computing gradients for backpropagation using chain rule, consider path of flow
of the feature)
In Computer Science we have a powerful idea called as Dynamic Programming. This helps us
calculate the value of a variable only once. Compute anything only once and store that in a
dictionary for reuse.
While computing gradients, certain intermediate gradients can be seen to occur multiple times.
Without storing their values we end up re-calculating them again and again. This will impact the
time of computation of all the gradients in an MLP. With memoization we calculate the value of
each and every gradient only once. This will avoid repeated calculations while keeping run time
minimum.
50.10 BACKPROPAGATION.
Given D = {x_i, y_i}
1 epoch of training → pass all of the data points through the network once.
Backprop: ‘init weights’, ‘in each epoch (forward prop, compute loss, compute derivatives (chain
rule + memoization), update weights)’, ‘repeat till convergence’
tanh:
1.
→ differentiable
ReLU became most popular activation function as sigmoid and tanh resulted in vanishing
gradients
Due to chain rule, multiplication of derivatives which are <1 will result in overall derivative
to be small.
→ W_new ~ W_old
(With sigmoid function, which results in values between 0 and 1, exploding gradients occur
when weights are greater than 1)
1. As number of layers increase, we will have more weights, leading to overfit → low bias,
high variance
2. Logistic Regression model - fewer weights (compared to MLP) - leads to underfitting -
high bias
- MLPs typically overfit on train dataset
- We avoid this using Regularizers (L1 or L2 or Dropouts)
- We will have ‘lambda’ coefficient of regularization term as a hyperparameter.
Also number of layers is a hyperparameter
- Using regularization, we reduce variance
Until 2010, people were trying to build 2 to 3 layered networks due to: vanishing
gradients, limited data (easy to overfit, no weight updates -> no training), limited
compute power.
By 2010 we got lots of data (Internet, labelled data (quality)), compute power has
increased (Gaming GPUs - NVIDIA - found to be suitable for Deep Learning),
advancements in algorithms. This paved way for modern Deep Learning achievements.
With classical ML (Mathematician approach), theory was first built and then proved
through experiments. With DL it became possible to experiment (cheap) with different
ideas first (Engineer Approach) and then develop theory.
Random Neuron Dropout per epoch with a dropout rate (percentage of neurons
dropped)
At training time a neuron is kept with probability ‘p’; at test time all neurons are present,
but to account for the keep probability, the outgoing weights of the dropout layer are multiplied by p;
High dropout rate -> low keep probability -> underfitting -> large regularization
Low dropout rate -> high keep probability -> overfitting -> small regularization
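A minimal Keras sketch of dropout as a regularizer (layer sizes and the dropout rates are arbitrary choices):
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential([
    Dense(128, activation="relu", input_shape=(784,)),
    Dropout(0.5),                       # drop 50% of activations at each training step
    Dense(64, activation="relu"),
    Dropout(0.2),                       # lower dropout rate -> weaker regularization
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])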
●ReLU Advantages:
○ Faster convergence
○ Easy to compute
● ReLU limitations:
○ Non-differentiable at zero
○ Unbounded
○ Dying relu
○ Non-zero centered
a. Why non-linear activations?
If the activation functions are linear, a deep neural network will be similar to a
single layer neural network which cannot solve complex problems. While non-
linearity gives a good approximation of underlying relations.
b. When do we use sigmoid function?
Generally, at the last output layer in case of binary classification problems,
Softmax in output layer is used for multi-class classification problem, linear for
regression.
a: Normal distribution
● Weights should be small (not too small)
● Not all zero
● Good-variance (each neuron will learn differently)
● Weights are random normally initialized
b: Uniform initialization:
● Weights are initialized to a Uniform distribution
○ Unif[-1/sqrt(fan_in),1/sqrt(fan_in)], selection of each value is equally likely
c: Xavier/Glorot init (2010) - useful for sigmoid activations
● Normal: Initialize to a mean centered normal distribution with variance (sigma
sq.) = 2/(fan_in + fan_out)
● Uniform: Initialize to
○ ~Unif[(-sqrt(6)/sqrt(fan_in+fan_out), sqrt(6)/sqrt(fan_in+fan_out)]
● Using fan_in and fan_out also
d. He init - ReLU activations
● Normal: Initialize to a mean centered normal distribution with variance (= sigma
sq.) = 2/(fan_in)
● Uniform: ~Unif[-sqrt(6)/sqrt(fan_in), +sqrt(6)/sqrt(fan_in)]
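A minimal Keras sketch of choosing these initializers per layer (He for ReLU layers, Glorot for sigmoid/tanh layers; layer sizes are arbitrary):
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(64, activation="relu", kernel_initializer="he_normal", input_shape=(20,)),
    Dense(64, activation="tanh", kernel_initializer="glorot_uniform"),
    Dense(1, activation="sigmoid", kernel_initializer="glorot_normal"),
])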
Case 1: W = Scalar
- If L(w) is y = x**2/4a:
- If L(w) is a complex:
Saddle point: gradient is also zero, but not a minima nor a maxima
- Squared loss is often used because its derivative is a linear function, which gives a unique
solution; derivatives of higher-order loss functions can lead to more than one solution, i.e.
multiple minima or maxima; squared loss is generally preferred as it has a single
optimum
Loss function of DL models: dice loss, cross-entropies, MSE. These are defined by the
programmer, mostly non-convex functions are used as their performances are better and
the choice depends on architecture and the network output or our desired target values.
Loss functions represent the Network’s output function.
(Graphically shown)
Simple SGD will get stuck at saddle point, local maxima and local minima
51.8 SGD RECAP
grad(): mini-batch SGD ~ GD (mini batch SGD approximation to GD, and is compute
efficient) (mini-batch SGD is a noisy descent)
With SGD we have gradients at each time step as: a1, a2, a3, …
a1, r*a1+ a2, r*( r*a1 + a2) + a3, …..: a.k.a exponential weighted average
NAG: First move in momentum direction, then compute gradient and then move
in gradient direction: W_t = W_t-1 - gamma * V_t-1 - eta*g’
NAG update ends up slightly closer to the optimum, convergence becomes faster
compared to SGD + momentum.
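A minimal sketch of the momentum update (the exponentially weighted accumulation of gradients), with made-up gradient values:
import numpy as np
gamma, eta = 0.9, 0.1         # momentum coefficient and learning rate
w, v = 0.0, 0.0               # weight and its velocity (accumulated gradient)
grads = [1.0, 0.8, 0.9, 0.7]  # a1, a2, a3, ... (assumed gradient sequence)
for g in grads:
    v = gamma * v + g         # v_t = gamma*v_{t-1} + g_t  (exponentially weighted average)
    w = w - eta * v           # momentum update; NAG would evaluate g at (w - eta*gamma*v) instead
print(w, v)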
51.11 OPTIMIZERS:ADAGRAD
Advantage:
Disadvantage:
Link: https://youtu.be/c86mqhdmfL0
Sparse features makes gradient small (due to summation on all data points)
Update does not happen, thus eta or the Learning rate requires to be larger for
these sparse features, while for dense feature gradient will not be small. Simply
put, constant learning rates will only help the weights to update with contributions
from dense features and not from sparse features. Weights updates for sparse
features will be negligible as gradients will be small which demands for a higher
learning rate. If a constant higher learning rate is used, the weights are updated
with large steps and convergence will never happen. Thus we require different
learning rates. Sparse features demand large learning rates, while dense
features require smaller learning rates.
Previous Lecture: Adagrad’s accumulated squared-gradient term (alpha) can become very large, resulting in slow
convergence; Adadelta/RMSProp instead use an exponentially decaying average (eda) of squared gradients, with
eda(0) = 0
51.13 ADAM
t = time step
51.14 WHICH ALGORITHM TO CHOOSE WHEN?
Type of Optimizers:
Mini batch SGD, NAG, Adagrad, Adadelta, RMSProp, Adam we came across
Source: Link
Considering getting stuck at saddle points or at local minima or local maxima (loss does
not change):
When we train MLPs, we need to monitor Weight updates. Thus we need to check
gradients (each epoch and each weight). This will help us detect vanishing gradients
which easily occurs in the first few layers due to farness from the output layer (gradients
become too small or very large as we move far from the output layer towards input
layer).
For exploding gradients: we can use gradient clipping: All the weights or gradients are
stored in a single vector.
All the gradients in the gradient vector are divided by the vector’s L2 norm (the square
root of the sum of squared gradients); this rescales the gradient vector to norm 1, and multiplying by a
threshold value clips the overall gradient norm to that threshold.
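A minimal tf.keras sketch of gradient clipping via the optimizer's clipnorm/clipvalue arguments (the threshold values are arbitrary):
from tensorflow.keras.optimizers import SGD, Adam
opt_by_norm = SGD(learning_rate=0.01, clipnorm=1.0)      # rescale gradients whose L2 norm exceeds 1.0
opt_by_value = Adam(learning_rate=0.001, clipvalue=0.5)  # clip each gradient component to [-0.5, 0.5]
# pass one of these to model.compile(optimizer=..., loss=...)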
For multi-class classification using Logistic Regression we use One versus Rest method.
But can we do something else, like extending the basic math of Log. Reg. Extension of
Logistic regression to Multi-Class classification results in softmax. Recap: Output of the
Logistic regression network is the probability of y_i = 1 given x_i.
Summation of the softmax outputs = 1
In Logistic Regression, we optimize log loss for binary classification,
y_ij is 0 for all classes other than true class which has y_ij = 1
Regression: Squared Loss, 2 class classification: 2 class log loss, k-class classification: Multi
class log loss
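A minimal NumPy sketch of softmax and the multi-class log loss (the scores and one-hot label are made up):
import numpy as np
def softmax(z):
    z = z - np.max(z)                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
scores = np.array([2.0, 1.0, 0.1])    # raw scores (logits) for 3 classes
p = softmax(scores)                   # probabilities, sum to 1
y_onehot = np.array([1, 0, 0])        # true class is class 0
loss = -np.sum(y_onehot * np.log(p))  # multi-class log loss for this point
print(p, loss)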
Neural network which performs dimensionality reduction (better than PCA and tSNE
sometimes) (tSNE tries to preserve the neighborhood)
Given x_i: as we reduce the number of neurons in the next layer dimensionality reduces.
Let us take a simple encoder with three layers: Input Layer: 6 neurons, Hidden: 3
neurons, Output: 6 neurons
Thus if we have input ~ output in an auto encoder, then we have the compression part
reducing the dimensionality without losing variance or information.
Denoising autoencoder: Even though we have noise in input, at the output we will get
denoised representation of the input as the dimensionality is reduced in the intermediate
hidden layers.
If linear activations are used or a single sigmoid hidden layer, the optimal solution to an
autoencoder is closely related to PCA.
Focus word: sat: The, cat, on, the, wall are context words
If Focus word: cat: The, sat, on, the, wall are context words
Idea: context words are useful for understanding focus words and vice-versa.
Structure:
Take all of the text dataset, create focus word - context words dataset, train the
neural network with above structure on this dataset.
Input: v-dimensional one hot encoded focus word. Hidden layer with N dimension with
linear activations. Use multi output Softmax layers which gives context words as output
corresponding to each output layer that are stacked to form a single output layer (k
softmax layers).
Negative sampling: instead of updating the weights for every word, update the weights for the
target (context) words and for a small sample of non-target words (the non-target words are
selected with a probability related to how frequently each word occurs).
52
DEEP LEARNING: TENSORFLOW AND KERAS.
____________________________________________________________________________
____________________________________________________________________________
Tensorflow/ Keras: Tool kits or libraries that enable us to code for Deep Learning
- Helps researchers, for developers and for deployment engineers also (two
different tasks)
- Core of Tensorflow was written in C/C++ for speed, they made interfaces
available in Java, Python and JavaScript
- Flow may be inspired from forward and backward propagation (flow of data)
- With Keras: simple (similar to SKLearn); high level NN library: faster for
developing and deploying models; less lines of code
____________________________________________________________________________
____________________________________________________________________________
RAM is connected to CPU through motherboard, Cache is like RAM on chip, The processors
can communicate with Cache faster than with RAM, With multiple cores multiple calculations
are done at a time in parallel. Around 2 GHz clock speed.
GPU: multiple processors: 1024, 512, etc. Each GPU core is slower than a CPU core, around
400 MHz clock speed. Every unit has a processor and a cache. This is a distributed structure. The sum of
all cache in GPU is called as VRAM (Video RAM). Takes data from RAM, distributes across all
units, and the processor unit works very fast on its corresponding Cache data.
GPUs are fast if we have parallelizable tasks such as doing Matrix operations as each of the
calculations are not dependent among themselves. Because of these characteristics of GPUs,
Deep Learning is experiencing developments.
____________________________________________________________________________
____________________________________________________________________________
Link: colab.research.google.com
Cloud computing: shared computing resources; bunch of computers tied together for ready
access over internet
____________________________________________________________________________
____________________________________________________________________________
https://www.tensorflow.org/install/install_windows
____________________________________________________________________________
____________________________________________________________________________
https://www.tensorflow.org/get_started/
https://learningtensorflow.com/
https://cloud.google.com/blog/products/gcp/learn-tensorflow-and-deep-learning-without-
a-phd
____________________________________________________________________________
____________________________________________________________________________
Using Tensorflow:
A placeholder can be imagined to be a memory unit; where values of some data points
are stored for computations.
o W = tf.Variable(tf.zeros([784,10]))
o b = tf.Variable(tf.zeros([10]))
o y = tf.nn.softmax(tf.matmul(x,W)+b)
o y_ = tf.placeholder(tf.float32, [None, 10])
o tf.global_variables_initializer().run()
o Metrics evaluations
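A hedged completion of the above, in the style of the classic TF 1.x MNIST softmax example (assumes `mnist` has already been loaded with the old tutorials input_data helper and that a TF 1.x / tf.compat.v1 API is in use):
import tensorflow as tf   # TF 1.x style API
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)   # mnist assumed loaded beforehand
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))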
____________________________________________________________________________
a. Importing libraries
____________________________________________________________________________
____________________________________________________________________________
https://drive.google.com/file/d/1tIEOPJiJtMzStFai47UyODdQhyK9EQnQ/view
____________________________________________________________________________
____________________________________________________________________________
1. Definition of keywords
Hyperparameter: A parameter of the learning algorithm that is not learned by the model; it remains
constant during model training. The values of the model parameters are derived
from training. Examples of hyperparameters: number of neurons in each layer, the learning rate.
Examples of model parameters: the weights and biases. Given the hyperparameters, the training
algorithm learns the parameters from the training data.
https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)
Hyperparameter tuning: the procedure through which we derive a tuple of hyperparameters that
yields an optimal model, i.e. one that minimizes a predefined loss function on held-out
(validation) data. The model with the chosen hyperparameters is then trained on the
training set to update its weights.
2. Detailed explanation
Neural Networks are so flexible that this becomes a drawback: there are many hyperparameters that
can be tweaked. Even a simple MLP has lots of hyperparameters: number of layers, number of neurons
per layer, type of activation function in each layer, weight-initialization logic, and so on. One
option is to try different combinations of hyperparameters and see which combination works best
on the validation set, using SKLearn’s GridSearchCV or RandomizedSearchCV; Keras models are
wrapped so that they mimic regular SKLearn classifiers. SKLearn tunes hyperparameters to maximize
a score by default, so the loss function needs to be transformed into a score/metric. This way of
tuning is generally time consuming. Efficient toolboxes such as hyperopt, hyperas, skopt,
keras-tuner, sklearn-deap, hyperband and spearmint are available for this purpose. As the search
space is huge, use the following guidelines to restrict it:
Number of hidden layers: gradually increase the number of hidden layers (from 1 up to 50 or 100)
until the model starts to overfit the training set. Beyond this, use transfer learning.
Number of neurons per layer: for the input and output layers it is predetermined by the data.
Randomly pick a number that is a power of 2. Use early stopping and regularization to prevent
overfitting. It is generally preferred to increase the number of layers rather than the number of
neurons per layer, but avoid making the network so small that it lacks the representational
power to model the data.
Learning rate: increase the learning rate from 10^-5 to 10^1 over a few epochs, plot the loss
against the learning rate, and select the LR just below the value at which the loss is minimum.
If the minimum is found at 1, use 0.1 as the LR.
Optimizers, batch_size and activation functions have a fixed set of choices. For the number
of epochs, use a large number and rely on early stopping to stop training.
Note that if you tweak any other hyperparameter after choosing the LR, the LR should be tuned
again. A good approach is to tune the learning rate last, after tweaking all other hyperparameters.
3. Video Lecture
Multi Layered Perceptrons have a lot of hyperparameters.
a. Number of Layers
b. Number of activation units in each layer
c. Type of activation: relu, softmax
d. Dropout rate, etc.
How do you do hyperparameter tuning for MLP?
Scikit-learn has two algorithms for this purpose: GridSearchCV and RandomizedSearchCV; we
used them a lot in the Machine Learning assignments.
Keras models need to be wrapped so they can be used with these algorithms: build the DL model
in Keras and tune it with SKLearn’s hyperparameter search.
Code: Hyperparameter tuning on the activation type of a Keras model using SKLearn.
Procedure:
- We are tuning only the activation type (comparing ReLU and Softmax).
- Assume all important libraries are imported, the datasets are loaded, and everything is ready
for hyperparameter tuning.
- The model is defined by the function best_hyperparameters(). It is a Sequential model for the
MNIST dataset, which has 784 columns. The model has 4 layers: the input layer has 784 neurons,
there are two hidden layers with 512 and 128 neurons respectively, and the last layer is the
output layer with output_dim neurons. The activation type is the input to this function.
- The model is compiled with categorical_crossentropy loss, accuracy as the metric and the
'adam' optimizer.
- The activation choices are softmax and relu. Softmax is just a multi-valued version of sigmoid.

from keras.optimizers import Adam, RMSprop, SGD

def best_hyperparameters(activ):
    '''
    Defines and returns a Sequential model whose hidden-layer activation is the input string.
    Input: string (activation type)
    Output: compiled Keras model
    '''
    model = Sequential()
    model.add(Dense(512, activation=activ, input_shape=(input_dim,),
                    kernel_initializer=RandomNormal(mean=0.0, stddev=0.062, seed=None)))
    model.add(Dense(128, activation=activ,
                    kernel_initializer=RandomNormal(mean=0.0, stddev=0.125, seed=None)))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  metrics=['accuracy'],
                  optimizer='adam')
    return model
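A minimal sketch (assumed names: X_train, Y_train are the prepared MNIST arrays; epoch and batch
values are illustrative) of wrapping best_hyperparameters() so that SKLearn's GridSearchCV can
tune the activation type:

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

clf = KerasClassifier(build_fn=best_hyperparameters, epochs=5, batch_size=128, verbose=0)
param_grid = {'activ': ['relu', 'softmax']}        # hyperparameter passed to build_fn
grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, Y_train)
print(grid_result.best_score_, grid_result.best_params_)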
4. Summary of comments
a. Softmax and ReLU are the choices for the activation function.
b. Keras now has a dedicated hyperparameter tuning library (Keras Tuner). It is better than
keras.wrappers.
Please go through the video of sentdex and documentation
https://keras-team.github.io/keras-tuner/
https://github.com/keras-team/keras-tuner
https://www.youtube.com/watch?v=vvC15l4CY1Q
c. In classical ML we plotted the train acc. and cv acc. both to check overfitting of our
model. While in GridSearchCV or Talos, we just get the parameters for which CV acc. is
highest. Then, how are we sure that we're not overfitting?
I mean suppose I get a hyperparameter set (h1) from GridSearch with highest cv acc. = 86
% but the train acc. for h1 = 99%. Now, if another hyperparameter set (h2) exists such
that cv acc. = 84% and train acc. = 87%.
Now, acc. to what we learnt in classical ML, h2 is a better set of hyperparameters as cv
acc. of h2 ~ cv acc. of h1 but the difference between the train and cv acc. for h2 (3%) is
much less than that for h1 (13%). Hence, h2 performs better generalization on unseen
data. But, from clf.best_estimators() we're going to get h1 as it has the highest cv acc.
How to deal with this problem?
Ans. You have to consider that hyperparameter which gives highest CV accuracy in cross
validation. Whichever gives highest CV score, that will be the best
5. Additional material
a. https://github.com/autonomio/talos
b. One of our students built a case study and wrote a blog on this exact topic. It's a
two-part blog that you can read here:
Part I: https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-
network-architectures-part-i-hyper-parameter-8129009f131b
Part II: https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-
network-architectures-part-ii-hyper-parameter-42efca01e5d7
c. About Hyperparameter Tuning in Keras:
https://towardsdatascience.com/hyperparameter-optimization-with-keras-
b82e6364ca53
d. https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-
python-keras/
6. QAs
a. What can go wrong if you tune hyperparameters using the test set?
b. Can you list all hyperparameters you can tweak in a basic MLP? How could you tweak these
hyperparameters to avoid overfitting?
53
DEEP LEARNING: CONVOLUTIONAL NEURAL NETS.
____________________________________________________________________________
____________________________________________________________________________
Most popular for visual tasks: images; example: MNIST, Object Recognition...
Researchers found that certain neurons fire in response to certain image features; in human
brains certain visual areas have specialized functional properties.
Some neurons in the visual cortex fire when presented with a line at a specific orientation.
Different regions of the brain are responsible for edge-detection, motion, depth, color,
shapes, faces
____________________________________________________________________________
____________________________________________________________________________
Primary visual cortex performs Edge detection. Here we are going to understand CNN through
edge detection.
Say we have a 6x6 matrix with 3 rows black and 3 rows white. Thus we have an edge at third-
fourth row interface. Grayscale image pixel intensities - 0: black and 255: white.
[Figure: 6x6 image * Sobel filter (horizontal edge detection) = convolved output]
With convolutions, we have element wise multiplications and summation over them to get an
element for a new matrix. The scope of the multiplications is determined by the size of the filter.
We will have a new matrix of size (n-k+1, n-k+1): n - size of input matrix, k – size of filter matrix
The above Sobel filter is used for detecting horizontal edges. Transpose of this matrix can be
used to detect vertical edges.
Link: https://en.wikipedia.org/wiki/Sobel_operator
Such specialized filters are applied to extract different features of images for visual tasks in CNNs.
Convolution: Element wise multiplications and addition, dot products on matrices (generally)
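A small NumPy sketch of the edge-detection example above (assumed values: a 6x6 image with 3 black
rows above 3 white rows, convolved with a horizontal Sobel filter); it illustrates the element-wise
multiply-and-sum and the (n-k+1, n-k+1) output size:

import numpy as np

image = np.vstack([np.zeros((3, 6)), np.full((3, 6), 255.0)])   # 0 = black, 255 = white
sobel_h = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]], dtype=float)

n, k = image.shape[0], sobel_h.shape[0]
out = np.zeros((n - k + 1, n - k + 1))          # output size (n-k+1, n-k+1) = (4, 4)
for i in range(n - k + 1):
    for j in range(n - k + 1):
        # element-wise multiply the k x k patch with the filter and sum
        out[i, j] = np.sum(image[i:i+k, j:j+k] * sobel_h)

print(out.shape)   # (4, 4)
print(out)         # nonzero values only in the rows around the black/white boundary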
____________________________________________________________________________
____________________________________________________________________________
In the previous topic, we had an input image of size 6x6 and the output array was of size 4x4. If
we want the output array to be a different size, we go for padding, which adds rows and columns
at the top, bottom, left and right. Zero padding can generate extra (artificial) edges at the
border; we can instead use same-value padding. With padding of size p, we will have a final
matrix of size (n-k+2p+1, n-k+2p+1).
Strides will help us skip rows and columns by a value equal to strides (s). We will have an
output matrix of size (int( (n-k)/s) + 1, int( (n-k)/s) + 1)
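A tiny helper illustrating the output-size formulas above (my own sketch, not from the notes):

def conv_output_size(n, k, p=0, s=1):
    """Output spatial size for an n x n input, k x k filter, padding p, stride s."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(6, 3, p=0, s=1))  # 4 -> (n-k+1)
print(conv_output_size(6, 3, p=1, s=1))  # 6 -> (n-k+2p+1), same-size output
print(conv_output_size(6, 3, p=0, s=2))  # 2 -> int((n-k)/s) + 1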
____________________________________________________________________________
____________________________________________________________________________
It can also be thought of as three images stacked one over the other, resulting in a 3D
tensor; these multiple images are called channels, so each image has size n x m x c,
where c is the number of channels. The convolution filter is then also a 3D tensor; care should
be taken that the number of channels in the filter equals the number of channels in the image.
Convolution of a 3D image (n x n x c) with a 3D filter (k x k x c) results in a 2D array
(output image) of size (n-k+1, n-k+1).
____________________________________________________________________________
____________________________________________________________________________
Link: http://www.iro.umontreal.ca/~bengioy/talks/DL-Tutorial-NIPS2015.pdf,
http://cs231n.github.io/convolutional-networks/#pool
Recap: in an MLP, the dot product of the inputs and weights plus the bias is passed through an
activation function.
At each convolution layer we will have multiple kernels to learn different features of the
images. For each kernel we will have a 2D array output, multiple kernels result in multiple
output arrays (padded to get input array size), so at the layer we will get an output array of
size n x n x m, where m is number of kernels (m is a hyper parameter)
For every element in the output of filtering, the activation function is applied
Pad, convolve and activate to transform an input array to an output array in a convolution
layer
Multiple layers of convolutions are used, at each layer we will extract features:
MLP and convolution layers have similarity in terms of weights: kernels, while we train the
models to learn weights in MLPs, we train the models to learn kernels in Conv Nets
____________________________________________________________________________
53.6 MAX-POOLING.
____________________________________________________________________________
Pooling introduces a small amount of location invariance, scale invariance and
rotational invariance.
Pooling subsamples input arrays to reduce the computational load, memory usage and number of
parameters (limiting the risk of overfitting).
We can also have a goal of equivariance, where a small change in the input array produces a
corresponding small change in the output (invariance: a small change in the input image does
not change the output array).
Global Average Pooling: computes mean of entire map; gives out a single scalar
____________________________________________________________________________
Max pooling back propagation: the gradient propagates only to the maximum value. Say we max-pool
a 2x2 matrix to get a 1x1 scalar; when we back-propagate through max pooling, the gradient over
the 2x2 matrix is either 0 or 1: it is 1 where the maximum is present and 0 everywhere else.
The gradient is 1 because the maximum value is passed through unchanged as the output, and the
non-max values have no effect on the output.
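A small NumPy sketch of gradient routing through max pooling for a single 2x2 window (the
upstream gradient value is assumed for illustration):

import numpy as np

x = np.array([[1.0, 3.0],
              [2.0, 0.5]])
pooled = x.max()                      # forward pass: 3.0

upstream_grad = 1.0                   # dL/d(pooled), assumed for illustration
# route the whole upstream gradient to the position of the maximum, zero elsewhere
grad_x = (x == x.max()).astype(float) * upstream_grad
print(grad_x)
# [[0. 1.]
#  [0. 0.]]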
____________________________________________________________________________
Receptive field: when passing the image through a filter, the set of pixels the filter is applied
to at any instant is the receptive field at that instant.
Effective receptive field: in a deep CNN, a filter at a deeper layer indirectly depends on a larger
region of the input image. This region of the input, over which the stacked convolutions have been
applied, is the effective receptive field (e.g. two stacked 3x3 convolutions with stride 1 give an
effective receptive field of 5x5 on the input). A filter in layer 2 has its receptive field in the
output of the first layer and its effective receptive field in the input image.
____________________________________________________________________________
____________________________________________________________________________
Link: https://world4jason.gitbooks.io/research-
log/content/deepLearning/CNN/Model%20&%20ImgNet/lenet.html
The concepts behind these techniques are old, but large datasets and compute power were not
available earlier.
LeNet: small depth
____________________________________________________________________________
____________________________________________________________________________
Link: https://en.wikipedia.org/wiki/ImageNet
Contributed the most to Deep Learning; This dataset became the benchmark dataset for all new
DL algorithms
____________________________________________________________________________
____________________________________________________________________________
We want CNN models to be robust to changes in the input image, such as translation, scaling,
mirroring, etc. The input images are therefore passed through data augmentation (e.g. matrix
transformations) to generate new images.
Link: https://github.com/albumentations-team/albumentations
Augmentation can introduce invariance in the CNN and creates a larger dataset when we only
have a small one.
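A minimal sketch using the albumentations library linked above; the specific transforms,
probabilities and dummy image are illustrative choices, not from the notes:

import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),          # mirroring
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)   # dummy image
augmented = transform(image=image)['image']   # a new augmented image of the same shape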
____________________________________________________________________________
____________________________________________________________________________
Links: https://keras.io/layers/convolutional/
https://keras.io/layers/pooling/
https://keras.io/layers/core/
____________________________________________________________________________
53.13 ALEXNET
53.14 VGGNET
____________________________________________________________________________
LeNet: 2 Conv, 2 Mean Pool, 3 FC – roughly 60K trainable params, sigmoid activation: first
architecture for handwritten-digit image classification
AlexNet: 5 Conv, 3 Pool, 3 FC – roughly 60M trainable params, ReLU activation, includes Dropout:
trained on ImageNet
VGGNet: (2Conv+1MaxPool)*2 + (3Conv+1MaxPool)*3 + 3FC + Softmax – roughly 138M trainable params
Additional:
- The Softmax classifier gets its name from the softmax function, which is used to
squash the raw class scores into normalized positive values that sum to one, so that
the cross-entropy loss can be applied
____________________________________________________________________________
____________________________________________________________________________
ResNets (Residual Networks): in regular (plain) networks, as depth increases beyond a point both
training error and test error increase. This was tackled using ResNets. Skip connections: the
output of an earlier layer is added to the input of a later layer, in addition to flowing through
the intermediate layers. If the intermediate layers learn nothing useful, the block can simply
pass its input through unchanged (for non-negative activations, ReLU(x) = x), so useless layers
are effectively neglected (a kind of regularization). With ResNets we can add additional layers
such that performance does not get worse; if the new layers are useful, performance increases
because the skip is not needed. The input array dimensions must match when using skip
connections. ResNets are used to avoid the drop in performance when the number of layers
increases.
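A minimal sketch (assumed filter count and kernel size) of a residual block in the Keras
functional API: the block's input is added back to its output before the final activation, so
the block can fall back to the identity mapping.

from keras import layers

def residual_block(x, filters=64):
    shortcut = x                                              # skip connection
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([y, shortcut])                           # dimensions must match
    return layers.Activation('relu')(y)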
____________________________________________________________________________
____________________________________________________________________________
Inception Network: uses multiple filters of different sizes at a layer and stacks the output of
each filter into a single output. This lets us take advantage of different filter/kernel sizes,
but the number of computations is large for each filter: as we increase the number of kernels at
each layer, the computation grows to the order of billions of operations per layer. Adding an
intermediate 1x1 convolution layer (a bottleneck) before each larger filter reduces the number
of computations.
http://www.ashukumar27.io/CNN-Inception-Network/
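Illustrative arithmetic for the 1x1 bottleneck idea (the commonly used 28x28x192 input with
32 5x5 filters example; these numbers are not from the notes):

# direct 5x5 convolution: every output value needs 5*5*192 multiplications
direct = 28 * 28 * 32 * (5 * 5 * 192)
# 1x1 bottleneck down to 16 channels, then the 5x5 convolution on the reduced volume
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)
print(direct, bottleneck)    # 120422400 vs 12443648, roughly 10x fewer multiplications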
____________________________________________________________________________
____________________________________________________________________________
Transfer Learning: using an already trained neural network on a new dataset instead of building
an NN from scratch for the visual task. Pre-trained models are readily available in Keras and
Tensorflow. A model pre-trained on one dataset can be applied to another related dataset; for
example, a model trained on cars can be adapted to recognize trucks. Note that it takes a lot
of time to train a model from scratch.
Link: http://cs231n.github.io/transfer-learning/
In practice, very few people train an entire Convolutional Network from scratch (with random
initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is
common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2
million images with 1000 categories), and then use the ConvNet either as an initialization or a
fixed feature extractor for the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-
connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet),
then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. In an AlexNet,
this would compute a 4096-D vector for every image that contains the activations of the hidden layer
immediately before the classifier. We call these features CNN codes. It is important for performance
that these codes are ReLUd (i.e. thresholded at zero) if they were also thresholded during the
training of the ConvNet on ImageNet (as is usually the case). Once you extract the 4096-D codes for
all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top
of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by
continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s
possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune
some higher-level portion of the network. This is motivated by the observation that the earlier
features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors)
that should be useful to many tasks, but later layers of the ConvNet become progressively more
specific to the details of the classes contained in the original dataset. In case of ImageNet for
example, which contains many dog breeds, a significant portion of the representational power of the
ConvNet may be devoted to features that are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on
ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of
others who can use the networks for fine-tuning. For example, the Caffe library has a Model
Zoo where people share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should
perform on a new dataset? This is a function of several factors, but the two most important ones
are the size of the new dataset (small or big), and its similarity to the original dataset (e.g.
ImageNet-like in terms of the content of images and the classes, or very different, such as
microscope images). Keeping in mind that ConvNet features are more generic in early layers
and more original-dataset-specific in later layers, here are some common rules of thumb for
navigating the 4 major scenarios:
1. New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to
fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we
expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best
idea might be to train a linear classifier on the CNN codes.
2. New dataset is large and similar to the original dataset. Since we have more data, we can have more
confidence that we won’t overfit if we were to try to fine-tune through the full network.
3. New dataset is small but very different from the original dataset. Since the data is small, it is likely
best to only train a linear classifier. Since the dataset is very different, it might not be best to train
the classifier from the top of the network, which contains more dataset-specific features. Instead, it
might work better to train the SVM classifier from activations somewhere earlier in the network.
4. New dataset is large and very different from the original dataset. Since the dataset is very large, we
may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often
still beneficial to initialize with weights from a pretrained model. In this case, we would have enough
data and confidence to fine-tune through the entire network.
Practical advice. There are a few additional things to keep in mind when performing Transfer
Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be
slightly constrained in terms of the architecture you can use for your new dataset. For example, you
can’t arbitrarily take out Conv layers from the pretrained network. However, some changes are
straight-forward: Due to parameter sharing, you can easily run a pretrained network on images of
different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward
function is independent of the input volume spatial size (as long as the strides “fit”). In case of FC
layers, this still holds true because FC layers can be converted to a Convolutional Layer: For example,
in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x512]. Therefore, the FC
layer looking at this volume is equivalent to having a Convolutional Layer that has receptive field size
6x6, and is applied with padding of 0.
Learning rates. It’s common to use a smaller learning rate for ConvNet weights that are being fine-
tuned, in comparison to the (randomly-initialized) weights for the new linear classifier that computes
the class scores of your new dataset. This is because we expect that the ConvNet weights are
relatively good, so we don’t wish to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random initialization).
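A minimal Keras sketch (assumed input shape and class count) of the fixed-feature-extractor
scenario: a VGG16 base pre-trained on ImageNet is frozen and a small classifier is trained on
top of it. For fine-tuning, the base (or only its later blocks) would be unfrozen and the model
recompiled with a small learning rate, as discussed above.

from keras.applications import VGG16
from keras import models, layers

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze the pretrained conv layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'), # assumed: 10 classes in the new dataset
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])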
____________________________________________________________________________
____________________________________________________________________________
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
____________________________________________________________________________
____________________________________________________________________________
IPYNB
____________________________________________________________________________
____________________________________________________________________________
If the dataset size is small, use pre-trained models with fine-tuning on around 100 images per
person.
Pipeline: collect data in various conditions → data augmentation → transfer learning → use
categorical log loss (cross entropy with Softmax) → try cloud APIs for faster training
(Microsoft Azure, Google Images, etc.)
____________________________________________________________________________
54
DEEP LEARNING: LONG SHORT-TERM MEMORY (LSTMS)
____________________________________________________________________________
____________________________________________________________________________
Suited for sequential data (sequences of words, etc.); in most sentences the order of the words
matters in addition to which words are present.
Vectorization methods for text that are based only on word occurrence discard the sequence (and
hence some semantic) information. Example tasks: Machine Translation (a French sentence to an
English sentence), speech recognition.
In time-series data we get a new observation at every instant (stock market, ride pickups).
Sentences can also have different lengths; we could zero-pad all sentences to a common length,
but this is memory inefficient and leads to a high number of parameters.
RNNs address these needs: each input can be of a different length, and the number of parameters
should be small.
____________________________________________________________________________
____________________________________________________________________________
Recurrent: repeating.
Example: let O5 be the output of the final time step. On O5 we apply an activation function to
get y_i_hat; this activation has its own weight matrix.
Three weight matrices: W for the input, W’ for the previous state output and W’’ before the
final activation function. To make this a repetitive structure, we can have a dummy state (O0)
before the first input. Weights can be initialized using the Xavier/Glorot method.
To get an output, we apply an activation function to the output of the desired time step’s cell.
The same neurons (the same RNN cell and weights) are used over the whole sequence of words in
the sentence; the unrolled RNN is often mistaken for a stack of different layers at different
time steps, which is wrong.
____________________________________________________________________________
____________________________________________________________________________
For RNNs Back propagation through time is used. Back propagation is unrolled over time.
At the end of back propagation the chain of multiplications is long, which results in vanishing
or exploding gradients. The number of layers may be low, but because the computation recurs at
every time step, the length of the chain depends on the length of the sequence.
Comment: the idea of back propagation through time (for one-to-many or many-to-many cases) is
that we calculate ∂L/∂W at each time step and update W. While calculating the gradient we follow
all the paths from the loss function to the weight matrices, e.g.
∂L/∂W = (∂L/∂ŷ2 · ∂ŷ2/∂On · ∂On/∂O(n-1) · … · ∂Ok/∂W)
      + (∂L/∂ŷ1 · ∂ŷ1/∂O(n-1) · ∂O(n-1)/∂O(n-2) · … · ∂Ok/∂W)
      + (terms for the other outputs on which the loss depends).
54.4 TYPES OF RNNS
One to many: Image captioning (Input: Image matrix, Output: Sequence of words)
____________________________________________________________________________
____________________________________________________________________________
Video Lecture
What is the problem with simple RNNs?
Many-to-many, same-length RNN:
yi4 depends a lot on xi4 and O3 and much less on xi1 and O1, due to vanishing gradients.
Simple RNNs cannot capture long-term dependencies, e.g. when the 4th output depends strongly
on the 1st input.
____________________________________________________________________________
54.6 LSTM.
____________________________________________________________________________
Lecture:
LSTM: takes care of long term dependencies as well as short term dependencies
LSTM: Long Short Term Memory
[Figure: simple RNN cell, for comparison with the LSTM cell]
Additional material
a. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
____________________________________________________________________________
54.7 GRU
____________________________________________________________________________
Detailed explanation
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png
- The GRU cell is a simplified version of the LSTM cell and performs about as well as the LSTM
- The two state vectors of the LSTM cell are merged into a single vector
- Gate controller controls the forget gate and the input gate
- There is no separate output gate, the full cell state is given as output
- Usage: keras.layers.GRU()
- Still it is difficult to retain very-long term patterns with these RNNs
Video Lecture
- GRU: Gated Recurrent Unit
- LSTMS were created in 1997, GRUs - 2014
- LSTMs have 3 gates: input, output, forget
- GRUs: simplified version inspired from LSTMs, faster to train, as powerful as
LSTM
- GRUs: 2 gates: reset, update
- The shortcut-like gating structure is what makes it possible to retain long-term dependencies
Summary of comments
Applications of long term dependencies: https://en.wikipedia.org/wiki/Long_short-
term_memory#Applications
- Predicting sales, finding stock market trends
- Understanding movie plots
- Speech recognition
- Music composition
Additional material
a. https://www.slideshare.net/hytae/recent-progress-in-rnn-and-nlp-63762080
b. https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
c. https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-
lstm
____________________________________________________________________________
____________________________________________________________________________
Video Lecture
- In MLPs, we built one layer and then extended it to Deep MLPs
- In CNNs, we built one layer and then extended it to Deep CNNs
- Similarly, with GRUs/LSTMs we extend a single layer to Deep RNNs
- Stack one layer over another to build multiple layers
- The Deep RNN structure, as with other NNs, is problem specific
- The directions of backpropagation and forward propagation can be read off from the direction
of the arrows
- Back propagation occurs across time and also across depth
Summary of comments
- The number of units in each layer of RNN is a hyperparameter that we need to
tune. They're not dependent on the number of words in a sentence.
- Number of words in a sentence == number of times a cell is unfolded along the time axis ==
maximum sentence length among all sentences != number of units
- Seems like you are getting confused between the embedding and padding.
Please note that padding is different from embedding. We make padding to
sentences whereas here embedding is done to words. Suppose we have a
sentence with 8 words and we pad them to 10 by adding two zeros in the end,
we now have a 1*10 vector. Now using embedding layer, if we want to represent
each word using some 5 dimensional matrix, then we will have 1*10*5 . At each
time step we send in 1*5 vector and we send 10 such vectors to the LSTM unit.
We can send sentences without padding as well. But we can send only one
sentence at a time if we don't use padding. But we cannot send two words of
different dimensions into the LSTM, as the lengths of the cell state and hidden state are
fixed (see the sketch after this list).
- Forget gate parameters are learnt by backpropagation.
- Memoization is taken care of by the Tensorflow backend
- During Backprop, weights are updated after all time steps.
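A minimal Keras sketch (assumed sizes) matching the padding/embedding comment above: sentences
padded to length 10, each word embedded into 5 dimensions, then fed to an LSTM layer.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, max_len, embed_dim = 5000, 10, 5      # assumed values
model = Sequential([
    Embedding(vocab_size, embed_dim, input_length=max_len),  # (batch, 10) -> (batch, 10, 5)
    LSTM(32),                                                 # 32 units, a hyperparameter
    Dense(1, activation='sigmoid'),                           # e.g. a binary label per sentence
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])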
Additional material
a. https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-
introduction-to-lstm/
____________________________________________________________________________
____________________________________________________________________________
Detailed explanation
In a regular RNN, at each time step the layer looks only at past and present inputs. In
machine-translation tasks it is important to also look ahead at the words that come next. This
is implemented using two recurrent layers that start from the two ends of the sentence and
combine their outputs at each time step.
Usage: keras.layers.Bidirectional(keras.layers.GRU())
Video Lecture
Bi-directional RNNs:
NLP: input x, output y
In simple RNNs, yi3 is assumed to depend only on xi1, xi2, xi3.
But yi3 may also depend on xi4, xi5, which come after yi3’s time step; a bidirectional RNN also
processes the sequence in reverse, so that information is available.
____________________________________________________________________________
Video Lecture
Additional material
- Derivation of parameters in an LSTM layer
(https://www.youtube.com/watch?v=l5TAISVlYM0&feature=youtu.be)
● LSTM layer with m units:
○ There are four gates:
■ “f” (forget) gate: per unit, its input is the n-dimensional input vector plus
the m-dimensional previous state, plus a bias, so it has (n+m+1) parameters
■ “i” (input) gate: same number of parameters
■ “c” (candidate cell state): same
■ “o” (output) gate: same
■ Parameters: 4(n+m+1) per unit
■ Total: 4m(n+m+1) per LSTM layer with m units, where n = input size
■ = 4(nm + m^2 + m); a quick Keras check follows below
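A quick check of the 4m(n+m+1) formula with Keras (assumed sizes: m = 32 units, n = 16 input
features per time step):

from keras.models import Sequential
from keras.layers import LSTM

m, n = 32, 16
model = Sequential([LSTM(m, input_shape=(None, n))])   # variable-length sequences of n features
print(model.count_params())          # 6272
print(4 * m * (n + m + 1))           # 6272 = 4*(n*m + m**2 + m)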
55
DEEP LEARNING: GENERATIVE ADVERSARIAL NETWORKS (GANs)
____________________________________________________________________________
56
LIVE: ENCODER-DECODER MODELS
____________________________________________________________________________
57
ATTENTION MODELS IN DEEP LEARNING
____________________________________________________________________________
____________________________________________________________________________
Basics of NLP:
Deep Learning: