ML2 Summary — Machine Learning

Topic 1 – Regression and Classification

Data Matrix: a table with variables in columns and observations in rows. n rows ⟺ n observations; q columns ⟺ q variables.
Y = response variable, X = predictor variables.

Linear Regression: Y = β0 + β1x1 + ... + βKxK + ε. It assumes a linear relationship and a normally distributed dependent variable; this assumption is violated for binary classification.
Example: Score = β0 + β1 × HoursStudied + ε
- β0 = intercept: the expected test score when hours studied is zero.
- β1 = coefficient of hours studied: how much the test score is expected to change with each additional hour of study.
- ε = error term: accounts for other factors affecting the test score (natural aptitude, exam difficulty, ...).

Residuals: the difference between the actual value and the value predicted by the model.
P-value: a value below 0.05 means the null hypothesis is rejected and there is a strong relationship between the dependent variable y and the predictor x.

Logistic Regression (Bernoulli response): η(x) = β0 + β1x1 + ... + βpxp is the linear predictor. The logistic function transforms this linear combination of inputs into a value between 0 and 1.
Example: β0 = −1.75, β1 = 0.011, x1 = 50 → η = −1.75 + 0.011 × 50 = −1.2
p = 1 / (1 + e^1.2) ≈ 0.231 → 23.1% predicted probability of winning an Oscar (e ≈ 2.71828). Since this is below 50%, we predict "no Oscar".
Decision rule: predict WinsOscar if P(y = WinsOscar | X = x) > δ, where δ = 0.5, i.e. a 50% predicted probability of y being 1.
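A minimal sketch of the Oscar example above, using the coefficients from the notes (β0 = −1.75, β1 = 0.011, x1 = 50); the helper name logistic_prob is mine:

```python
import math

def logistic_prob(eta):
    """Map a linear predictor eta to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-eta))

# Numbers from the Oscar example above
beta0, beta1, x1 = -1.75, 0.011, 50
eta = beta0 + beta1 * x1          # -1.2
p = logistic_prob(eta)            # ~0.231

delta = 0.5                       # decision threshold
prediction = "WinsOscar" if p > delta else "NoOscar"
print(round(eta, 3), round(p, 3), prediction)   # -1.2 0.231 NoOscar
```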
AIC: measures how well a model fits the data it was generated from; the lower the better.

Confusion matrix (be careful with imbalanced data):

                 Truth = 0 (−)   Truth = 1 (+)   Truth = 2
  Pred = 0 (−)      23 (TN)          7 (FN)          6
  Pred = 1 (+)       3 (FP)         27 (TP)          4
  Pred = 2           3                1             26

Error (misclassification) rate = (wrongly classified) / (number of samples)
= (7 + 6 + 3 + 4 + 3 + 1) / 100 = 24 / 100 = 0.24

Optimal threshold (loan example, Y = 1: not repaid — costly, Y = 0: repaid):
FP cost L(0,1) = L(truth: repaid, predicted: default) = 10'000 → rejecting a loan (predicting default) that would have been repaid.
FN cost L(1,0) = L(truth: default, predicted: repaid) = 100'000 → granting a loan (predicting repayment) to a customer who then defaults.

ROC curve: evaluates all possible thresholds.
True positive rate / recall = TP / (TP + FN) — the higher the better.
False positive rate = FP / (FP + TN) — the lower the better.
Precision = TP / (TP + FP) — the higher the better.
At threshold 0, the classifier labels every instance as positive (1).
AUC: how well the classifier separates + from −; the larger the better. Example: AUC = 0.8 means the classifier ranks two randomly chosen test data points (one positive, one negative) correctly with probability 80%.
AUPRC: the larger the better.
ROC compares the false positives to the total number of true negatives (true 0's). PR compares the false positives to the total number of predicted positives (predicted 1's) → more relevant with imbalanced data or when double-checking predicted positives is costly.
Example: in medical diagnosis (many more 0's than 1's) the PR curve is usually more relevant than the ROC curve because of the imbalanced data and the high cost of false positives.
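A sketch tying the numbers above together: the error rate and binary rates from the confusion matrix, plus the standard cost-based decision threshold for the loan losses (the notes only give the two losses; the threshold formula L(0,1)/(L(0,1)+L(1,0)) is the usual decision-theoretic rule, shown here as an illustration):

```python
import numpy as np

# 3-class confusion matrix from the notes: rows = predictions, columns = truth
cm = np.array([[23, 7, 6],
               [3, 27, 4],
               [3, 1, 26]])
error_rate = 1 - np.trace(cm) / cm.sum()        # (7+6+3+4+3+1)/100 = 0.24

# Binary metrics for the class-0 / class-1 block of the table
TN, FN, FP, TP = 23, 7, 3, 27
recall = TP / (TP + FN)                          # true positive rate
fpr = FP / (FP + TN)                             # false positive rate
precision = TP / (TP + FP)

# Cost-based threshold for the loan example: predict "default" (y = 1) whenever
# p > L(0,1) / (L(0,1) + L(1,0)), so costly false negatives push the threshold down
L01, L10 = 10_000, 100_000
optimal_threshold = L01 / (L01 + L10)            # ~0.091, far below 0.5
print(error_rate, recall, fpr, precision, optimal_threshold)
```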
Topic 2 – Trees, Random Forest, Boosting

Tree structure: root node → internal nodes → leaf (terminal) nodes. If the split condition is true, go to the left branch.
Binary tree labels (2 classes): Negative (left) and Positive (right). Ternary tree labels (3 classes) are written as "A/B/C".

Regression trees (predict numbers): continuous or quantitative outcomes; the output is a numerical value. The tree reduces the variation (MSE) of the target variable within each node.

Classification trees (categorize items into classes): categorical outcomes; the output is a class label. Ideally, each node should contain data points that belong to a single class.

Building a regression tree: the goal is to minimize the residual sum of squares (RSS), i.e. the sum of squared differences between the observed values and the values predicted by the model. Variance is the spread of a set of data points, i.e. how far each value is from the mean. Steps:
1. Selecting the best split: choose the split for which the target variable is most similar within each group, i.e. the outcomes (like having a side effect or not) are more consistent within each group.
2. Evaluating split quality: the reduction in variance or mean squared error (MSE).
3. Creating terminal nodes: the predicted value for each leaf is the mean of the target variable over the observations within that leaf.
4. Recursively splitting: the process is applied to each resulting node until the stopping criteria are met for all nodes.
5. Pruning: making the tree shorter, for instance to avoid overfitting; this can be guided by cross-validation to find the optimal tree size. Note: only a finite number of α values needs to be considered, since there are only finitely many relevant subtrees, and α is chosen by cross-validation.
Regression: m = p/3 and minimum node size nmin = 5.

Building classification trees: the partition is built analogously to regression trees. However, both in the greedy algorithm for finding the partition and in the pruning step, the RSS is replaced by the total cost, where the mean squared error is replaced by another impurity measure such as the Gini index. Gini lies in [0, 1]; the lower the better — it evaluates the quality of a split. Classification: m = p and minimum node size nmin = 1.

Tuning parameter α balances model goodness of fit against model size (complexity); α is chosen by cross-validation.

Gini & MCE worked example (assume α = 0.5):
MCE / probability of class "no side effect" for "old": 10 / (10 + 50) = 0.167
Probability of class "side effect" for "old": 50 / (10 + 50) = 0.833
Gini(old) = 10/60 · (1 − 10/60) + 50/60 · (1 − 50/60) = 0.167 · 0.833 + 0.833 · 0.167 ≈ 0.28
Total cost: 60 · 0.28 + 40 · 0 = 16.8
Best split variable: 16.8 (age) < 47.5 (sex) → split on age.
Cost-complexity Cα(T): left (keep the split), 3 leaves → 40·0 + 50·0 + 10·0 + 0.5·3 = 1.5; right (pruned), 2 leaves → 60·0.167 + 40·0 + 0.5·2 ≈ 11.0
Should we prune? 1.5 < 11.0 → don't prune.

Signs of overfitting:
1. Complexity of the tree: if the tree has many splits, it may be too finely tuned to the training data.
2. Leaf size: many leaves containing very few instances.
3. Performance on validation data: the model performs significantly better on the training data than on the validation data.
4. Lack of pruning: pruning reduces the size of a tree to prevent overfitting; if the tree is not pruned (cp = 0, i.e. no complexity penalty for adding another split), it may overfit.
5. Depth of the tree: the length of the longest path from the root node to a leaf; it measures how many "levels" of decision nodes exist in the tree.
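A minimal sketch reproducing the worked example above. The counts (10/50 in the "old" node) are from the notes; treating the other node as a pure node of 40 observations is my reading of the "40 · 0" term, and the cost-complexity comparison uses the misclassification error as impurity, as in the example:

```python
def gini(counts):
    """Gini impurity of a node from its class counts."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def mce(counts):
    """Misclassification error of a node from its class counts."""
    return 1 - max(counts) / sum(counts)

# Side-effect example: node "old" holds 10 "no side effect" and 50 "side effect";
# the other node is assumed pure (40 observations, impurity 0).
g_old = gini([10, 50])                      # ~0.278 (the notes round to 0.28)
total_cost = 60 * g_old + 40 * 0            # ~16.7 (notes: 16.8)

# Cost-complexity pruning with alpha = 0.5
alpha = 0.5
cost_keep  = 40 * 0 + 50 * 0 + 10 * 0 + alpha * 3        # 3 pure leaves -> 1.5
cost_prune = 60 * mce([10, 50]) + 40 * 0 + alpha * 2     # 2 leaves -> ~11.0
print(round(total_cost, 1), cost_keep < cost_prune)      # 16.7 True -> don't prune
```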
Cost-complexity pruning avoids overfitting: increasing α increases the penalty for complexity. The goal is to find the α that minimizes impurity without sacrificing too much predictive accuracy. This is typically done by cross-validation: the tree is pruned at each α, and the α that gives the best cross-validated performance is chosen.

Random Forest: build many trees and then aggregate them, i.e. take the average of, say, 1000 slightly different trees with slightly different predictions. Advantages: high accuracy, robustness, low bias, and feature importance. σ² is the variance of a single tree, ρ is the correlation between two trees → how similar two trees are.

Boosting often gives the highest predictive accuracy for "structured / tabulated" data. Warning: in contrast to Random Forest, boosting will overfit if you add too many trees. Tuning parameters: number of trees M (the most important parameter), learning rate ν (usually the smaller the better, ≤ 0.1), and tree-related parameters (maximal depth of the trees, minimal number of samples per leaf).
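A hedged sketch contrasting the two ensembles; the data are synthetic, sklearn's GradientBoostingClassifier stands in for "boosting", and all parameter values are illustrative, not recommendations from the course:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data, just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Forest: many decorrelated trees, averaged; adding trees does not overfit
rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                            random_state=0).fit(X_tr, y_tr)

# Boosting: the number of trees M and the learning rate are the key tuning
# parameters; too many trees *can* overfit, unlike Random Forest
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te), gb.score(X_te, y_te))
```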
Bootstrap: drawing N samples from the original dataset with replacement.

High variance: the model is sensitive to the training data, potentially capturing too much noise and specific detail; it performs well on the training data but poorly on new or unseen data (overfitting). High bias: the model makes simplistic assumptions about the data structure, which can lead to generally incorrect predictions (underfitting). Low variance: less sensitive to the training data and therefore more stable and consistent, but it may fail to capture complexities in the data. Low bias: makes correct predictions, as it is not overly simplified and tries to capture the true underlying relationship in the data.

The trees in a forest are not independent, and reducing the correlation between the trees gives better accuracy. For each split, only m randomly selected variables are considered, which reduces the correlation by increasing the diversity in the forest. A high correlation means that the trees tend to make very similar decisions or have similar structures, while a low correlation indicates that the trees make more independent decisions.

OOB error rate: an estimate of the model's performance on unseen data, computed during the training process. It is calculated from the predictions made by each tree in the Random Forest ensemble on its out-of-bag samples (the samples not used for training that tree). In contrast, the error rate calculated from the confusion matrix is based on the predictions made by the trained model on the entire dataset.
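A small sketch of the distinction just described: the OOB estimate versus the error obtained by re-predicting the data the forest was trained on (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, just to illustrate the two error estimates
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)
rf.fit(X, y)

oob_error = 1 - rf.oob_score_          # estimate of performance on unseen data
resub_error = 1 - rf.score(X, y)       # error from re-predicting the training set
print(oob_error, resub_error)          # the resubstitution error is optimistic (often ~0)
```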
Topic 3 – PCA

If one (or several) variable has a much higher variance than the other variables, it will dominate the analysis, i.e. the first PC will essentially just be this variable with the large variance. Does the clustering change if you run it on scaled data? Yes, it changes, because clustering algorithms use distances and these depend on the scaled dimensions. We call the data scaled if all variables have mean 0 and variance 1.

PCA: used to analyze large datasets containing a high number q of dimensions/features per observation. Goal: reduce the dimensionality of the data by keeping only the principal components. Many variables are hard to visualize for q > 3; collinearity (highly correlated variables) or more variables than observations can cause problems. We reduce the dimension while accounting for as much as possible of the variation (= information) in the data.

Covariance: describes whether two variables are linked to each other. Positive: if one increases, the other increases as well. Negative: if one increases, the other decreases. Zero: there is no relationship; one does not affect the other. In PCA, when two variables have the same distance but in opposite directions, the covariance between them is indeed 0.

After normalization (0–1) and mean centering, the mean is zero. The first principal component has the largest variance, the second component the second largest variance, and so on. The vectors are called loadings and the values of the data points are called scores: loadings describe the variable contributions to the components, while scores give the positions of the data points in component space.

Steps to perform PCA: 1. Standardization; 2. Covariance matrix; 3. Eigen-decomposition; 4. Sort by eigenvalues; 5. Choose your principal components.

When all variables are positively correlated, the first principal component is often some kind of average of the variables. The other principal components then give important information about the remaining patterns or shapes.

Example (two stocks, Nestlé and Novartis): Comp1 is a factor that affects both variables positively, since both loadings are positive and high. Comp2 has opposite signs, which captures the difference in behavior between the two: if Nestlé performs well, Novartis would perform poorly.

The signs of the PC loadings are arbitrary: the sign of the loadings, i.e. the coefficients assigned to each variable in a principal component, is arbitrary. This means the direction in which the eigenvector (the principal component) points in the multidimensional space is not important; whether an eigenvector has a positive or negative sign, it still represents the same axis of variation in the data.
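A minimal sketch of the PCA steps on synthetic data, showing scores, loadings, and the 80% cumulative-variance rule of thumb used just below for choosing the number of PCs (data and values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small synthetic data matrix: n observations in rows, q variables in columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10                      # one variable with much larger variance

# Standardize (mean 0, variance 1), then apply PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

scores = pca.transform(X_scaled)   # positions of the observations in PC space
loadings = pca.components_         # variable contributions to each PC (rows = PCs)

# Keep enough PCs to reach a cumulative explained variance of at least 80%
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_pc = int(np.argmax(cum_var >= 0.80)) + 1
print(cum_var.round(2), n_pc)
```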

How many PCs?
1. The cumulative proportion of explained variance should be at least 80%.
2. Keep only the PCs before the elbow in the scree plot.

Topic 4 – Multidimensional Scaling, Distances, t-SNE

Goal of MDS: represent q-dimensional data in a low-dimensional space while preserving the distances between points as much as possible.
Key differences between PCA and MDS: PCA uses quantitative data, MDS handles quantitative and qualitative data. PCA reduces the number of variables in the data while retaining as much information (variance) as possible, whereas MDS aims to visualize the structure of the data by representing the distances or dissimilarities between points.

Distances for continuous data:
Euclidean distance: the straight-line distance between two points, calculated as the square root of the sum of the squared differences between corresponding values of the two points. It serves as a way to maintain the distances between data points when mapping them from a higher-dimensional space to a lower-dimensional space, ideally 2D. In summary, we reconstruct the points X from the distances between all points.
Manhattan distance: the sum of the absolute differences of the Cartesian coordinates.
R: function "dist"; Python: scipy.spatial.distance.pdist

Classical MDS (example: maps):
- Approximately preserves Euclidean distances. Closely related to PCA.
- We only keep the q̃ largest eigenvalues and the corresponding eigenvectors.
- How to choose the dimension q̃: an alternative is to look at a scree plot, which shows the eigenvalues in descending order.
- R: cmdscale; Python: sklearn.manifold.MDS(metric=True)

The stress criterion S is the goodness-of-fit statistic that non-metric MDS tries to minimize; it varies between 0 and 1, with values near 0 indicating a better fit. Rule of thumb for judging the fit of non-metric MDS:
- S = 0%: perfect
- S = 5%: good
- S = 10%: fair
- S = 20%: poor

Distances for binary / nominal data:
Simple matching distance (SMD): useful for determining how similar two data sets are. SMD = (number of variables in which the units disagree) / (total number of variables), so a value close to 0 indicates high similarity and a value close to 1 indicates little or no similarity.
Jaccard distance (JD), for Boolean/binary data: used in situations where 0 and 1 are not equally important / informative (asymmetry); only mutual presence (both = 1) is counted as a match. JD = (number of variables in which the units disagree) / (number of variables, ignoring those where both units are 0).

When to use JD over SMD: Jaccard distance is more suitable when dealing with sets of varying sizes. It accounts for the relative size of the intersection and union, making it robust to variations in set sizes. If you are interested in emphasizing the common elements between sets rather than the overall proportion of matching elements, Jaccard distance is the better choice. Example: comparing customers by the products they bought (1 = bought, 0 = did not buy); the 1s are more important.
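A minimal sketch of the two binary distances just defined; the two example "customers" are made up:

```python
def simple_matching_distance(a, b):
    """Fraction of variables in which the two units disagree."""
    disagreements = sum(x != y for x, y in zip(a, b))
    return disagreements / len(a)

def jaccard_distance(a, b):
    """Like SMD, but (0, 0) pairs are ignored (mutual absence is not a match)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == 0 and y == 0)]
    if not pairs:
        return 0.0
    disagreements = sum(x != y for x, y in pairs)
    return disagreements / len(pairs)

# Two customers, 1 = bought the product, 0 = did not buy (made-up example)
u = [1, 0, 0, 1, 0, 0]
v = [1, 1, 0, 0, 0, 0]
print(simple_matching_distance(u, v))   # 2/6 = 0.333
print(jaccard_distance(u, v))           # 2/3 = 0.667
```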

Non-metric MDS:
- Approximately preserves the ranking of distances.
- Gives a low-dimensional representation.
- Slower than classical MDS but relies on fewer assumptions.
- R: isoMDS; Python: sklearn.manifold.MDS(metric=False)

Gower distance: measures how different two records are. The records may contain a combination of logical, categorical, numerical, or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar).
R: function "daisy" in package "cluster"; Python: gower.gower_matrix(df)
Worked example:
binary_diff = 2 → number of binary variables in which the units disagree
categorical_diff = 2 → number of categorical variables in which the units disagree
continuous_diff = |4.0 − 5.5| / 4.1 + |2.2 − 4.0| / 4.3 = 0.3659 + 0.4186 = 0.7845
Gower distance = (binary_diff + categorical_diff + continuous_diff) / cnt_variables = 4.7845 / 8 ≈ 0.598

t-SNE: t-distributed stochastic neighbor embedding is a dimension-reduction technique that is useful for visualizing high-dimensional data. It focuses more on the local structure while still trying to keep (parts of) the global structure. Perplexity is typically set between 5 and 50 and controls the separation.
- Emphasis on preserving small distances; good for visualization of high-dimensional data.
- R: Rtsne in package Rtsne; Python: sklearn.manifold.TSNE
- If p (similarity in high dimension) equals q (similarity in low dimension), then log(p/q) = log(1) = 0, so the cost function becomes zero.
- Large p (similar in high dimension) and small q (dissimilar in low dimension) → big penalty in the cost function.
- Small p (dissimilar in high dimension) and large q (similar in low dimension) → small penalty.
- For this reason, the KL divergence is asymmetric.

MDS vs t-SNE: t-SNE is like arranging items on a board so that similar things end up close together; great for visualizing and clustering data with complex patterns, such as grouping similar images or text documents. Classical MDS is like placing items on a board so that their distances match real-world relationships; useful when you want to keep the original distances between data points, as in geographic mapping or network analysis.
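A manual Gower-style calculation that reproduces the arithmetic of the worked example above. The mismatch counts, continuous values (4.0/5.5 and 2.2/4.0) and ranges (4.1, 4.3) come from the notes; the concrete record layout (3 binary, 3 categorical, 2 continuous variables) is an assumption made to reach 8 variables:

```python
def gower_distance(rec1, rec2, kinds, ranges):
    """Manual Gower distance: average per-variable dissimilarity in [0, 1].

    kinds[i] is 'binary', 'categorical' or 'continuous'; ranges[i] is the
    range of variable i (only used for continuous variables)."""
    total = 0.0
    for x, y, kind, rng in zip(rec1, rec2, kinds, ranges):
        if kind == "continuous":
            total += abs(x - y) / rng
        else:                      # binary or categorical: 0 if equal, 1 if not
            total += 0.0 if x == y else 1.0
    return total / len(rec1)

# Assumed layout: 3 binary, 3 categorical, 2 continuous variables (8 in total)
kinds  = ["binary"] * 3 + ["categorical"] * 3 + ["continuous"] * 2
ranges = [None] * 6 + [4.1, 4.3]
rec1 = [1, 0, 1, "red", "A", "X", 4.0, 2.2]
rec2 = [0, 0, 0, "red", "B", "Y", 5.5, 4.0]
print(gower_distance(rec1, rec2, kinds, ranges))   # ~0.598, as in the notes
```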
Topic 5 – Clustering

Cluster analysis: summarizing data. Focus → Agglomerative: build up clusters from individual observations. Divisive: start with the whole group of observations and split off clusters.

Scaling and distances. Bottom line: whether to scale depends on the context.
- If variables are not scaled, the variable with the largest range has the most weight; distances depend on the scales of the variables.
- Scaling gives every variable equal weight.
- Scale if the variables measure different units (kg, meter, sec, ...) or if you explicitly want equal weight for each variable.
- Don't scale if the units are the same for all variables.
- Often it is better to scale.

Data types: Continuous data, e.g. 3.2 cm, 10.08 cm, 4.8 cm. Categorical data — Binary: 0 or 1, a feature is present or not; Nominal: e.g. red, blue, green, several categories without an ordering; Ordinal: e.g. 1st, 2nd, 3rd, the ordering matters (1st is closer to 2nd than to 3rd, but we don't know by how much).

Distances between observations:
- Euclidean distance (continuous data)
- Manhattan distance (continuous data)
- Simple matching distance (discrete data)
- Jaccard distance (discrete data)
- Gower distance (mixed data)

Distances between clusters (linkage):
1. Single linkage: distance between two clusters = minimal distance over all point pairs from the two clusters; suitable for finding stretched-out clusters.
2. Complete linkage: distance between two clusters = maximal distance over all point pairs from the two clusters; suitable for finding compact but not necessarily well-separated clusters.
3. Average linkage: distance between two clusters = average distance over all point pairs from the two clusters; a compromise between complete and single linkage.

Agglomerative / hierarchical clustering: Why should we cut at the large drop in the dendrogram? Because there the separation/distance between the clusters is the biggest → best clustering.

Interpreting clusters: look at the positions of the cluster centers (= means of the variables per cluster) or at cluster representatives. If scaled data is used, it is often better for interpretability to look at the original data instead of the scaled data. Apply a dimension-reduction technique (such as PCA or t-SNE), plot the reduced-dimensional data, and label/color the points according to the cluster they belong to. In k-means clustering, the mean is the middle point of each group (cluster); it tells you where the center of the group is.
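A short sketch of agglomerative clustering with the linkages listed above, cutting the tree into a chosen number of clusters; the data and the choice of 2 clusters are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up continuous data: 20 observations, 2 (scaled) variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)), rng.normal(5, 1, size=(10, 2))])

d = pdist(X, metric="euclidean")      # pairwise distances between observations

# Distance between clusters: "single", "complete" or "average" linkage
Z = linkage(d, method="average")

# Cut the tree where the merge heights show a large drop -> here, 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```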

Hierarchical clustering: first, build a hierarchical sequence of nested clusters. Start with n separate clusters (n = number of observations) and merge clusters until only one cluster is left; then choose the number of clusters.
- Observations that are grouped together at some point cannot be separated later.
- By cutting the tree at a certain height, one obtains a number of clusters.
- The results depend on how we measure distances between observations and between clusters.

K-means: the cluster centers are the means; these can be arbitrary points in space, i.e. they do not have to be observations. K-medoids: the cluster centers are observations. Medoid: a specific data point in a cluster that has the smallest average distance to all other points in the cluster. Partitioning around medoids (PAM) is the most common k-medoids method.

What happens to the WGSS if the number of clusters becomes larger? The WGSS gets smaller. Use the number of clusters after the last big drop in the WGSS: before 4 clusters, the WGSS drops a lot with every cluster we add; after 4 clusters, the WGSS only gets marginally smaller with every new cluster added, as we start breaking up clusters.

Speed: agglomerative clustering is O(n³); k-means clustering is O(n·k·i·d).

Silhouette plot: S(i) large → well clustered; S(i) small → badly clustered; S(i) negative → assigned to the wrong cluster. An average above 0.5 is acceptable.

PAM vs k-means: PAM is better at handling outliers than k-means. PAM can work with various types of distances, not just Euclidean, unlike k-means. PAM makes it easier to identify representative objects for each cluster, which is useful for interpretation. Like k-means, the outcome of PAM depends on the initial values chosen. PAM requires more computational resources than k-means.
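A small sketch comparing the WGSS (inertia) and the average silhouette for several cluster counts, in the spirit of the elbow and silhouette rules above; the data and parameters are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up data with three blob-like groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])

# Compare WGSS (inertia) and silhouette for several numbers of clusters
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wgss = km.inertia_                       # within-group sum of squares
    sil = silhouette_score(X, km.labels_)    # average silhouette, > 0.5 is acceptable
    print(k, round(wgss, 1), round(sil, 2))
```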
GMM-based clustering: a new point x is assigned to the component it lies "within", e.g. the normal distribution around μ2.

DBSCAN: a clustering method based on density. It identifies clusters as areas of high density separated by areas of low density. Advantages: arbitrary shapes — it can find clusters of any shape; outlier resistance — DBSCAN is not easily affected by outliers; no need for a cluster number — unlike other methods, you don't have to specify the number of clusters in advance. Small ε: all points are noise points. Large ε: you could end up with just one big group, because most points will be close to each other and all points become core points. DBSCAN is good for finding points that don't fit into any group.

In logistic regression, when all coefficients (including the intercept term) are 0, the model predicts the log-odds of the outcome as 0. Since logistic regression uses the logistic function to map predicted log-odds to probabilities, this results in a probability of 0.5, not 1.

Review questions (Yes / No):
- The pruning of regression trees increases their bias and reduces their variance. (Yes)
1. Classical multidimensional scaling (MDS) minimizes the distances between the reconstructed locations. (No)
2. The goal of non-metric multidimensional scaling is to respect both the distances and the ranking of the distances. (No)
3. For non-metric multidimensional scaling, using scaled data gives the same results as using the original (un-scaled) data. (Yes)
4. For k-means, the cluster centers are necessarily close to data points. (No)
5. When applying t-distributed stochastic neighbor embedding (t-SNE), points that are far apart in the original space have a large weight in the Kullback-Leibler divergence which is minimized by t-SNE. (No)
6. We assume that the variance of one variable is a lot larger than the variances of all other variables. In this case, if we use partitioning around medoids (PAM) for clustering, the variable with the largest variance will dominate the analysis. (Yes)
7. In binary classification, linear discriminant analysis and logistic regression are called "linear" methods since the decision boundary depends linearly on the predictor variables. (Yes)
8. A dendrogram obtained from applying the complete-linkage clustering method always has a depth of 2, where n denotes the number of samples. (No)
9. A random forest consisting of 50 trees is grown. The split chosen in the root node does not have to be the same for every tree. (Yes)
10. A benefit of a random forest regressor compared to a single regression tree is that it reduces the bias of the predictions. (No)
11. When applying PCA to unscaled data, the 1st principal component is likely to have a high contribution from the variable that has the highest variance in the dataset. (Yes)
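A minimal DBSCAN sketch illustrating the ε behavior described above (noise points are labeled −1); the data and the eps/min_samples values are made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two dense groups plus a few isolated points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               rng.uniform(-10, 15, size=(5, 2))])

# eps (the ε above) controls the neighborhood size; min_samples the required density
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                 # -1 marks noise points that fit no cluster
print(set(labels), list(labels).count(-1))
```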
