ML2 Summary
P value: a value below 0.05 means the null hypothesis is rejected and there is a strong relationship between the dependent variable y and the predictor x. Residuals: the difference between the actual value and the value predicted by the model.

Optimal threshold (credit example): Y = 1 not repaid (costly), Y = 0 repaid. FP: L(0,1) = L(truth: repaid, predicted: default) = 10'000 → rejecting a loan (predicting default) that would have been repaid.

Binary tree labels (for 2 classes): Negative (left) and Positive (right). Ternary tree labels (for 3 classes) are written as "A/B/C".
Tuning parameter α balances the model goodness of fit and the model size (complexity); α is chosen by cross-validation.

Building a regression tree: the goal is to minimize the residual sum of squares (RSS), the sum of the squared differences between the observed values and the values predicted by the model. Variance: the spread of a set of data points, i.e. how far each value is from the mean. 1.- Selecting the best split: choose the split whose groups have the most similar target values, so that the outcomes (like having a side effect or not) are more consistent within each group. 2.- Evaluating split quality: the reduction in variance or mean squared error (MSE). 3.- Creating terminal nodes: the predicted value for each leaf is the mean of the target variable for the observations within that leaf. 4.- Recursive splitting: the process is applied to each resulting node until the stopping criteria are met for all nodes. 5.- Pruning: making the tree shorter, for instance to avoid overfitting; it can be guided by cross-validation to find the optimal tree size. Note: only a finite number of α's needs to be considered, since there are only finitely many relevant subtrees, and α is chosen by cross-validation. Regression defaults: m = p/3 candidate variables per split and minimum node size nmin = 5.

Worked example (classification): Gini(old) = 10/60*(1 - 10/60) + 50/60*(1 - 50/60) = 0.167*(1 - 0.167) + 0.833*(1 - 0.833) ≈ 0.28. Total cost: 60*0.28 + 40*0 = 16.8. Best split variable: 16.8 (age) < 47.5 (sex) → split on age. Cost-complexity Cα(T) with α = 0.5: left tree, 3 leaves → 40*0 + 50*0 + 10*0 + 0.5*3 = 1.5; right tree, 2 leaves → 60*0.167 + 40*0 + 0.5*2 = 11. Should we prune? 1.5 < 11.0 → don't prune (these numbers are rechecked in the sketch below).

Signs of overfitting: 1.- Complexity of the tree: if the tree has many splits, it may be too finely tuned to the training data. 2.- Leaf size: if there are many leaves with very few instances in them. 3.- Performance on validation data: if the model performs significantly better on the training data than on validation data. 4.- Lack of pruning: pruning is a technique used to reduce the size of a tree to prevent overfitting; if the tree is not pruned (as indicated by cp = 0, i.e. no complexity penalty for adding another split), it may overfit. 5.- Depth of the tree: the length of the longest path from the root node to a leaf; it measures how many "levels" of decision nodes exist in the tree.
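A minimal plain-Python check of the worked example above. The class counts (10/50 and 40/0), the impurity values, and α = 0.5 are taken from the notes; the notes' 16.8 comes from rounding the Gini value to 0.28, and the 2-leaf cost uses the misclassification rate 10/60 rather than the Gini value.

# Worked example from the notes, recomputed without any libraries.
def gini(counts):
    """Gini impurity sum p_k * (1 - p_k) for the class counts of one node."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

g_old = gini([10, 50])                            # ≈ 0.278, rounded to 0.28 in the notes
total_cost = 60 * g_old + 40 * gini([40])         # ≈ 16.7 (notes: 16.8) -> split on age, since 16.8 < 47.5

# Cost-complexity C_alpha(T) = sum over leaves of n_leaf * impurity_leaf + alpha * (number of leaves).
c_left_3_leaves = 40 * 0 + 50 * 0 + 10 * 0 + 0.5 * 3    # 1.5 (three pure leaves)
c_right_2_leaves = 60 * (10 / 60) + 40 * 0 + 0.5 * 2    # 11.0 (misclassification rate 10/60 in the impure leaf)
print(total_cost, c_left_3_leaves, c_right_2_leaves)     # 1.5 < 11.0 -> keep the split (don't prune)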
Cost-complexity pruning is used to avoid overfitting. By increasing the α value, you increase the penalty for complexity. The goal is to find the α that minimizes impurity without sacrificing too much predictive accuracy. This is typically done by cross-validation: the tree is pruned at each α, and the α that gives the best cross-validated performance is chosen (see the sketch below).

Random forest: build many trees and then aggregate them, i.e. average over, say, 1000 slightly different trees with slightly different predictions. Advantages: high accuracy, robustness, low bias, feature importance. σ² is the variance of one tree, ρ is the correlation between two trees → how similar two trees are.

Boosting often results in the highest predictive accuracy for "structured/tabulated" data. Warning: in contrast to random forest, boosting will overfit if you add too many trees. Tuning parameters: number of trees M (the most important parameter), learning rate ν (usually the smaller the better, ≤ 0.1), and tree-related parameters (maximal depth of the trees, minimal number of samples per leaf). Both ensembles are sketched below.
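The pruning workflow described above (only finitely many α's matter, and α is picked by cross-validation) can be sketched with scikit-learn's ccp_alpha parameter; the data here is synthetic and only for illustration.

# Minimal sketch: choose the cost-complexity parameter alpha by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Only finitely many alphas need to be considered: one per relevant subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["ccp_alpha"])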
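A hedged sketch of the random-forest and boosting tuning parameters named above; the data is synthetic and the parameter values are illustrative starting points, not recommendations.

# Minimal sketch: the main tuning parameters of a random forest and of boosting.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=0)

# Random forest: average many decorrelated trees grown on bootstrap samples.
rf = RandomForestRegressor(
    n_estimators=1000,      # many slightly different trees
    max_features=1 / 3,     # m = p/3 candidate variables per split (regression heuristic)
    min_samples_leaf=5,     # minimum node size n_min = 5
    oob_score=True,         # out-of-bag estimate of performance on unseen data (R^2 here)
    random_state=0,
).fit(X, y)
print("OOB R^2:", rf.oob_score_)

# Boosting: number of trees M, learning rate nu, and tree-size parameters.
gb = GradientBoostingRegressor(
    n_estimators=500,       # M, the most important parameter; too many trees -> overfitting
    learning_rate=0.05,     # nu, usually the smaller the better (<= 0.1)
    max_depth=3,            # maximal depth of the trees
    min_samples_leaf=5,     # minimal number of samples per leaf
    random_state=0,
).fit(X, y)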
Bootstrap: drawing N samples from the original dataset with replacement.

High variance: sensitive to the training data, potentially capturing too much noise and specific details; performs well on training data but poorly on new or unseen data (overfitting). High bias: makes simplistic assumptions about the data structure, which can lead to generally incorrect predictions (underfitting). Low variance: less sensitive to the training data and therefore more stable or consistent, but may fail to capture complexities in the data. Low bias: makes correct predictions, as it is not overly simplified and tries to capture the true underlying relationship in the data.

The trees are not independent, and reducing the correlation between the trees gives better accuracy. For each split, consider only m randomly selected variables to reduce the correlation by increasing the diversity in the forest. A high correlation means that the trees tend to make very similar decisions or have similar structures, while a low correlation indicates that the trees make more independent decisions.

The OOB error rate is an estimate of the model's performance on unseen data and is computed during the training process. It is calculated from the predictions made by each tree in the random forest ensemble on its out-of-bag samples (the samples not used for training that tree); cf. the oob_score option in the random-forest sketch above. The error rate calculated from the confusion matrix, in contrast, is based on the predictions made by the trained model on the entire dataset.

Topic 3
If one (or several) variable has a much higher variance than the other variables, it will dominate the analysis; i.e., the first PC will essentially just be this variable with the large variance. Does the clustering change if you run it on scaled data? Yes, it does, because clustering algorithms use distances and these depend on the scaling of the dimensions.

PCA: used to analyze large datasets containing a high number q of dimensions/features per observation. Goal: reduce the dimensionality of the data by keeping only the principal components. Many variables (q > 3) are not feasible to visualize, and collinearity (highly correlated variables) or more variables than observations can cause problems. Reduce the dimension while accounting for as much as possible of the variation (= information) in the data.

Covariance: used to understand whether two variables are linked to each other. Positive: if one increases, the other increases as well. Negative: if one increases, the other decreases. Null: there is no relationship; one does not affect the other. In PCA, when two variables have the same distance but in opposite directions, the covariance between them is indeed 0.

Normalization (0-1) and mean centering make the mean zero. The first principal component has the largest variance, the second component the second largest variance, and so on. The vectors are called loadings and the values of the data points are called scores: loadings describe variable contributions to the components, while scores give the positions of the data points in component space.
Steps to perform PCA: 1.- Standardization, 2.- Covariance matrix, 3.- Eigen-decomposition, 4.- Sort by eigenvalues, 5.- Choose your principal components (sketched below).
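A minimal numpy sketch of the five steps listed above, on toy data; it mirrors the recipe and is not a production implementation.

# Minimal sketch: PCA via standardization, covariance matrix and eigen-decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # toy data: 100 observations, q = 4 variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # 1. standardization (mean 0, variance 1)
C = np.cov(Z, rowvar=False)                       # 2. covariance matrix (q x q)
eigvals, eigvecs = np.linalg.eigh(C)              # 3. eigen-decomposition
order = np.argsort(eigvals)[::-1]                 # 4. sort by eigenvalue (largest variance first)
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2                                             # 5. keep the first k principal components
loadings = eigvecs[:, :k]                         # variable contributions to the components
scores = Z @ loadings                             # positions of the data points in PC space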
We call the data scaled if all variables have mean 0 and variance 1. When all variables are positively correlated, the first principal component is often some kind of average of the variables. The other principal components then give important information about the remaining patterns or shapes.

Example: Comp1 is a factor that affects both variables in a positive way, since both loadings are positive and high. Comp2 has opposite signs, which capture the difference between the behavior of the two: if Nestlé has a high performance, Novartis would have a low performance.

The signs of the PC loadings are arbitrary. The loadings are the coefficients assigned to each variable in a principal component, and the direction in which the eigenvector (the principal component) points in the multidimensional space is not important: whether an eigenvector has a positive or negative sign, it still represents the same axis of variation within the data.
Topic 4
How many PCs? 1.- The cumulative proportion of explained variance should be at least 80%. 2.- Keep only the PCs before the elbow (see the sketch below).

Goal of MDS: represent q-dimensional data in a low-dimensional space while preserving the distances between points as much as possible.

Key differences between PCA and MDS: PCA uses quantitative data, MDS handles quantitative and qualitative data; PCA reduces the number of variables in the data while retaining as much information (variance) as possible, whereas MDS aims to visualize the structure of the data by representing the distances or dissimilarities between points.
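A small scikit-learn sketch of the "cumulative proportion of at least 80%" rule above; the data is random toy data, so the chosen number of components is only illustrative.

# Minimal sketch: pick the smallest number of PCs whose cumulative variance reaches 80%.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1   # first index reaching 80%
print(cumulative, n_components)
# pca.components_ holds the loadings; pca.transform(...) returns the scores.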
Euclidean distance: calculated as the square root of the sum of the squared differences between corresponding values of the two sets of points. Manhattan distance: the sum of the absolute differences of their Cartesian coordinates. R: function "dist"; Python: scipy.spatial.distance.pdist (see the sketch below).

The stress criterion S is the goodness-of-fit statistic that MDS tries to minimize. It varies between 0 and 1, with values near 0 indicating a better fit: S = 0%: perfect; S = 5%: good; S = 10%: fair; S = 20%: poor.
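A minimal scipy sketch of the two distances above; the three 2-D points are arbitrary.

# Minimal sketch: pairwise Euclidean and Manhattan distances with pdist.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0], [4.0, 6.0], [0.0, 0.0]])

d_euclidean = squareform(pdist(X, metric="euclidean"))  # sqrt of the sum of squared differences
d_manhattan = squareform(pdist(X, metric="cityblock"))  # sum of the absolute differences
print(d_euclidean)
print(d_manhattan)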
Simple matching distance (SMD): useful for determining how similar two data sets are. SMD = number of variables in which the units disagree / total number of variables, so a value close to 0 indicates high similarity and a value close to 1 indicates little or no similarity.

Jaccard distance (JD): used in situations where 0 and 1 are not equally important/informative (asymmetry); only mutual presence (both = 1) is counted as a match (see the sketch below).

Classical MDS: example: maps. Approximately preserves Euclidean distances. Closely related to PCA. R: cmdscale; Python: sklearn.manifold.MDS(metric=True).
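A small sketch of the two binary distances above, assuming scipy; for 0/1 vectors the Hamming distance equals the SMD (disagreements divided by the number of variables).

# Minimal sketch: simple matching distance vs. Jaccard distance for two binary records.
import numpy as np
from scipy.spatial.distance import hamming, jaccard

u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])

smd = hamming(u, v)     # 2 disagreements out of 5 variables -> 0.4
jd = jaccard(u, v)      # 0/0 matches are ignored -> 2 disagreements / 4 relevant variables = 0.5
print(smd, jd)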
Gower distance: measures how different two records are. The records may contain a combination of logical, categorical, numerical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). R: function "daisy" in package "cluster"; Python: gower.gower_matrix(df).

MDS vs t-SNE: t-SNE is like arranging items on a board so that similar things are close together; it is great for visualizing and clustering data with complex patterns, such as grouping similar images or text documents. Classical MDS is like placing items on a board so that their distances match real-world relationships; it is useful when you want to keep the original distances between data points, as in geographic mapping or network analysis (see the sketch below).
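A brief sketch contrasting the two embeddings described above; the digits data and the subsample of 300 points are just convenient choices to keep the example fast.

# Minimal sketch: metric MDS preserves distances, t-SNE preserves neighborhoods.
from sklearn.datasets import load_digits
from sklearn.manifold import MDS, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:300]                                   # subsample for speed

mds_coords = MDS(n_components=2, metric=True, random_state=0).fit_transform(X)
tsne_coords = TSNE(n_components=2, random_state=0).fit_transform(X)
print(mds_coords.shape, tsne_coords.shape)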
Hierarchical (agglomerative) clustering: first build a hierarchical sequence of nested clusters. Start with n separate clusters (n = number of observations) and merge clusters until only one cluster is left; then choose the number of clusters. Observations that are grouped together at some point cannot be separated again later. By cutting the tree at a certain height, one obtains a number of clusters. The results depend on how we measure distances between observations and between clusters.

Silhouette plot: S(i) large: well clustered. S(i) small: badly clustered. S(i) negative: assigned to the wrong cluster. An average above 0.5 is acceptable (see the sketch below).
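A minimal sketch of complete-linkage agglomerative clustering plus the average silhouette value used as a rule of thumb above; the blob data is synthetic.

# Minimal sketch: agglomerative clustering and the average silhouette score.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

labels = AgglomerativeClustering(n_clusters=4, linkage="complete").fit_predict(X)
print("average silhouette:", silhouette_score(X, labels))   # above ~0.5 is acceptable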
K-means: the cluster centers are the means; these can be arbitrary points in space, i.e. they do not have to be observations. k-medoids: the cluster centers are observations. Medoid: a specific data point in a cluster that has the smallest average distance to all other points in the cluster. Partitioning around medoids (PAM) is the most common k-medoids method.

What happens to the WGSS if the number of clusters becomes larger? The WGSS gets smaller. Use the number of clusters after the last big drop in WGSS: before 4 clusters the WGSS drops a lot with every cluster we add; after 4 clusters the WGSS only gets marginally smaller with every new cluster added, as we start breaking up clusters (see the sketch below). Speed: agglomerative clustering is O(n³), k-means is O(n*k*i*d) (n observations, k clusters, i iterations, d dimensions).

PAM vs k-means: PAM is better at handling outliers than k-means. PAM can work with various types of distances, not just Euclidean, unlike k-means. PAM makes it easier to identify representative objects for each cluster, which is useful for interpretation. Like k-means, the outcome of PAM also depends on the initial values chosen. PAM requires more computational resources than k-means.

GMM-based clustering: the new point x is "within" the normal distribution around μ2, so it is assigned to that component.

Review questions:
3. For non-metric multidimensional scaling, using scaled data gives the same results as using the original (un-scaled) data. (Yes)
4. For k-means, the cluster centers are necessarily close to data points. (No)
5. When applying t-distributed stochastic neighbor embedding (t-SNE), points that are far apart in the original space have a large weight in the Kullback-Leibler divergence which is minimized by t-SNE. (No)
6. We assume that the variance of one variable is a lot larger than the variances of all other variables. In this case, if we use partitioning around medoids (PAM) for clustering, the variable with the largest variance will dominate the analysis. (Yes)
7. In binary classification, linear discriminant analysis and logistic regression are called "linear" methods since the decision boundary depends linearly on the predictor variables. (Yes)
8. A dendrogram obtained from applying the complete-linkage clustering method always has a depth of n/2, where n denotes the number of samples. (No)
9. A random forest consisting of 50 trees is grown. The split chosen in the root node does not have to be the same for every tree. (Yes)
10. A benefit of a random forest regressor compared to a single regression tree is that it reduces the bias of the predictions. (No)
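A small sketch of the WGSS elbow rule above; scikit-learn exposes the within-group sum of squares as inertia_, and the four-blob data is synthetic, so the drop after k = 4 is built in.

# Minimal sketch: WGSS (inertia_) shrinks as k grows; look for the last big drop.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

wgss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 9)}
print(wgss)   # drops sharply up to k = 4, then only marginally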
11. When applying PCA to unscaled data, the 1st principal component is likely to have a high contribution from the variable that has the highest variance in the dataset. (Yes)

DBSCAN: a clustering method based on density. It identifies clusters as areas of high density separated by areas of low density. Advantages of DBSCAN: arbitrary shapes: it can find clusters of any shape; outlier resistance: DBSCAN is not easily affected by outliers; no need for a cluster number: unlike other methods, you don't have to specify the number of clusters in advance. Small ε: all points are noise points. Large ε: you could end up with just one big group, because most points will be close to each other and all points become core points. DBSCAN is good for finding points that don't fit into any group (see the sketch below).

In logistic regression, when all coefficients (including the intercept term) are 0, the model predicts the log-odds of the outcome as 0. Since logistic regression uses the logistic function to map predicted log-odds to probabilities, this results in a probability of 0.5, not 1.
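A minimal sketch of the ε effect described above; the two-moons data and the three ε values are arbitrary illustrations.

# Minimal sketch: tiny eps -> everything is noise, huge eps -> one big cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for eps in (0.01, 0.2, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(eps, "clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))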
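A one-line check of the logistic-regression statement above: with all coefficients and the intercept equal to 0, the log-odds are 0 and the logistic function returns 0.5.

# Minimal check: sigmoid(0) = 0.5.
import numpy as np

log_odds = 0.0                          # intercept + sum(beta_j * x_j) with all betas = 0
probability = 1 / (1 + np.exp(-log_odds))
print(probability)                      # 0.5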
The pruning of regression trees reduces their variance and increases their bias.
(Yes)
Classical multidimensional scaling (MDS) minimizes the distances between the
reconstructed locations. (No)