Data Science Cheat Sheet
Last Updated June 19, 2021
Distributions
Discrete
- Poisson - number of successes occurring in a fixed time interval, at an average rate λ, with μ = σ² = λ
Continuous
- Uniform - all values between a and b are equally likely
- Normal/Gaussian N(μ, σ), Standard Normal Z ~ N(0, 1)
- Central Limit Theorem - the sample mean of i.i.d. data approaches a normal distribution
- Normal Approximation - discrete distributions such as Binomial and Poisson can be approximated using z-scores when np, nq, and λ are greater than 10

Logistic Regression
Odds(Y = 1) = P(Y = 1) / (1 - P(Y = 1)) = e^(β0 + β1x1); for example, P = 1/3 corresponds to 1:2 odds
Coefficients are linearly related to the log-odds, such that a one unit increase in x1 multiplies the odds by e^β1

Confusion Matrix
            Predict Yes            Predict No
Actual Yes  True Positive (1 - β)  False Negative (β)
Actual No   False Positive (α)     True Negative (1 - α)

- Precision = TP / (TP + FP) - percent correct when predicting positive
- Recall / Sensitivity = TP / (TP + FN) - percent of actual positives identified correctly (True Positive Rate)
- Specificity = TN / (TN + FP) - percent of actual negatives identified correctly, also 1 - FPR (True Negative Rate)
- F1 = 2 (precision)(recall) / (precision + recall) - useful when classes are imbalanced

ROC Curve - plots TPR vs. FPR for every threshold α. Area Under the Curve measures how likely the model differentiates positives and negatives (perfect AUC = 1, baseline = 0.5).
Precision-Recall Curve - focuses on the correct prediction of the minority class, useful when data is imbalanced.
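As a quick illustration of the metrics above, the sketch below computes precision, recall, specificity, and F1 from raw confusion-matrix counts; the counts are made-up values for illustration.

# Minimal sketch: confusion-matrix metrics from hypothetical counts
tp, fp, fn, tn = 80, 10, 20, 90            # assumed example counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)                    # sensitivity, true positive rate
specificity = tn / (tn + fp)               # true negative rate, 1 - FPR
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, specificity, f1)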
Decision Trees
Classification and Regression Tree (CART)
CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (a small cp leads to a deep tree).
[Figure: example decision tree with a root split at petal width < 0.6]
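A minimal sketch of CART for regression, assuming scikit-learn is available; ccp_alpha plays a role analogous to cp above (larger values prune more aggressively), and the data is synthetic.

# Hypothetical data; DecisionTreeRegressor splits to minimize SSE at each node
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

tree = DecisionTreeRegressor(ccp_alpha=0.01)   # cost-complexity pruning, similar in spirit to cp
tree.fit(X, y)
print(tree.get_depth(), tree.predict([[2.5]]))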
Clustering
k-means - randomly place k centroids across normalized data and assign observations to the nearest centroid; recalculate centroids as the mean of assignments and repeat until convergence. Using the median or medoid (actual data point) may be more robust to noise and outliers. k-modes is used for categorical data.
k-means++ - improves selection of initial clusters
1. Pick the first center randomly
2. Compute distance between points and the nearest center
3. Choose new center using a weighted probability distribution proportional to distance
4. Repeat until k centers are chosen
Evaluating the number of clusters and performance:
- Silhouette Value - measures how similar a data point is to its own cluster compared to other clusters, and ranges from 1 (best) to -1 (worst)
- Davies-Bouldin Index - ratio of within-cluster scatter to between-cluster separation, where lower values are better
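A minimal sketch using scikit-learn's KMeans, which applies the k-means++ initialization described above by default; the two-blob data here is made up, and the silhouette score checks cluster quality.

# Hypothetical example: k-means with k-means++ initialization and silhouette scoring
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(silhouette_score(X, km.labels_))   # closer to 1 indicates well-separated clusters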
Support Vector Machines
Separates classes with the hyperplane that maximizes the margin; the points closest to the hyperplane are the support vectors.
Support Vector Classifiers - account for outliers through the regularization parameter C, which penalizes misclassifications in the margin by a factor of C ≥ 0
Kernel Functions - solve nonlinear problems by computing the similarity between points a, b and mapping the data to a higher dimension. Common functions:
- Polynomial (aᵀb + r)^d
- Radial e^(-γ(a - b)²), where a smaller γ gives smoother boundaries
Hinge Loss - max(0, 1 - y_i(wᵀx_i - b)), where w is the margin width, b is the offset bias, and classes are labeled ±1. Acts as the cost function for SVM. Note, even a correct prediction inside the margin gives loss > 0.
[Figure: hinge loss versus distance from the hyperplane for correctly and incorrectly classified points]
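A small sketch of the hinge loss for a linear classifier, using made-up weights and labels in {-1, +1}; note that the correct point sitting inside the margin would still incur loss.

# Hinge loss max(0, 1 - y * (w.x - b)) for a hypothetical linear classifier
import numpy as np

w, b = np.array([1.5, -0.5]), 0.2            # assumed weights and bias
X = np.array([[1.0, 0.0], [0.2, 0.3], [-1.0, 0.5]])
y = np.array([1, 1, -1])                     # classes labeled +/-1

margins = y * (X @ w - b)
loss = np.maximum(0, 1 - margins)            # zero only for confident, correct predictions
print(loss, loss.mean())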
Hierarchical Clustering
Clusters data into groups using a predominant hierarchy
Agglomerative Approach
1. Each observation starts in its own cluster
2. Iteratively combine the most similar cluster pairs
3. Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster, and splits are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters and combine them using the minimum linkage value over all pairwise points in different clusters by comparing:
- Single - the distance between the closest pair of points
- Complete - the distance between the farthest pair of points
- Ward's - the increase in within-cluster SSE if two clusters were to be combined
Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children
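A minimal sketch of the agglomerative approach with SciPy on made-up data; linkage builds the merge hierarchy (single, complete, or Ward) and fcluster cuts it into flat clusters.

# Hypothetical example: agglomerative clustering with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

Z = linkage(X, method="ward")                    # merge history; try "single" or "complete" too
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 clusters
print(labels)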
Multiclass Prediction
To classify data with 3+ classes C, a common method is to binarize the problem through:
- One vs. Rest - train a classifier for each class c_i by setting c_i's samples as 1 and all others as 0, and predict the class with the highest confidence score
- One vs. One - train a model for each pair of classes, and predict the class with the highest number of positive predictions
k-Nearest Neighbors
Non-parametric method that calculates ŷ using the average value or most common class of its k-nearest points. For high-dimensional data, information is lost through equidistant vectors, so dimension reduction is often applied prior to k-NN.
Minkowski Distance = (Σ|a_i - b_i|^p)^(1/p)
- p = 1 gives Manhattan distance Σ|a_i - b_i|
- p = 2 gives Euclidean distance √(Σ(a_i - b_i)²)
Hamming Distance - count of the differences between two vectors, often used to compare categorical variables
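A small sketch of the Minkowski distance and its Manhattan/Euclidean special cases, plus a Hamming-style count, on made-up vectors.

# Minkowski distance (sum |a_i - b_i|^p)^(1/p) for hypothetical vectors
import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
print(minkowski(a, b, 1))    # Manhattan
print(minkowski(a, b, 2))    # Euclidean

u, v = np.array(["red", "blue", "green"]), np.array(["red", "green", "green"])
print(np.sum(u != v))        # Hamming distance for categorical vectors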
Principal Component Analysis
Projects data onto orthogonal vectors that maximize variance. Remember, given an n × n matrix A, a nonzero vector x, and a scalar λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A. In PCA, the eigenvectors are uncorrelated and represent principal components.
1. Start with the covariance matrix of standardized data
2. Calculate eigenvalues and eigenvectors using SVD or eigendecomposition
3. Rank the principal components by their proportion of variance explained = λ_i / Σλ
Data should be linearly related, and for a p-dimensional dataset, there will be p principal components.
Note, PCA explains the variance in X, not necessarily Y.
Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability

Linear Discriminant Analysis
Supervised method that maximizes separation between classes and minimizes variance within classes for a labeled dataset
1. Compute the mean and variance of each independent variable for every class C_i
2. Calculate the within-class (σ_w²) and between-class (σ_b²) variance
3. Find the matrix W = (σ_w²)⁻¹(σ_b²) that maximizes Fisher's signal-to-noise ratio
4. Rank the discriminant components by their signal-to-noise ratio λ
Note, the number of components is at most C - 1, one less than the number of classes
Assumptions
- Independent variables are normally distributed
- Homoscedasticity - constant variance of error
- Low multicollinearity

Factor Analysis
Describes data using a linear combination of k latent factors. Given a normalized matrix X, it follows the form X = Lf + ε, with factor loadings L and hidden factors f.
[Figure: X (p × n) = L (p × k) f (k × n), illustrated with math, reading, and science scores]
Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain. The 'elbow', where values level off, is often used as the cutoff.
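A minimal sketch of the PCA steps above with NumPy: standardize, take the covariance matrix, eigendecompose, and rank components by explained variance; the correlated data is synthetic.

# PCA via eigendecomposition of the covariance matrix (hypothetical data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.9 * X[:, 0] + rng.normal(0, 0.1, 200)    # add a correlated column

Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # symmetric matrix -> eigh
order = np.argsort(eigvals)[::-1]
explained = eigvals[order] / eigvals.sum()           # proportion of variance explained
components = Xs @ eigvecs[:, order]                  # project onto principal components
print(explained)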
Natural Language Processing
Transforms human language into machine-usable code
Processing Techniques
Word Embeddings - map words to vectors; unlike word2vec and GloVe, BERT outputs different vectors for different uses of words (cell phone vs. blood cell)
Sentiment Analysis
Extracts the attitudes and emotions from text
- Polarity - measures positive, negative, or neutral opinions
- Valence shifters - capture amplifiers or negators such as 'really fun' or 'hardly fun'
Topic Modelling
Captures the underlying themes that appear in documents
- Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic
- Latent Semantic Analysis (LSA) - identifies patterns using tf-idf scores and reduces data to k dimensions through SVD
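A small sketch of LSA, assuming scikit-learn: tf-idf scores followed by truncated SVD reduce the documents to k dimensions; the tiny corpus is made up.

# Hypothetical LSA pipeline: tf-idf + truncated SVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)   # k = 2 latent dimensions
topics = lsa.fit_transform(tfidf)
print(topics.shape, lsa.explained_variance_ratio_)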
Neural Network
Feeds inputs through different hidden layers and relies on weights and nonlinear functions to reach an output
1. Initialize weights W randomly with near-zero values
2. Loop until convergence:
- Calculate the average network loss J(W)
- Backpropagation - iterate backwards from the last layer, computing the gradient ∂J(W)/∂W and updating the weight W ← W - η ∂J(W)/∂W
3. Return the minimum loss weight matrix W
To prevent overfitting, regularization can be applied by:
- Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
- Embedding weight penalties into the objective function
- Batch Normalization - stabilizes learning by normalizing inputs to a layer
Stochastic Gradient Descent - only uses a single point to compute gradients, leading to noisier but faster updates. Alternatively, mini-batch gradient descent trains on small subsets of the data, striking a balance between the approaches.
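A bare-bones sketch of the training loop above, shrunk to a single-layer model (logistic regression) trained by gradient descent on made-up data; a real multi-layer network would backpropagate the gradient through each hidden layer.

# Minimal gradient-descent training loop (single layer, synthetic data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

W = rng.normal(0, 0.01, size=2)           # 1. near-zero initial weights
b, lr = 0.0, 0.1
for _ in range(500):                      # 2. loop until (approximate) convergence
    p = 1 / (1 + np.exp(-(X @ W + b)))    # forward pass
    grad_W = X.T @ (p - y) / len(y)       # gradient of the average cross-entropy loss
    grad_b = np.mean(p - y)
    W -= lr * grad_W                      # W <- W - eta * dJ/dW
    b -= lr * grad_b
print(W, b)                               # 3. learned weights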
Convolutional Neural Network
Analyzes structural or visual data by extracting local features
Convolutional Layers - iterate over windows of the image, applying shared filter weights to produce feature maps
Recurrent Neural Network
Predicts sequential data using architectures such as many-to-one, one-to-many, and many-to-many. Relies on parameter (weight) sharing for efficiency. To avoid redundant calculations during backpropagation, downstream gradients are found by chaining previous gradients. However, repeatedly multiplying values greater than or less than 1 leads to:
- Exploding gradients - model instability and overflows
- Vanishing gradients - loss of learning ability
This can be solved using:
- Gradient clipping - cap the maximum value of gradients
- Long Short-Term Memory - learns long-term dependencies using gated cells and maintains a separate cell state from what is outputted. Gates in LSTM perform the following:
1. Forget and filter out irrelevant info from previous layers
2. Store relevant info from current input
3. Update the current cell state
4. Output the hidden state, a filtered version of the cell state
LSTMs can be stacked to improve performance.
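A toy sketch of why chained gradients explode or vanish, and how clipping caps them; the scaled identity matrices below are stand-ins for per-step Jacobians.

# Repeated multiplication over many steps explodes (factor > 1) or vanishes (factor < 1)
import numpy as np

for factor in (1.2, 0.8):                        # stand-ins for per-step Jacobian scales
    h = np.ones(4)
    for _ in range(50):
        h = (factor * np.eye(4)) @ h
    print(factor, np.linalg.norm(h))             # huge norm vs. near-zero norm

def clip(grad, max_norm=5.0):                    # gradient clipping caps the norm
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

print(np.linalg.norm(clip(np.full(4, 1e6))))     # capped at 5.0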
Boosting
Sequentially fits many simple models that account for the previous model's errors. As opposed to bagging, boosting trains on all the data and combines models using the learning rate α.
AdaBoost - uses sample weighting and decision 'stumps' (one-level decision trees) to classify samples
1. Build decision stumps for every feature, choosing the one with the best classification accuracy
2. Assign more weight to misclassified samples and reward trees that differentiate them, where α = (1/2) ln((1 - TotalError) / TotalError)
3. Continue training and weighting decision stumps until convergence
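A short sketch with scikit-learn's AdaBoostClassifier, whose default base learner is a one-level decision stump; the classification dataset is synthetic.

# Hypothetical AdaBoost example; the default base estimator is a depth-1 stump
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0).fit(X, y)
print(ada.score(X, y))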
Gradient Boost - trains sequential models by minimizing a given loss function using gradient descent at each step
1. Start by predicting the average value of the response
2. Build a tree on the errors, constrained by depth or the number of leaf nodes
3. Scale decision trees by a constant learning rate α
4. Continue training and weighting decision trees until convergence
XGBoost - fast gradient boosting method that utilizes regularization and parallelization
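A compact sketch of the steps above for squared-error loss, where each round fits a shallow tree to the current residuals and is added with a learning rate; the data is made up and scikit-learn's tree is the weak learner.

# Manual gradient boosting for regression: fit trees to residuals (synthetic data)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)

lr, trees = 0.1, []
pred = np.full_like(y, y.mean())             # 1. start from the average response
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)   # 2. tree on the errors
    trees.append(tree)
    pred += lr * tree.predict(X)             # 3. scale each tree by the learning rate
print(np.mean((y - pred) ** 2))              # training MSE after boosting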
Recommender Systems
Suggests relevant items to users by predicting ratings and preferences, and is divided into two main types:
- Content Filtering - recommends similar items
- Collaborative Filtering - recommends items based on the preferences of similar users
Model-based Approaches - predict ratings of unrated items through methods such as Bayesian networks, SVD, and clustering. Handles sparse data better than memory-based approaches.
Matrix Factorization - decomposes the user-item rating matrix into two lower-dimensional matrices representing users and items, each with k latent factors
Recommender systems can also be combined through ensemble methods to improve performance.
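A minimal sketch of matrix factorization by gradient descent on a tiny made-up rating matrix with k latent factors; zeros are treated as unobserved ratings here for simplicity.

# Toy matrix factorization: R ~ U @ V.T with k latent factors (hypothetical ratings)
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                                 # only fit observed ratings
k, lr, reg = 2, 0.01, 0.02
rng = np.random.default_rng(0)
U, V = rng.normal(0, 0.1, (4, k)), rng.normal(0, 0.1, (4, k))

for _ in range(5000):
    err = (R - U @ V.T) * mask
    U += lr * (err @ V - reg * U)            # gradient steps on the squared error
    V += lr * (err.T @ U - reg * V)
print(np.round(U @ V.T, 1))                  # reconstructed ratings, including blanks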
Reinforcement Learning
Maximizes future rewards by learning through state-action pairs. That is, an agent performs actions in an environment, which updates the state and provides a reward.
[Figure: agent-environment loop - the agent takes an action, and the environment returns a state update and a reward]
Multi-armed Bandit Problem - a gambler plays slot machines with unknown probability distributions and must decide the best strategy to maximize reward. This exemplifies the exploration-exploitation tradeoff, as the best long-term strategy may involve short-term sacrifices.
RL is divided into two types, with the former being more common:
- Model-free - learn through trial and error in the environment
- Model-based - access to the underlying (approximate) state-reward distribution
Q-Value Q(s, a) - captures the expected discounted total future reward given a state and action
Policy - chooses the best actions for an agent at various states, π(s) = argmax_a Q(s, a)
Actor-Critic - uses two neural networks, an actor π(s, a, θ) which controls agent behavior and a critic Q(s, a, w) that measures how good an action is. Both run in parallel to find the optimal weights θ, w to maximize expected reward. At each step:
1. Pass the current state into the actor and critic
2. The critic evaluates the action's Q-value, and the actor updates its weight θ
3. The actor takes the next action leading to a new state, and the critic updates its weight w
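A tiny sketch of the exploration-exploitation tradeoff: an epsilon-greedy agent estimates each slot machine's value from sampled rewards; the payout probabilities are invented.

# Epsilon-greedy multi-armed bandit with hypothetical payout probabilities
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.7])           # unknown to the agent
Q = np.zeros(3)                              # estimated value of each arm
counts = np.zeros(3)
eps = 0.1

for _ in range(5000):
    arm = rng.integers(3) if rng.random() < eps else int(np.argmax(Q))  # explore vs. exploit
    reward = float(rng.random() < true_p[arm])
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # running-mean value update
print(np.round(Q, 2))                            # the best arm's estimate approaches its payout rate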
Anomaly Detection
Identifies unusual patterns that differ from the majority of the data. Assumes that anomalies are:
- Rare - the minority class that occurs rarely in the data
- Different - have feature values that are very different from normal observations
Anomaly detection techniques span a wide range, including methods based on:
Statistics - relies on various statistical methods to identify outliers, such as Z-tests, boxplots, interquartile ranges, and variance comparisons
Density - useful when data is grouped around dense neighborhoods, measured by distance. Methods include k-nearest neighbors, local outlier factor, and isolation forest.
- Isolation Forest - tree-based model that labels outliers based on an anomaly score
1. Select a random feature and split value, dividing the dataset in two
2. Continue splitting randomly until every point is isolated
3. Calculate the anomaly score for each observation, based on how many iterations it took to isolate that point
4. If the anomaly score is greater than a threshold, mark it as an outlier
Intuitively, outliers are easier to isolate and should have shorter path lengths in the tree
Clusters - data points outside of clusters could potentially be marked as anomalies
Autoencoders - unsupervised neural networks that compress data through an encoder and reconstruct it using a decoder; observations with high reconstruction error can be flagged as anomalies
Hidden Markov Models - model a set of n underlying hidden states Q using λ = (A, B, π), where:
- A - n × n matrix of transition probabilities from state i to j
- B - sequence of likelihoods of emitting o_t in state i
- π - initial probability distribution over states
HMMs can calculate P(O | λ), find the best hidden state sequence Q, or learn the parameters A and B. Anomalies are observations that are unlikely to occur across states. HMMs can be applied to many problems such as signal processing and part-of-speech tagging.
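A brief sketch with scikit-learn's IsolationForest on synthetic data with a few injected outliers; predictions of -1 mark outliers.

# Hypothetical isolation forest example; -1 = outlier, 1 = inlier
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(6, 8, (5, 2))])   # 5 injected outliers

iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = iso.predict(X)
print((labels == -1).sum(), "points flagged as outliers")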
Time Series
Extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
- Stationarity - statistical properties such as mean, variance, and autocorrelation are constant over time
- Trend - long-term rise or fall in values
- Seasonality - variations associated with specific calendar times, occurring at regular intervals less than a year
- Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
- Autocorrelation - degree of linear similarity between current and lagged values
CV must account for the time aspect, such as for each fold F_x:
- Sliding Window - train F1, test F2, then train F2, test F3
- Forward Chain - train F1, test F2, then train F1, F2, test F3
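A short sketch of forward-chain style cross validation with scikit-learn's TimeSeriesSplit, which always trains on earlier observations and tests on the next block; the series is a simple stand-in.

# Forward-chaining CV: each split trains on the past and tests on the next block
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)          # stand-in for a time-ordered series
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train", train_idx, "test", test_idx)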
Exponential Smoothing - uses an exponentially decreasing weight to observations over time, and takes a moving average. The time t output is s_t = α x_t + (1 - α) s_(t-1), where 0 < α < 1.
Double Exponential Smoothing - applies a recursive exponential filter to capture trends within a time series:
s_t = α x_t + (1 - α)(s_(t-1) + b_(t-1))
b_t = β(s_t - s_(t-1)) + (1 - β) b_(t-1)
Triple exponential smoothing adds a third variable γ that accounts for seasonality.
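A direct NumPy sketch of the single and double smoothing recursions above, on a made-up series with assumed α and β values.

# Single and double exponential smoothing (synthetic series)
import numpy as np

x = np.array([3.0, 3.2, 3.9, 4.4, 5.1, 5.8, 6.2, 7.0])
alpha, beta = 0.5, 0.3

s = [x[0]]                                          # single: s_t = a*x_t + (1-a)*s_(t-1)
for t in range(1, len(x)):
    s.append(alpha * x[t] + (1 - alpha) * s[-1])

s2, b = [x[0]], [x[1] - x[0]]                       # double: also track a trend term b_t
for t in range(1, len(x)):
    s2.append(alpha * x[t] + (1 - alpha) * (s2[-1] + b[-1]))
    b.append(beta * (s2[-1] - s2[-2]) + (1 - beta) * b[-1])
print(np.round(s, 2), np.round(s2, 2))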
ARIMA - models time series using three parameters (p, d, q):
- Autoregressive - the past p values affect the next value
- Integrated - values are replaced with the difference between current and previous values, using the difference degree d (0 for stationary data, and 1 for non-stationary)
- Moving Average - the number of lagged forecast errors q and the size of the moving average window
SARIMA - models seasonality through four additional seasonality-specific parameters: P, D, Q, and the season length s
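A minimal sketch, assuming statsmodels is installed, fitting an ARIMA(1, 1, 1) to a synthetic non-stationary series and forecasting a few steps ahead.

# Hypothetical ARIMA(p=1, d=1, q=1) fit with statsmodels
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.5, 1.0, 200))      # non-stationary series, so d = 1 is reasonable

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))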
Prophet - additive model that uses non-linear trends to account for multiple seasonalities such as yearly, weekly, and daily. Robust to missing data and handles outliers well. Can be represented as y(t) = g(t) + s(t) + h(t) + ε(t), with four distinct components for the growth over time, seasonality, holiday effects, and error. This specification is similar to a generalized additive model.
Generalized Additive Model - combines predictive methods while preserving additivity across variables, in a form such as y = β0 + f1(x1) + ... + fm(xm), where the functions can be non-linear. GAMs also provide regularized and interpretable solutions for regression and classification problems.
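A brief usage sketch, assuming the prophet package is installed; Prophet expects a dataframe with ds (dates) and y (values) columns, and the daily series here is fabricated.

# Hypothetical Prophet usage on a fabricated daily series
import numpy as np
import pandas as pd
from prophet import Prophet

dates = pd.date_range("2021-01-01", periods=120, freq="D")
y = 10 + 0.05 * np.arange(120) + np.sin(np.arange(120) * 2 * np.pi / 7)   # trend + weekly cycle
df = pd.DataFrame({"ds": dates, "y": y})

m = Prophet()                                 # seasonalities modeled additively
m.fit(df)
future = m.make_future_dataframe(periods=14)
print(m.predict(future)[["ds", "yhat"]].tail())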
Statistics
p-value - probability that an effect could have occurred by chance. If less than the significance level α, or if the test statistic is greater than the critical value, then reject the null.
Type I Error (False Positive, α) - rejecting a true null
Type II Error (False Negative, β) - not rejecting a false null
Decreasing Type I Error causes an increase in Type II Error
Confidence Level (1 - α) - probability of finding an effect that did not occur by chance and avoiding a Type I Error
Power (1 - β) - probability of picking up on an effect that is present and avoiding a Type II Error
Confidence Interval - estimated interval that models the long-term frequency of capturing the true parameter value
z-test - tests whether normally distributed population means are different, used when n is large and variances are known
z-score - the number of standard deviations between a data point x and the mean
t-test - used when population variances are unknown, and converges to the z-test when n is large
t-score - uses the standard error as an estimate for population variance
Degrees of Freedom - the number of independent (free) dimensions needed before the parameter estimate can be determined
Chi-Square Tests - measure differences between categorical variables, using χ² = Σ (observed - expected)² / expected, to test:
- Goodness of fit - if samples of one categorical variable match the population category expectations
- Independence - if being in one category is independent of another, based off two categories
- Homogeneity - if different subgroups come from the same population, based off a single category
ANOVA - analysis of variance, used to compare 3+ samples
F-score - compares the ratio of explained and unexplained variance = between-group variance / within-group variance
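A small sketch of a chi-square test of independence with SciPy, on a made-up 2x2 contingency table.

# Chi-square test of independence on a hypothetical contingency table
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],     # e.g. group A: outcome yes / no
                  [20, 40]])    # e.g. group B: outcome yes / no
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)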
Conditional Probability P(A | B) = P(A ∩ B) / P(B)
If A and B are independent, then P(A ∩ B) = P(A)P(B). Note, events that are independent of themselves must have probability either 1 or 0.
Union P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Mutually Exclusive - events cannot happen simultaneously
Expected Value E[X] = Σ x_i p_i, with properties
- E[X + Y] = E[X] + E[Y]
- E[XY] = E[X]E[Y] if X and Y are independent
Variance Var(X) = E[X²] - E[X]², with properties
- Var(X + Y) = Var(X) + Var(Y) ± 2Cov(X, Y)
- Var(aX ± b) = a²Var(X)
Covariance - measures the direction of the joint linear relationship of two variables = Σ(x_i - x̄)(y_i - ȳ) / (n - 1)
Permutation - order matters, nPk = n! / (n - k)!
Combination - order doesn't matter, nCk = n! / (k!(n - k)!)
Left Skew - Mean < Median < Mode
Right Skew - Mean > Median > Mode
Probability vs Likelihood - given a situation θ and observed outcomes O, probability P(O | θ) treats θ as fixed, while the likelihood L(θ | O) = P(O | θ) is viewed as a function of θ and is used to estimate the parameters that best explain the observed data
A/B Testing
Examines user experience through randomized tests with two variants. The typical steps are:
1. Determine the evaluation metric and experiment goals
2. Select a significance level α and power threshold 1 - β
3. Calculate the required sample size per variation
4. Randomly assign users into control and treatment groups
5. Measure and analyze results using the appropriate test
The required sample size depends on α, β, and the MDE.
Minimum Detectable Effect - the target relative minimum increase over the baseline that should be observed from a test
Overall Evaluation Criterion - quantitative measure of the test's objective, commonly used when short and long-term metrics have inverse relationships
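A small sketch of a normal-approximation sample size calculation for comparing two proportions, using assumed baseline and MDE values.

# Approximate sample size per variant for a two-proportion test (assumed inputs)
from scipy.stats import norm

alpha, power = 0.05, 0.8
p1 = 0.10                          # baseline conversion rate (assumed)
p2 = p1 * 1.10                     # 10% relative MDE (assumed)

z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(power)
n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))                    # required users per group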
Multivariate Testing - compares 3+ variants or combinations, but requires larger sample sizes
Bonferroni Correction - when conducting n tests, run each test at the α/n significance level, which lowers the false positive rate of finding effects by chance
Network Effects - changes that occur due to effect spillover from other groups. To detect group interference:
1. Split the population into distinct clusters
2. Randomly assign half the clusters to the control and treatment groups A1 and B1
3. Randomize the other half at the user-level and assign to control and treatment groups A2 and B2
Intuitively, if there are network effects, then the two tests will have different results. To account for network effects, randomize users based on time, cluster, or location.
Sequential Testing - allows for early experiment stopping by drawing statistical borders based on the Type I Error rate. If the effect reaches a border, the test can be stopped. Used to combat peeking (preliminarily checking results of a test), which can inflate p-values and lead to incorrect conclusions.
Cohort Analysis - examines specific groups of users based on behavior or time and can help identify whether novelty or primacy effects are present

Miscellaneous
Shapley Values - measures the marginal contribution of each variable in the output of a model, where the sum of all Shapley values equals the total value (prediction - mean prediction)
SHAP - interpretable Shapley method that utilizes both global and local importance to model variable explainability

Naive Bayes
Classifies data using the label with the highest conditional probability, given data a and classes c. Naive because it assumes variables are independent.
Aaron Wang