
Data Science Cheatsheet 2.0
Last Updated June 19, 2021
Distributions

Discrete
Binomial - x successes in n events, each with p probability → P(X = x) = (n choose x) pˣ qⁿ⁻ˣ, with μ = np and σ² = npq. If n = 1, this is a Bernoulli distribution.
Geometric - first success with p probability on the nth trial → qⁿ⁻¹p, with μ = 1/p and σ² = (1 - p)/p²
Negative Binomial - number of failures before r successes
Hypergeometric - x successes in n draws, no replacement, from a size N population with X items of that feature, with μ = nX/N
Poisson - number of successes x in a fixed time interval, where success occurs at an average rate λ → P(X = x) = λˣ e⁻�🇻 / x!, with μ = σ² = λ

Continuous
Uniform - all values between a and b are equally likely, with μ = (a + b)/2 and σ² = (b - a)²/12, or (n² - 1)/12 if discrete
Normal/Gaussian N(μ, σ), Standard Normal Z ~ N(0, 1)
- Central Limit Theorem - sample mean of i.i.d. data approaches a normal distribution
- Empirical Rule - 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean
- Normal Approximation - discrete distributions such as Binomial and Poisson can be approximated using z-scores when np, nq, and λ are greater than 10
Exponential - memoryless time between independent events occurring at an average rate λ → λe⁻ᵟˣ, with μ = 1/λ
Gamma - time until n independent events occurring at an average rate λ
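As a quick check of the formulas above, a minimal sketch assuming scipy is available; the parameter values are made up for illustration:

# Minimal sketch (assumes scipy): checking the binomial and Poisson formulas above.
from scipy import stats

n, p = 10, 0.3
binom = stats.binom(n, p)
print(binom.pmf(4))                # P(X = 4) = C(10, 4) * 0.3^4 * 0.7^6
print(binom.mean(), binom.var())   # mu = np = 3.0, sigma^2 = npq = 2.1

lam = 2.5
pois = stats.poisson(lam)
print(pois.pmf(3))                 # P(X = 3) = lam^3 * e^(-lam) / 3!
print(pois.mean(), pois.var())     # both equal lambda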
Model Evaluation

Regression
Mean Squared Error (MSE) = (1/n) Σ(yᵢ - ŷ)²
Sum of Squared Error (SSE) = Σ(yᵢ - ŷ)²
Total Sum of Squares (SST) = Σ(yᵢ - ȳ)²
R² = 1 - SSE/SST, the proportion of explained y-variability
Note, negative R² means the model is worse than just predicting the mean. R² is not valid for nonlinear models, as SS_residual + SS_error ≠ SST.
Adjusted R² = 1 - (1 - R²)(N - 1)/(N - p - 1), which changes only when predictors affect R² above what would be expected by chance

Classification
                 Predict Yes               Predict No
Actual Yes       True Positive (1 - β)     False Negative (β)
Actual No        False Positive (α)        True Negative (1 - α)

- Precision = TP / (TP + FP), percent correct when predicting positive
- Recall, Sensitivity = TP / (TP + FN), percent of actual positives identified correctly (True Positive Rate)
- Specificity = TN / (TN + FP), percent of actual negatives identified correctly, also 1 - FPR (True Negative Rate)
- F1 = 2 (precision × recall) / (precision + recall), useful when classes are imbalanced

ROC Curve - plots TPR vs. FPR for every threshold α. Area Under the Curve measures how likely the model differentiates positives and negatives (perfect AUC = 1, baseline = 0.5).
Precision-Recall Curve - focuses on the correct prediction of the minority class, useful when data is imbalanced
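A minimal sketch of the classification metrics above, assuming scikit-learn is available; the labels and scores are hypothetical:

# Minimal sketch (assumes scikit-learn): confusion-matrix metrics from toy labels and scores.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]   # model probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]     # threshold alpha = 0.5

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # area under the ROC curve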
Concepts

Prediction Error = Bias² + Variance + Irreducible Noise
Bias - wrong assumptions when training → can't capture underlying patterns → underfit
Variance - sensitivity to fluctuations when training → can't generalize on unseen data → overfit
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
- Cross validation to generalize to unseen data
- Dimension reduction and feature selection
In all cases, as variance decreases, bias increases.

ML models can be divided into two types:
- Parametric - uses a fixed number of parameters with respect to sample size
- Non-Parametric - uses a flexible number of parameters and doesn't make particular assumptions on the data

Cross Validation - validates test error with a subset of training data, and selects parameters to maximize average performance
- k-fold - divide data into k groups, and use one to validate
- leave-p-out - use p samples to validate and the rest to train
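A minimal k-fold cross validation sketch, assuming scikit-learn is available; the iris dataset and logistic regression model are placeholders:

# Minimal sketch (assumes scikit-learn): 5-fold CV to estimate generalization performance.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k = 5: train on 4 folds, validate on the held-out fold, then rotate
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())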
Linear Regression

Models linear relationships between a continuous response and explanatory variables
Ordinary Least Squares - find β̂ for ŷ = β̂₀ + β̂X + ε by solving β̂ = (XᵀX)⁻¹XᵀY, which minimizes the SSE
Assumptions
- Linear relationship and independent observations
- Homoscedasticity - error terms have constant variance
- Errors are uncorrelated and normally distributed
- Low multicollinearity
Variance Inflation Factor - measures the severity of multicollinearity → 1/(1 - Rᵢ²), where Rᵢ² is found by regressing Xᵢ against all other variables (a common VIF cutoff is 10)
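A minimal sketch of the normal equation above in numpy; the data is made up and an intercept column is included by hand:

# Minimal sketch: OLS via beta = (X^T X)^(-1) X^T y on toy data.
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])  # first column = intercept
y = np.array([3.1, 4.9, 9.2, 12.8])

beta = np.linalg.inv(X.T @ X) @ X.T @ y   # (X'X)^-1 X'y; np.linalg.lstsq is preferred in practice
y_hat = X @ beta
sse = np.sum((y - y_hat) ** 2)            # sum of squared error minimized by OLS
print(beta, sse)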
Regularization

Add a penalty λ for large coefficients to the cost function, which reduces overfitting. Requires normalized data.
Subset (L0): λ||β̂||₀ = λ(number of non-zero variables)
- Computationally slow, need to fit 2ᵏ models
- Alternatives: forward and backward stepwise selection
LASSO (L1): λ||β̂||₁ = λΣ|β̂|
- Shrinks coefficients to zero, and is robust to outliers
Ridge (L2): λ||β̂||₂ = λΣ(β̂)²
- Reduces effects of multicollinearity
Combining LASSO and Ridge gives Elastic Net
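A minimal sketch of the L1, L2, and Elastic Net penalties, assuming scikit-learn is available; alpha plays the role of the penalty λ and the dataset is a stand-in:

# Minimal sketch (assumes scikit-learn): regularized regression on standardized data.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)       # regularization requires normalized data

lasso = Lasso(alpha=1.0).fit(X, y)          # L1: shrinks some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)          # L2: shrinks coefficients toward zero
enet  = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
print(sum(lasso.coef_ == 0), ridge.coef_[:3], enet.coef_[:3])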
Logistic Regression

Predicts the probability that y belongs to a binary class. Estimates β through maximum likelihood estimation (MLE) by fitting a logistic (sigmoid) function to the data. This is equivalent to minimizing the cross entropy loss. Regularization can be added in the exponent.
P(Y = 1) = 1 / (1 + e^-(β₀ + βx))
The threshold a classifies predictions as either 1 or 0
Assumptions
- Linear relationship between X and log-odds of Y
- Independent observations
- Low multicollinearity
Odds - output probability can be transformed using Odds(Y = 1) = P(Y = 1) / (1 - P(Y = 1)), where P = 1/3 gives 1:2 odds
Coefficients are linearly related to odds, such that a one unit increase in x₁ affects odds by e^β₁
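A minimal sketch, assuming scikit-learn is available, that fits the sigmoid above and reads off how a one-unit increase in the first feature scales the odds; the dataset is a placeholder:

# Minimal sketch (assumes scikit-learn): logistic regression probabilities and odds ratios.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

p = clf.predict_proba(X[:1])[0, 1]      # P(Y = 1) = 1 / (1 + e^-(b0 + bx))
odds_ratio = np.exp(clf.coef_[0][0])    # multiplicative change in odds per unit of x1
print(p, odds_ratio)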
Decision Trees

Classification and Regression Tree (CART)
CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (small cp → deep tree).
[Example tree on the iris data: petal width < 0.6 → setosa; otherwise petal length < 1.7 → versicolor, else virginica, with predictions at the leaf nodes]
CART for classification minimizes the sum of region impurity, where p̂ᵢ is the probability of a sample being in category i. Possible measures, each with a max impurity of 0.5:
- Gini Impurity = 1 - Σ(p̂ᵢ)²
- Cross Entropy = -Σ(p̂ᵢ) log₂(p̂ᵢ)
At each leaf node, CART predicts the most frequent category, assuming false negative and false positive costs are the same. The splitting process handles multicollinearity and outliers. Trees are prone to high variance, so tune through CV.

Random Forest

Trains an ensemble of trees that vote for the final prediction
Bootstrapping - sampling with replacement (will contain duplicates), until the sample is as large as the training set
Bagging - training independent models on different subsets of the data, which reduces variance. Each tree is trained on ~63% of the data, so the out-of-bag 37% can estimate prediction error without resorting to CV.
Deep trees may overfit, but adding more trees does not cause overfitting. Model bias is always equal to one of its individual trees.
Variable Importance - ranks variables by their ability to minimize error when split upon, averaged across all trees
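A minimal sketch covering the Gini impurity measure and an out-of-bag error estimate, assuming scikit-learn is available; the class proportions and dataset are illustrative:

# Minimal sketch: Gini impurity at a node, plus a random forest with OOB scoring (assumes scikit-learn).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def gini(class_probs):
    return 1 - np.sum(np.square(class_probs))   # 1 - sum(p_i^2), max 0.5 for two classes

print(gini([0.5, 0.5]), gini([0.9, 0.1]))       # 0.5 (impure) vs 0.18 (purer)

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)                 # accuracy on the ~37% out-of-bag samples
print(rf.feature_importances_)       # variable importance averaged across trees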
Support Vector Machines

Separates data between two classes by maximizing the margin between the hyperplane and the nearest data points of any class (the support vectors).
[Figure: separating hyperplane with its margin and support vectors]
Support Vector Classifiers - account for outliers through the regularization parameter C, which penalizes misclassifications in the margin by a factor of C > 0
Kernel Functions - solve nonlinear problems by computing the similarity between points a, b and mapping the data to a higher dimension. Common functions:
- Polynomial (ab + r)^d
- Radial e^(-γ(a - b)²), where smaller γ → smoother boundaries
Hinge Loss - max(0, 1 - yᵢ(w·xᵢ - b)), where w is the margin width, b is the offset bias, and classes are labeled ±1. Acts as the cost function for SVM. Note, even a correct prediction inside the margin gives loss > 0.
[Figure: hinge loss versus distance from the hyperplane for correctly and incorrectly classified points]
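A minimal sketch, assuming scikit-learn is available, of an RBF-kernel SVM plus a hand-written hinge loss; the moons dataset and parameter values are illustrative:

# Minimal sketch (assumes scikit-learn): RBF-kernel SVM and hinge loss on toy data.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)   # smaller gamma -> smoother boundary
print(clf.score(X, y), len(clf.support_))             # accuracy, number of support vectors

def hinge_loss(y_pm1, decision):                      # classes labeled +/-1
    return np.maximum(0, 1 - y_pm1 * decision).mean()

print(hinge_loss(np.where(y == 1, 1, -1), clf.decision_function(X)))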
Multiclass Prediction
To classify data with 3+ classes C, a common method is to binarize the problem through:
- One vs. Rest - train a classifier for each class cᵢ by setting cᵢ's samples as 1 and all others as 0, and predict the class with the highest confidence score
- One vs. One - train models for each pair of classes, and predict the class with the highest number of positive predictions

k-Nearest Neighbors
Non-parametric method that calculates ŷ using the average value or most common class of its k-nearest points. For high-dimensional data, information is lost through equidistant vectors, so dimension reduction is often applied prior to k-NN.
Minkowski Distance = (Σ|aᵢ - bᵢ|^p)^(1/p)
- p = 1 gives Manhattan distance Σ|aᵢ - bᵢ|
- p = 2 gives Euclidean distance √(Σ(aᵢ - bᵢ)²)
Hamming Distance - count of the differences between two vectors, often used to compare categorical variables
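A minimal sketch of the Minkowski distance above, the metric k-NN uses to find neighbors; the vectors are made up:

# Minimal sketch: Minkowski distance for p = 1 (Manhattan) and p = 2 (Euclidean).
import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(a, b, 1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5
print(minkowski(a, b, 2))   # Euclidean: sqrt(9 + 4 + 0) ~= 3.61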
Clustering

Unsupervised, non-parametric methods that group similar data points together based on distance

k-Means
Randomly place k centroids across normalized data, and assign observations to the nearest centroid. Recalculate centroids as the mean of assignments and repeat until convergence. Using the median or medoid (an actual data point) may be more robust to noise and outliers. k-modes is used for categorical data.
k-means++ - improves selection of initial clusters:
1. Pick the first center randomly
2. Compute distance between points and the nearest center
3. Choose a new center using a weighted probability distribution proportional to distance
4. Repeat until k centers are chosen
Evaluating the number of clusters and performance (see the sketch below):
Silhouette Value - measures how similar a data point is to its own cluster compared to other clusters, and ranges from 1 (best) to -1 (worst)
Davies-Bouldin Index - ratio of within-cluster scatter to between-cluster separation, where lower values are better

Hierarchical Clustering
Clusters data into groups using a predominant hierarchy
Agglomerative Approach
1. Each observation starts in its own cluster
2. Iteratively combine the most similar cluster pairs
3. Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster, and splits are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters and combine them using the minimum linkage value over all pairwise points in different clusters by comparing:
- Single - the distance between the closest pair of points
- Complete - the distance between the farthest pair of points
- Ward's - the increase in within-cluster SSE if two clusters were to be combined
Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children
Dimension Reduction

High-dimensional data can lead to the curse of dimensionality, which increases the risk of overfitting and decreases the value added. The number of samples for each feature combination quickly becomes sparse, reducing model performance.

Principal Component Analysis
Projects data onto orthogonal vectors that maximize variance. Remember, given an n × n matrix A, a nonzero vector x, and a scalar λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A. In PCA, the eigenvectors are uncorrelated and represent principal components.
1. Start with the covariance matrix of standardized data
2. Calculate eigenvalues and eigenvectors using SVD or eigendecomposition
3. Rank the principal components by their proportion of variance explained = λᵢ / Σλ
Data should be linearly related, and for a p-dimensional dataset, there will be p principal components.
Note, PCA explains the variance in X, not necessarily Y.
Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability
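A minimal sketch of the three PCA steps above via eigendecomposition in numpy; the data is randomly generated with one correlated column for illustration:

# Minimal sketch: PCA by eigendecomposition of the covariance matrix of standardized data.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + 0.1 * rng.normal(size=100)])  # correlated 3rd column
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize

cov = np.cov(X, rowvar=False)                     # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues returned in ascending order
order = np.argsort(eigvals)[::-1]
explained = eigvals[order] / eigvals.sum()        # proportion of variance explained per component
components = eigvecs[:, order]                    # orthogonal principal components
scores = X @ components                           # data projected onto the components
print(explained)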
Linear Discriminant Analysis
Supervised method that maximizes separation between classes and minimizes variance within classes for a labeled dataset
1. Compute the mean and variance of each independent variable for every class Cᵢ
2. Calculate the within-class (σ²_w) and between-class (σ²_b) variance
3. Find the matrix W = (σ²_w)⁻¹(σ²_b) that maximizes Fisher's signal-to-noise ratio
4. Rank the discriminant components by their signal-to-noise ratio λ
Note, the number of components is at most C - 1
Assumptions
- Independent variables are normally distributed
- Homoscedasticity - constant variance of error
- Low multicollinearity

Factor Analysis
Describes data using a linear combination of k latent factors. Given a normalized matrix X, it follows the form X = Lf + ε, with factor loadings L and hidden factors f.
[Figure: data matrix (p × n) expressed as factor loadings (p × k) times common factors (k × n), e.g. math, reading, and science scores loading on hidden factors]
Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain. The 'elbow' where values level off is often used as the cutoff.
Natural Language Processing

Transforms human language into machine-usable code
Processing Techniques
- Tokenization - splits text into individual words (tokens)
- Lemmatization - reduces words to their base form based on dictionary definition (am, are, is → be)
- Stemming - reduces words to their base form without context (ended → end)
- Stop words - removes common and irrelevant words (the, is)
Markov Chain - stochastic and memoryless process that predicts future events based only on the current state
n-gram - predicts the next term in a sequence of n terms based on Markov chains
Bag-of-words - represents text using word frequencies, without context or order
tf-idf - measures word importance for a document in a collection (corpus), by multiplying the term frequency (occurrences of a term in a document) with the inverse document frequency (penalizes common terms across a corpus)
Cosine Similarity - measures similarity between vectors, calculated as cos(θ) = (A·B)/(||A|| ||B||), which ranges from 0 to 1
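A minimal sketch of tf-idf and cosine similarity, assuming scikit-learn is available; the tiny corpus is made up:

# Minimal sketch (assumes scikit-learn): tf-idf vectors and cosine similarity between documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat",
          "the cat chased the mouse",
          "stock prices rose sharply today"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(corpus)                   # documents x terms, tf * idf weights
print(cosine_similarity(X[0], X[1]))              # related documents -> closer to 1
print(cosine_similarity(X[0], X[2]))              # unrelated documents -> closer to 0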
Word Embedding
Maps words and phrases to numerical vectors
word2vec - trains iteratively over local word context windows, places similar words close together, and embeds sub-relationships directly into vectors, such that king - man + woman ≈ queen
Relies on one of the following:
- Continuous bag-of-words (CBOW) - predicts the word given its context
- skip-gram - predicts the context given a word
GloVe - combines both global and local word co-occurrence data to learn word similarity
BERT - accounts for word order and trains on subwords, and unlike word2vec and GloVe, BERT outputs different vectors for different uses of words (cell phone vs. blood cell)

Sentiment Analysis
Extracts the attitudes and emotions from text
Polarity - measures positive, negative, or neutral opinions
Valence shifters - capture amplifiers or negators such as 'really fun' or 'hardly fun'
Sentiment - measures emotional states such as happy or sad
Subject-Object Identification - classifies sentences as either subjective or objective

Topic Modelling
Captures the underlying themes that appear in documents
Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic
Latent Semantic Analysis (LSA) - identifies patterns using tf-idf scores and reduces data to k dimensions through SVD
Neural Network

Feeds inputs through different hidden layers and relies on weights and nonlinear functions to reach an output
[Figure: input layer, hidden layers, output layer]
Perceptron - the foundation of a neural network that multiplies inputs by weights, adds bias, and feeds the result z to an activation function
Activation Function - defines a node's output. Common choices:
- Sigmoid 1/(1 + e⁻ᶻ)
- ReLU max(0, z)
- Tanh (eᶻ - e⁻ᶻ)/(eᶻ + e⁻ᶻ)
Softmax - given final layer outputs, provides class probabilities that sum to 1 → e^(zᵢ) / Σeᶻ. If there is more than one 'correct' label, the sigmoid function provides probabilities for all, some, or none of the labels.
Loss Function - measures prediction error using functions such as MSE for regression and binary cross-entropy for probability-based classification
Gradient Descent - minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size). Note, γ can be updated adaptively for better performance. For neural networks, finding the best set of weights involves:
1. Initialize weights W randomly with near-zero values
2. Loop until convergence:
   - Calculate the average network loss J(W)
   - Backpropagation - iterate backwards from the last layer, computing the gradient ∂J(W)/∂W and updating the weight W ← W - γ ∂J(W)/∂W
3. Return the minimum loss weight matrix W
To prevent overfitting, regularization can be applied by:
- Stopping training when validation performance drops
- Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
- Embedding weight penalties into the objective function
- Batch Normalization - stabilizes learning by normalizing inputs to a layer
Stochastic Gradient Descent - only uses a single point to compute gradients, leading to smoother convergence and faster compute speeds. Alternatively, mini-batch gradient descent trains on small subsets of the data, striking a balance between the two approaches.
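A minimal sketch of a single sigmoid unit trained by gradient descent on cross-entropy loss; the data is synthetic and the learning rate γ is arbitrary:

# Minimal sketch: perceptron-style sigmoid unit trained with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)         # toy separable labels

w, b, gamma = np.zeros(2), 0.0, 0.1               # near-zero initial weights, learning rate
for _ in range(500):
    z = X @ w + b                                 # weighted inputs plus bias
    p = 1 / (1 + np.exp(-z))                      # sigmoid activation
    grad_w = X.T @ (p - y) / len(y)               # gradient of binary cross-entropy loss
    grad_b = np.mean(p - y)
    w -= gamma * grad_w                           # step in direction of steepest descent
    b -= gamma * grad_b

print(np.mean((p >= 0.5) == y))                   # training accuracy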
Convolutional Neural Network

Analyzes structural or visual data by extracting local features
Convolutional Layers - iterate over windows of the image, applying weights, bias, and an activation function to create feature maps. Different weights lead to different feature maps.
[Figure: a window sliding over the image to produce a feature map]
Pooling - downsamples convolution layers to reduce dimensionality and maintain spatial invariance, allowing detection of features even if they have shifted slightly. Common techniques return the max or average value in the pooling window.
The general CNN architecture is as follows:
1. Perform a series of convolution, ReLU, and pooling operations, extracting important features from the data
2. Feed the output into a fully-connected layer for classification, object detection, or other structural analyses
Recurrent Neural Network

Predicts sequential data using a temporally connected system that captures both new inputs and previous outputs using hidden states
[Figure: unrolled RNN with shared weights W across time steps]
RNNs can model various input-output scenarios, such as many-to-one, one-to-many, and many-to-many. They rely on parameter (weight) sharing for efficiency. To avoid redundant calculations during backpropagation, downstream gradients are found by chaining previous gradients. However, repeatedly multiplying values greater than or less than 1 leads to:
- Exploding gradients - model instability and overflows
- Vanishing gradients - loss of learning ability
This can be solved using:
- Gradient clipping - cap the maximum value of gradients
- ReLU - its derivative prevents gradient shrinkage for x > 0
- Gated cells - regulate the flow of information
Long Short-Term Memory - learns long-term dependencies using gated cells and maintains a separate cell state from what is outputted. Gates in LSTM perform the following:
1. Forget and filter out irrelevant info from previous layers
2. Store relevant info from the current input
3. Update the current cell state
4. Output the hidden state, a filtered version of the cell state
LSTMs can be stacked to improve performance.
Boosting

Sequentially fits many simple models that account for the previous model's errors. As opposed to bagging, boosting trains on all the data and combines models using the learning rate α.
AdaBoost - uses sample weighting and decision 'stumps' (one-level decision trees) to classify samples
1. Build decision stumps for every feature, choosing the one with the best classification accuracy
2. Assign more weight to misclassified samples and reward trees that differentiate them, where α = (1/2) ln((1 - TotalError) / TotalError)
3. Continue training and weighting decision stumps until convergence
Gradient Boost - trains sequential models by minimizing a given loss function using gradient descent at each step
1. Start by predicting the average value of the response
2. Build a tree on the errors, constrained by depth or the number of leaf nodes
3. Scale decision trees by a constant learning rate α
4. Continue training and weighting decision trees until convergence
XGBoost - fast gradient boosting method that utilizes regularization and parallelization
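A minimal sketch, assuming scikit-learn is available, of AdaBoost with stump base learners and a gradient-boosted model scaled by a learning rate; the datasets are placeholders:

# Minimal sketch (assumes scikit-learn): AdaBoost and gradient boosting.
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor

Xc, yc = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=100).fit(Xc, yc)   # default base learner is a depth-1 stump

Xr, yr = load_diabetes(return_X_y=True)
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=2).fit(Xr, yr)  # each tree scaled by the learning rate
print(ada.score(Xc, yc), gbr.score(Xr, yr))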
Recommender Systems

Suggests relevant items to users by predicting ratings and preferences, and is divided into two main types:
- Content Filtering - recommends similar items
- Collaborative Filtering - recommends what similar users like
The latter is more common, and includes methods such as:
Memory-based Approaches - find neighborhoods by using rating data to compute user and item similarity, measured using correlation or cosine similarity
- User-User - similar users also liked...
  Leads to more diverse recommendations, as opposed to just recommending popular items
  Suffers from sparsity, as the number of users who rate items is often low
- Item-Item - similar users who liked this item also liked...
  Efficient when there are more users than items, since the item neighborhoods update less frequently than users
  Similarity between items is often more reliable than similarity between users
Model-based Approaches - predict ratings of unrated items through methods such as Bayesian networks, SVD, and clustering. Handles sparse data better than memory-based approaches.
- Matrix Factorization - decomposes the user-item rating matrix into two lower-dimensional matrices representing users and items, each with k latent factors
Recommender systems can also be combined through ensemble methods to improve performance.
Reinforcement Learning

Maximizes future rewards by learning through state-action pairs. That is, an agent performs actions in an environment, which updates the state and provides a reward.
[Figure: agent takes an action, the environment returns a state update and a reward]
Multi-armed Bandit Problem - a gambler plays slot machines with unknown probability distributions and must decide the best strategy to maximize reward. This exemplifies the exploration-exploitation tradeoff, as the best long-term strategy may involve short-term sacrifices.
RL is divided into two types, with the former being more common:
- Model-free - learn through trial and error in the environment
- Model-based - access to the underlying (approximate) state-reward distribution
Q-Value Q(s, a) - captures the expected discounted total future reward given a state and action
Policy - chooses the best actions for an agent at various states, π(s) = arg maxₐ Q(s, a)
Deep RL algorithms can further be divided into two main types, depending on their learning objective:
Value Learning - aims to approximate Q(s, a) for all actions the agent can take, but is restricted to discrete action spaces. Can use the ε-greedy method, where ε measures the probability of exploration. If chosen, the next action is selected uniformly at random.
- Q-Learning - simple value iteration model that maximizes the Q-value using a table of states and actions (see the sketch below)
- Deep Q Network - finds the best action to take by minimizing the Q-loss, the squared error between the target Q-value and the prediction
Policy Gradient Learning - directly optimizes the policy π(s) through a probability distribution of actions, without the need for a value function, allowing for continuous action spaces
Actor-Critic Model - hybrid algorithm that relies on two neural networks, an actor π(s, a, θ) which controls agent behavior and a critic Q(s, a, w) that measures how good an action is. Both run in parallel to find the optimal weights θ, w to maximize expected reward. At each step:
1. Pass the current state into the actor and critic
2. The critic evaluates the action's Q-value, and the actor updates its weight θ
3. The actor takes the next action leading to a new state, and the critic updates its weight w
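A minimal sketch of the tabular Q-learning update and ε-greedy action selection, for a hypothetical environment with discrete states and actions; the sizes, rates, and transition are made up:

# Minimal sketch: tabular Q-learning update (alpha = learning rate, gamma = discount).
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def update(s, a, r, s_next):
    # move Q(s, a) toward the reward plus the discounted best future value
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def policy(s):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

update(s=0, a=policy(0), r=1.0, s_next=1)   # one hypothetical transition
print(Q[0])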
Anomaly Detection

Identifies unusual patterns that differ from the majority of the data. Assumes that anomalies are:
- Rare - the minority class that occurs rarely in the data
- Different - have feature values that are very different from normal observations
Anomaly detection techniques span a wide range, including methods based on:
Statistics - relies on various statistical methods to identify outliers, such as Z-tests, boxplots, interquartile ranges, and variance comparisons
Density - useful when data is grouped around dense neighborhoods, measured by distance. Methods include k-nearest neighbors, local outlier factor, and isolation forest.
- Isolation Forest - tree-based model that labels outliers based on an anomaly score (see the sketch below)
  1. Select a random feature and split value, dividing the dataset in two
  2. Continue splitting randomly until every point is isolated
  3. Calculate the anomaly score for each observation, based on how many iterations it took to isolate that point
  4. If the anomaly score is greater than a threshold, mark it as an outlier
  Intuitively, outliers are easier to isolate and should have shorter path lengths in the tree
Clusters - data points outside of clusters could potentially be marked as anomalies
Autoencoders - unsupervised neural networks that compress data through an encoder and reconstruct it using a decoder. Autoencoders do not reconstruct the data perfectly, but rather focus on capturing important features in the data.
[Figure: input → encoder → compressed representation → decoder → output]
The decoder struggles to capture anomalous patterns, and the reconstruction error acts as a score to detect anomalies. Autoencoders can also be used for image processing, dimension reduction, and information retrieval.
Hidden Markov Model - uses observed events O to model a set of n underlying states Q using λ = (A, B, π)
- A - n × n matrix of transition probabilities from state i to j
- B - sequence of likelihoods of emitting oₜ in state i
- π - initial probability distribution over states
HMMs can calculate P(O | λ), find the best hidden state sequence Q, or learn the parameters A and B. Anomalies are observations that are unlikely to occur across states.
HMMs can be applied to many problems such as signal processing and part-of-speech tagging.
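A minimal sketch of the isolation forest steps above, assuming scikit-learn is available; the inlier and outlier points are synthetic:

# Minimal sketch (assumes scikit-learn): isolation forest flagging easily-isolated points as outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = iso.predict(X)              # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)        # lower scores = more anomalous (shorter path lengths)
print(np.sum(labels == -1))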

Time Series

Extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
- Stationarity - statistical properties such as mean, variance, and autocorrelation are constant over time
- Trend - long-term rise or fall in values
- Seasonality - variations associated with specific calendar times, occurring at regular intervals less than a year
- Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
- Autocorrelation - degree of linear similarity between current and lagged values
CV must account for the time aspect, such as for each fold Fx:
- Sliding Window - train F1, test F2, then train F2, test F3
- Forward Chain - train F1, test F2, then train F1, F2, test F3
Exponential Smoothing - uses an exponentially decreasing weight to observations over time, and takes a moving average. The time t output is sₜ = αxₜ + (1 - α)sₜ₋₁, where 0 < α < 1 (see the sketch below).
Double Exponential Smoothing - applies a recursive exponential filter to capture trends within a time series:
sₜ = αxₜ + (1 - α)(sₜ₋₁ + bₜ₋₁)
bₜ = β(sₜ - sₜ₋₁) + (1 - β)bₜ₋₁
Triple exponential smoothing adds a third variable γ that accounts for seasonality.
ARIMA - models time series using three parameters (p, d, q):
- Autoregressive - the past p values affect the next value
- Integrated - values are replaced with the difference between current and previous values, using the difference degree d (0 for stationary data, and 1 for non-stationary)
- Moving Average - the number of lagged forecast errors and the size of the moving average window
SARIMA - models seasonality through four additional seasonality-specific parameters: P, D, Q, and the season length s
Prophet - additive model that uses non-linear trends to account for multiple seasonalities such as yearly, weekly, and daily. Robust to missing data and handles outliers well. Can be represented as y(t) = g(t) + s(t) + h(t) + ε(t), with four distinct components for the growth over time, seasonality, holiday effects, and error. This specification is similar to a generalized additive model.
Generalized Additive Model - combines predictive methods while preserving additivity across variables, in a form such as y = β₀ + f₁(x₁) + ... + fₘ(xₘ), where the functions can be non-linear. GAMs also provide regularized and interpretable solutions for regression and classification problems.
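A minimal sketch of the simple and double exponential smoothing recursions above, on a made-up series with an upward trend; the α and β values are arbitrary:

# Minimal sketch: simple and double exponential smoothing from the recursions above.
import numpy as np

x = np.array([10.0, 12.0, 13.5, 15.0, 17.2, 18.9, 21.0])

def simple_smooth(x, alpha):
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1 - alpha) * s[-1])      # s_t = a*x_t + (1-a)*s_{t-1}
    return np.array(s)

def double_smooth(x, alpha, beta):
    s, b = [x[0]], [x[1] - x[0]]                          # initialize level and trend
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1 - alpha) * (s[-1] + b[-1]))
        b.append(beta * (s[-1] - s[-2]) + (1 - beta) * b[-1])
    return np.array(s), np.array(b)

print(simple_smooth(x, alpha=0.5))
print(double_smooth(x, alpha=0.5, beta=0.3)[0])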
Statistics

p-value - probability that an effect could have occurred by chance. If less than the significance level α, or if the test statistic is greater than the critical value, then reject the null.
Type I Error (False Positive α) - rejecting a true null
Type II Error (False Negative β) - not rejecting a false null
Decreasing Type I Error causes an increase in Type II Error
Confidence Level (1 - α) - probability of finding an effect that did not occur by chance and avoiding a Type I error
Power (1 - β) - probability of picking up on an effect that is present and avoiding a Type II Error
Confidence Interval - estimated interval that models the long-term frequency of capturing the true parameter value
z-test - tests whether normally distributed population means are different, used when n is large and variances are known
- z-score - the number of standard deviations between a data point x and the mean
t-test - used when population variances are unknown, and converges to the z-test when n is large
- t-score - uses the standard error as an estimate for population variance
Degrees of Freedom - the number of independent (free) dimensions needed before the parameter estimate can be determined
Chi-Square Tests - measure differences between categorical variables, using χ² = Σ (observed - expected)² / expected, to test (see the sketch below):
- Goodness of fit - if samples of one categorical variable match the population category expectations
- Independence - if being in one category is independent of another, based off two categories
- Homogeneity - if different subgroups come from the same population, based off a single category
ANOVA - analysis of variance, used to compare 3+ samples
- F-score - compares the ratio of explained and unexplained variance = between-group variance / within-group variance
Conditional Probability - P(A | B) = P(A ∩ B) / P(B)
If A and B are independent, then P(A ∩ B) = P(A)P(B). Note, events that are independent of themselves must have probability either 1 or 0.
Union - P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Mutually Exclusive - events cannot happen simultaneously
Expected Value - E[X] = Σxᵢpᵢ, with properties
- E[X + Y] = E[X] + E[Y]
- E[XY] = E[X]E[Y] if X and Y are independent
Variance - Var(X) = E[X²] - E[X]², with properties
- Var(X + Y) = Var(X) + Var(Y) ± 2Cov(X, Y)
- Var(aX ± b) = a²Var(X)
Covariance - measures the direction of the joint linear relationship of two variables → Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
Correlation - normalizes covariance to provide both strength and direction of linear relationships → r = Cov(x, y) / (σₓσᵧ)
Independent variables are uncorrelated, though the inverse is not necessarily true
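A minimal sketch of a chi-square test of independence, assuming scipy is available; the 2x2 contingency counts are hypothetical:

# Minimal sketch (assumes scipy): chi-square test of independence on a contingency table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],     # rows: group A/B, columns: outcome yes/no
                  [20, 25]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)       # reject independence if p_value < alpha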
A/B Testing

Examines user experience through randomized tests with two variants. The typical steps are:
1. Determine the evaluation metric and experiment goals
2. Select a significance level α and power threshold 1 - β
3. Calculate the required sample size per variation
4. Randomly assign users into control and treatment groups
5. Measure and analyze results using the appropriate test
The required sample size depends on α, β, and the MDE (see the sketch below)
Minimum Detectable Effect - the target relative minimum increase over the baseline that should be observed from a test
Overall Evaluation Criterion - quantitative measure of the test's objective, commonly used when short and long-term metrics have inverse relationships
Multivariate Testing - compares 3+ variants or combinations, but requires larger sample sizes
Bonferroni Correction - when conducting n tests, run each test at the α/n significance level, which lowers the false positive rate of finding effects by chance
Network Effects - changes that occur due to effect spillover from other groups. To detect group interference:
1. Split the population into distinct clusters
2. Randomly assign half the clusters to the control and treatment groups A1 and B1
3. Randomize the other half at the user-level and assign to control and treatment groups A2 and B2
4. Intuitively, if there are network effects, then the tests will have different results
To account for network effects, randomize users based on time, cluster, or location
Sequential Testing - allows for early experiment stopping by drawing statistical borders based on the Type I Error rate. If the effect reaches a border, the test can be stopped. Used to combat peeking (preliminarily checking results of a test), which can inflate p-values and lead to incorrect conclusions.
Cohort Analysis - examines specific groups of users based on behavior or time and can help identify whether novelty or primacy effects are present
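A minimal sketch of an approximate per-variation sample size for a two-proportion test, followed by a two-proportion z-test; the baseline rate, MDE, and observed counts are hypothetical, and the pooled-variance formula is a standard approximation rather than the only option:

# Minimal sketch (assumes scipy): approximate sample size per variation and a two-proportion z-test.
import numpy as np
from scipy.stats import norm

alpha, power = 0.05, 0.8
p1 = 0.10                          # hypothetical baseline conversion rate
p2 = p1 * 1.10                     # 10% relative minimum detectable effect

z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
p_bar = (p1 + p2) / 2
n = ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)) / (p2 - p1) ** 2
print(int(np.ceil(n)))             # approximate users needed in each group

# two-proportion z-test on hypothetical observed results
x1, n1, x2, n2 = 1020, 10000, 1102, 10000
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (x2 / n2 - x1 / n1) / se
print(z, 2 * (1 - norm.cdf(abs(z))))   # test statistic and two-sided p-value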
Naive Bayes

Classifies data using the label with the highest conditional probability, given data a and classes c. Naive because it assumes variables are independent.
Bayes' Theorem - P(cᵢ | a) = P(a | cᵢ) P(cᵢ) / P(a)
Gaussian Naive Bayes - calculates conditional probability for continuous data by assuming a normal distribution
Miscellaneous

Shapley Values - measure the marginal contribution of each variable in the output of a model, where the sum of all Shapley values equals the total value (prediction - mean prediction)
SHAP - interpretable Shapley method that utilizes both global and local importance to model variable explainability
Permutation - order matters → n! / (n - k)! = nPk
Combination - order doesn't matter → n! / (k!(n - k)!) = nCk = (n choose k)
Left Skew - Mean < Median ≤ Mode
Right Skew - Mean > Median ≥ Mode
Probability vs Likelihood - given a situation θ and observed outcomes O, probability is calculated as P(O | θ). However, when the true values for θ are unknown, O is used to estimate the θ that maximizes the likelihood function. That is, L(θ | O) = P(O | θ).

Aaron Wang