General ML Notes
ML Case Study
● Ask questions about the business and how ML is used
● Ask about current model and metrics and features available
● Break down the problem
○ Think about the metric
■ Is there data imbalance?
○ Model selection
■ Is explainability important
■ How much data do we have
■ Training and prediction times for production
■ Accuracy
○ Post production shadowing
A/B Testing
● Steps
○ Problem Statement: What is the goal of the experiment
■ Understand the business problem (e.g., a change in the recommendation algorithm)
■ What is the intended effect of the change?
■ Understand the user journey for the given use case
■ Use the user journey to create the success metric (e.g., revenue per user per day)
● Is the metric measurable?
● Is it attributable (can we assign cause)?
● Is it sensitive enough to distinguish treatment from control, without excessive variability?
● Is it timely?
○ Hypothesis testing: What result do you hypothesize from the experiment
■ State the null and alternative hypothesis
■ Set the alpha/significance level (e.g., 0.05)
■ Set the statistical power (e.g., 0.80)
■ Set the minimum detectable effect (MDE); often around 1% for large companies
○ Design the experiment: What are your experiment parameters
■ Set the randomization unit (e.g., user)
■ Target the population in the experiment from the user journey/funnel (e.g., users that search for a product)
■ Determine the sample size (see the power-analysis sketch below)
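A common way to determine the sample size is a power analysis. Below is a minimal sketch using statsmodels with the α = 0.05 and power = 0.80 settings above; the conversion-rate metric, the 10% baseline rate, and the 1% relative MDE are illustrative assumptions, not values from these notes.

```python
# Minimal sample-size sketch for an A/B test on a conversion-rate metric.
# Baseline rate and relative MDE below are made-up illustrations.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                       # assumed control conversion rate
relative_mde = 0.01                   # minimum detectable relative lift (1%)
treatment = baseline * (1 + relative_mde)

effect_size = proportion_effectsize(treatment, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size per variant: {n_per_group:,.0f}")
```

The smaller the MDE (or the noisier the metric), the larger the required sample, which is why the success metric's sensitivity and variability matter when it is chosen.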
Normalization vs. Standardization
● Normalization (min-max scaling)
○ Rescales values to a fixed range, typically [0, 1]
○ Useful when the distribution of the data is unknown or not Gaussian
○ Retains the shape of the original distribution
○ Sensitive to outliers, since the observed min and max define the range
● Standardization (z-score scaling)
○ Centers data around the mean and scales it to a standard deviation of 1
○ Useful when the distribution of the data is (approximately) Gaussian
○ Retains the shape of the original distribution (it is a linear rescaling)
○ Less sensitive to outliers and does not bound the values to a fixed range
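A quick illustration of the two scalers, using scikit-learn on a made-up toy feature:

```python
# Normalization (min-max) vs. standardization (z-score) on a toy feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])       # toy values with a mild outlier

print(MinMaxScaler().fit_transform(x).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # mean 0, standard deviation 1
```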
○ Confidence interval for the mean: x̄ ± z · (σ / √n), where x̄ is the sample mean, z is the z value from the table for the given confidence level, σ is the standard deviation, and n is the sample size
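A minimal numeric check of this formula (the sample values are made up):

```python
# Confidence interval for the mean: x_bar ± z * sigma / sqrt(n).
import numpy as np
from scipy.stats import norm

sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])  # toy data
x_bar, sigma, n = sample.mean(), sample.std(ddof=1), len(sample)

z = norm.ppf(0.975)                  # z value for a 95% confidence interval
half_width = z * sigma / np.sqrt(n)
print(f"95% CI: ({x_bar - half_width:.3f}, {x_bar + half_width:.3f})")
```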
Advantages and Disadvantages of Traditional Machine Learning Algorithms
Linear Regression
● Parametric Form
Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε
● Residuals are the differences between the actual values of the data points and the values predicted by the regression line
● Cost Function
○ We use mean squared error, which can be derived from the maximum likelihood estimator under the assumption of Gaussian errors
○ MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
● Assumptions
○ Linear relationship between the explanatory variables and the response variable
○ Observations are independent
○ Residuals are normally distributed
○ Residuals have constant variance (homoscedasticity)
○ No multicollinearity among the features
Outlier treatment: discretization, winsorization
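To make the parametric form and residuals concrete, here is a minimal OLS sketch on synthetic data; the coefficients and noise level are made up, and statsmodels is just one reasonable choice, not something these notes prescribe.

```python
# Fit Y = b0 + b1*X1 + b2*X2 + eps by ordinary least squares on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)     # estimates of beta0, beta1, beta2
print(model.resid[:5])  # residuals: actual minus fitted values
```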
Logistic Regression
● Basics
○ Used when the target variable is binary
○ Used to predict the probabilities for classification problems
● Why Logistic Regression rather than Linear Regression
○ Dependent variable is binary instead of continuous
○ Adding outliers will cause the best fit line to shift to fit that point
○ Predicted values from linear regression can fall outside the valid probability range of (0, 1)
● Logistic Function
○ The logistic (sigmoid) function maps the linear predictor to a probability in (0, 1): p = 1 / (1 + e^−(β0 + β1X1 + … + βpXp))
○ Odds = p / (1 − p)
■ Why do we take the log of odds? To equalize the range of the output: odds lie in (0, ∞), while log(odds) lies in (−∞, +∞), matching the range of the linear predictor
■ log(p / (1 − p)) = β0 + β1X1 + … + βpXp
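A tiny numeric illustration of the sigmoid, odds, and log-odds ranges (the z values are arbitrary):

```python
# Sigmoid maps any real number into (0, 1); odds lie in (0, inf); log-odds in (-inf, inf).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-5.0, 0.0, 5.0]:
    p = sigmoid(z)
    odds = p / (1 - p)
    # log(odds) recovers z, which is exactly the logit link.
    print(f"z={z:+.1f}  p={p:.4f}  odds={odds:.4f}  log(odds)={np.log(odds):+.1f}")
```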
● Cost Function
○ The predicted ŷᵢ in logistic regression is a non-linear (sigmoid) function of the parameters, so using squared error gives a non-convex cost surface
■ The problem is that gradient descent can then converge to a local minimum rather than the global one
○ We can derive log loss (binary cross-entropy) using maximum likelihood estimation: J = −(1/n) · Σ [yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ)]
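A minimal log-loss computation matching the formula above; the labels and predicted probabilities are made up, and sklearn.metrics.log_loss is used only as a cross-check:

```python
# Binary cross-entropy (log loss), the cost function derived from maximum likelihood.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y = 1)

manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual, log_loss(y_true, y_prob))        # prints the same value twice
```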
● Assumptions
○ Target is binary
■ Check by simply counting target variable types
○ Observations are independent
■ Check with residual plot analysis
○ There is no multicollinearity among features
■ Check with VIF or other such methods (see the sketch after this list)
○ No extreme outliers
■ Check with Cook's distance and choose how to handle those observations
○ There is a Linear Relationship Between Explanatory Variables and the Logit of
the Response Variable
■ Check with Box-Tidwell test
○ The Sample Size is Sufficiently Large
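As referenced in the multicollinearity assumption above, here is a minimal VIF check with statsmodels on a synthetic feature matrix; the near-collinear feature is constructed on purpose to make the effect visible.

```python
# Variance Inflation Factor (VIF): values well above ~5-10 suggest multicollinearity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))   # x1 and x2 show inflated values
```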
● Assumptions of Logistic Regression vs. Linear Regression
○ In contrast to linear regression, logistic regression does not require:
■ A linear relationship between the explanatory variable(s) and the
response variable.
■ The residuals of the model to be normally distributed.
■ The residuals to have constant variance, also known as homoscedasticity.
Maximum Likelihood Estimation
● Basics
○ Density estimation is used to estimate the probability distribution for a sample of
observations from a problem domain.
○ MLE is a framework/technique used for solving density estimation.
○ It defines a likelihood function for calculating the conditional probability of
observing the data sample given a probability distribution and distribution
parameters.
○ Likelihood: L(θ) = P(X | θ) = ∏ᵢ P(xᵢ | θ)
○ In practice we maximize the log-likelihood, log L(θ) = Σᵢ log P(xᵢ | θ), and the MLE estimate is θ̂ = argmaxθ log L(θ)
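A minimal MLE sketch, assuming a Gaussian model for synthetic data: maximize the log-likelihood numerically and compare against the closed-form mean and standard deviation.

```python
# Maximum likelihood estimation of Gaussian parameters via numerical optimization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def negative_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf                      # keep the optimizer in the valid region
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print("Numerical MLE:", result.x)                      # close to [3.0, 2.0]
print("Closed form:  ", [data.mean(), data.std()])     # MLE of sigma uses ddof=0
```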
P-value and Hypothesis Testing
Hypothesis testing is a fundamental concept in statistics and plays a crucial role in data
analysis, decision-making, and scientific research. It enables us to draw conclusions about
populations based on sample data. One important aspect of hypothesis testing is the calculation
and interpretation of p-values.
Interpreting P-values:
● If the p-value is below a predetermined threshold (e.g., 0.05), the result is considered
statistically significant, and we reject the null hypothesis in favor of the alternative
hypothesis.
● If the p-value is above the threshold, we fail to reject the null hypothesis due to
insufficient evidence to support the alternative hypothesis.
The p-value is the probability of observing results at least as extreme as the ones measured, assuming the null hypothesis is true. The significance level is the threshold set in advance: if the p-value is less than the significance level, we reject the null hypothesis; otherwise we fail to reject it. A lower p-value means the sample provides stronger evidence against the null hypothesis.
● Type I Error:
This error occurs when the decision to reject the null hypothesis goes wrong: the alternative hypothesis H₁ is chosen when the null hypothesis H₀ is actually true. This is also called a False Positive.
Type I error is often denoted by alpha (α), i.e., the significance level: α is the threshold for the proportion of Type I errors we are willing to accept.
● Type II Error:
This error occurs when the decision to retain the null hypothesis goes wrong: the null hypothesis H₀ is retained when the alternative hypothesis H₁ is actually true. This is also called a False Negative. Type II error is denoted by beta (β), and statistical power is 1 − β.
However, these errors are always present in the statistical tests and must be kept in mind while
interpreting the results.
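A quick simulation of the Type I error rate: when the null hypothesis is true (both groups drawn from the same distribution), a test at α = 0.05 should reject roughly 5% of the time. All data below are synthetic.

```python
# Simulate A/A tests: the fraction of p-values below alpha estimates the Type I error rate.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
alpha, n_experiments = 0.05, 5000

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=200)
    b = rng.normal(size=200)        # same distribution, so the null is true
    _, p_value = ttest_ind(a, b)
    false_positives += p_value < alpha

print(false_positives / n_experiments)   # close to 0.05 by construction
```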
T-test, Z-test, F-test, Chi-square Test, ANOVA
● Z-score/test: Tells how many standard deviations from the mean a result is. A z-score of 1 means 1σ from the mean, 2 means 2σ, and so on.
○ Simply put, a z-score (also called a standard score) gives you an idea of how far
from the mean a data point is. But more technically it’s a measure of how many
standard deviations below or above the population mean a raw score is
○ It enables us to compare two scores from different samples
○ Within 1σ on each side of the mean = 68% of the population
○ Within 2σ on each side of the mean = 95% of the population
○ Within 3σ on each side of the mean = 99.7% of the population
○ When to use:
■ If standard deviation of the population is known
■ If sample size is above 30
○ z = (x − μ) / σ for a single value; for a sample mean, z = (x̄ − μ) / (σ / √n)
● T-tests:
○ T-tests are very similar to z-tests; the main difference is that instead of the population standard deviation we use the sample standard deviation, and the statistic is compared against the t-distribution rather than the normal distribution.
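A minimal comparison of the two tests on synthetic samples; statsmodels' ztest and scipy's ttest_ind are used here as convenient implementations, not something these notes prescribe.

```python
# Two-sample z-test vs. two-sample t-test on synthetic control/treatment samples.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)

z_stat, z_p = ztest(treatment, control)        # large samples, so z is appropriate
t_stat, t_p = ttest_ind(treatment, control)    # uses the t-distribution

print(f"z-test: stat={z_stat:.2f}, p={z_p:.4f}")
print(f"t-test: stat={t_stat:.2f}, p={t_p:.4f}")
```

With samples this large the two tests give nearly identical results; the distinction matters most for small samples with unknown population standard deviation.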
Metrics
2. From KDnuggets
It is very important to have an ecosystem to build, test, deploy, and maintain enterprise-grade machine learning models in production environments. ML model development involves acquiring data from multiple trusted sources, processing the data to make it suitable for modeling, choosing an algorithm, building the model, computing performance metrics, and selecting the best-performing model. Model maintenance plays a critical role once the model is deployed into production.
Phases
a. Model Development
● EDA steps
● Choosing the right algorithm
b. Model Operations
ML Model Deployment Best Practices: The recommended model deployment best practices are:
(1) Automate the steps required for ML model development and deployment by leveraging DevOps tooling, which frees up time for model retraining;
(2) Perform continuous model testing, performance monitoring, and retraining after production deployment to keep the model relevant/current as the source data changes and to preserve the desired outcome(s);
(3) Implement logging when exposing ML models as APIs, capturing input features and model output (to track model drift), application context (for debugging production errors), and model version (if multiple retrained models are deployed in production); see the logging sketch below;
(4) Manage all the model metadata in a single repository.
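A minimal sketch of practice (3), assuming a FastAPI service with a hypothetical /predict endpoint and feature schema; the point is only to show what gets logged (inputs, output, model version), not a prescribed implementation.

```python
# Sketch: log inputs, output, and model version when serving a model over an API.
import logging
from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_api")

MODEL_VERSION = "churn-model-v3"     # hypothetical version tag

class Features(BaseModel):           # hypothetical input schema
    tenure_months: float
    monthly_spend: float

app = FastAPI()

@app.post("/predict")
def predict(features: Features):
    score = 0.5                      # placeholder; a real service would call the trained model
    logger.info("model_version=%s inputs=%s output=%.4f", MODEL_VERSION, features, score)
    return {"model_version": MODEL_VERSION, "score": score}
```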
Liquidity Risk and Analytics
● Liquidity risk occurs when a financial institution is unable to meet its short-term debt
obligations
● Particular worry during periods of market stress (changes in risk appetite, interest in certain assets, etc.)
● Liquidity Coverage Ratio (LCR)
○ LCR = HQLA amount / total net cash outflows over a 30-day stress period
○ Structured to ensure that banks possess enough high-quality liquid assets (HQLA) to survive a period of market dislocation and illiquidity lasting 30 calendar days. A 30-day period is deemed to be the minimum necessary to allow the bank's management enough time to take remedial action.
○ LCR currently has to equal or exceed a statutory threshold of 70% according to Pillar I requirements.
● High Quality Liquid Assets (HQLA): unencumbered, high-quality liquid assets held by the firm across entities (see the haircut and LCR sketch after this list)
○ Level 1: most liquid under the LCR Rule and are eligible for inclusion in a firm’s
HQLA amount without a haircut or limit
○ Level 2A: assets are subject to a haircut of 15% of their fair value
○ Level 2B: assets are subject to a haircut of 50% of their fair value
○ In addition, the sum of Level 2A and 2B assets cannot comprise more than 40%
of a firm’s HQLA amount, and Level 2B assets cannot comprise more than 15%
of a firm’s HQLA amount.
● Unsecured and Secured Financing: primary sources of funding are deposits,
collateralized financings, unsecured short- and long-term borrowings, and shareholders’
equity
○ Unsecured Net Cash Outflows: Savings, demand and time deposits, from private
bank clients, consumers, transaction banking clients
○ Secured Net Cash Outflows: repurchase agreements, securities loaned and other
secured financings
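A simplified sketch of the arithmetic described above: apply the Level 2A/2B haircuts, check the 40% and 15% caps, and divide HQLA by 30-day net cash outflows. The balance figures are made up, and the actual LCR Rule applies these caps through more detailed adjustment formulas.

```python
# Simplified HQLA and LCR calculation; all balance figures are illustrative.
level_1, level_2a, level_2b = 800.0, 150.0, 60.0   # fair values (made-up units)
net_cash_outflows_30d = 900.0                      # made-up 30-day stress outflows

hqla_2a = level_2a * (1 - 0.15)     # Level 2A: 15% haircut
hqla_2b = level_2b * (1 - 0.50)     # Level 2B: 50% haircut
hqla = level_1 + hqla_2a + hqla_2b

# Cap checks (the actual rule enforces these through adjustment formulas).
print("Level 2 share: ", (hqla_2a + hqla_2b) / hqla)   # must not exceed 40%
print("Level 2B share:", hqla_2b / hqla)               # must not exceed 15%

lcr = hqla / net_cash_outflows_30d
print(f"LCR = {lcr:.0%}")            # compare against the statutory threshold
```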
1. Be Customer Obsessed
2. Be Courageous with Your Point of View
3. Challenge the Status Quo
4. Act Like an Entrepreneur
5. Have an “It Can Be Done” Attitude
6. Do the Right Thing
7. Be Accountable