General ML Notes
ML Case Study
● Ask questions about the business and how ML is used
● Ask about current model and metrics and features available
● Break down the problem
○ Think about the metric
■ Is there data imbalance?
○ Model selection
■ Is explainability important
■ How much data do we have
■ Training and prediction times for production
■ Accuracy
○ Post production shadowing
A/B Testing
● Steps
○ Problem Statement: What is the goal of the experiment
■ Understand the business problem (e.g., a change in the recommendation algorithm)
■ What is the intended effect of the change?
■ Understand the user journey for the given use case
■ Use the user journey to create the success metric (e.g., revenue per user per day)
● Is the metric measurable?
● Is it attributable (can we assign cause)?
● Is it sensitive enough to distinguish treatment from control, without excessive variability?
● Is it timely?
○ Hypothesis testing: What result do you hypothesize from the experiment
■ State the null and alternative hypothesis
■ Set the alpha/significance level (e.g., 0.05)
■ Set the statistical power (e.g., 0.80)
■ Set the minimum detectable effect (MDE); often around 1% for large companies
○ Design the experiment: What are your experiment parameters
■ Set the randomization unit (e.g., user)
■ Target the population in the experiment from the user journey/funnel (e.g., users that search for a product)
■ Determine the sample size (see the power-analysis sketch below)
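A common way to determine the sample size is a power analysis. Below is a minimal sketch using statsmodels with the α = 0.05 and power = 0.80 settings above; the conversion-rate metric, the 10% baseline rate, and the 1% relative MDE are illustrative assumptions, not values from these notes.

```python
# Minimal sample-size sketch for an A/B test on a conversion-rate metric.
# Baseline rate and relative MDE below are made-up illustrations.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                       # assumed control conversion rate
relative_mde = 0.01                   # minimum detectable relative lift (1%)
treatment = baseline * (1 + relative_mde)

effect_size = proportion_effectsize(treatment, baseline)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size per variant: {n_per_group:,.0f}")
```

The smaller the MDE (or the noisier the metric), the larger the required sample, which is why the success metric's sensitivity and variability matter when it is chosen.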
Normalization vs. Standardization
● Normalization (min-max scaling)
○ Rescales values to a fixed range, typically [0, 1]
○ Useful when the distribution of the data is unknown or not Gaussian
○ Retains the shape of the original distribution
○ Sensitive to outliers, since the observed min and max define the range
● Standardization (z-score scaling)
○ Centers data around the mean and scales it to a standard deviation of 1
○ Useful when the distribution of the data is (approximately) Gaussian
○ Retains the shape of the original distribution (it is a linear rescaling)
○ Less sensitive to outliers and does not bound the values to a fixed range
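A quick illustration of the two scalers, using scikit-learn on a made-up toy feature:

```python
# Normalization (min-max) vs. standardization (z-score) on a toy feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])       # toy values with a mild outlier

print(MinMaxScaler().fit_transform(x).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # mean 0, standard deviation 1
```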
○ Confidence interval for the mean: x̄ ± z · (σ / √n), where x̄ is the sample mean, z is the z value from the table for the given confidence level, σ is the standard deviation, and n is the sample size
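A minimal numeric check of this formula (the sample values are made up):

```python
# Confidence interval for the mean: x_bar ± z * sigma / sqrt(n).
import numpy as np
from scipy.stats import norm

sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])  # toy data
x_bar, sigma, n = sample.mean(), sample.std(ddof=1), len(sample)

z = norm.ppf(0.975)                  # z value for a 95% confidence interval
half_width = z * sigma / np.sqrt(n)
print(f"95% CI: ({x_bar - half_width:.3f}, {x_bar + half_width:.3f})")
```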
Advantages and Disadvantages of Traditional Machine Learning Algorithms
Linear Regression
● Parametric Form
Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε
● Residuals are the differences between the actual values of the data points and the values predicted by the regression line
● Cost Function
○ We use mean squared error, which can be derived from the maximum likelihood estimator under the assumption of Gaussian errors
○ MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
● Assumptions
○ Linear relationship between the explanatory variables and the response variable
○ Observations are independent
○ Residuals are normally distributed
○ Residuals have constant variance (homoscedasticity)
○ No multicollinearity among the features
Outlier treatment: discretization, winsorization
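To make the parametric form and residuals concrete, here is a minimal OLS sketch on synthetic data; the coefficients and noise level are made up, and statsmodels is just one reasonable choice, not something these notes prescribe.

```python
# Fit Y = b0 + b1*X1 + b2*X2 + eps by ordinary least squares on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)     # estimates of beta0, beta1, beta2
print(model.resid[:5])  # residuals: actual minus fitted values
```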
Logistic Regression
● Basics
○ Used when the target variable is binary
○ Used to predict the probabilities for classification problems
● Why Logistic Regression rather than Linear Regression
○ Dependent variable is binary instead of continuous
○ Adding outliers will cause the best fit line to shift to fit that point
○ Predicted values from linear regression can fall outside the valid probability range of (0, 1)
● Logistic Function
○ The logistic (sigmoid) function maps the linear predictor to a probability in (0, 1): p = 1 / (1 + e^−(β0 + β1X1 + … + βpXp))
○ Odds = p / (1 − p)
■ Why do we take the log of odds? To equalize the range of the output: odds lie in (0, ∞), while log(odds) lies in (−∞, +∞), matching the range of the linear predictor
■ log(p / (1 − p)) = β0 + β1X1 + … + βpXp
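A tiny numeric illustration of the sigmoid, odds, and log-odds ranges (the z values are arbitrary):

```python
# Sigmoid maps any real number into (0, 1); odds lie in (0, inf); log-odds in (-inf, inf).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-5.0, 0.0, 5.0]:
    p = sigmoid(z)
    odds = p / (1 - p)
    # log(odds) recovers z, which is exactly the logit link.
    print(f"z={z:+.1f}  p={p:.4f}  odds={odds:.4f}  log(odds)={np.log(odds):+.1f}")
```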
● Cost Function
○ The predicted ŷᵢ in logistic regression is a non-linear (sigmoid) function of the parameters, so using squared error gives a non-convex cost surface
■ The problem is that gradient descent can then converge to a local minimum rather than the global one
○ We can derive log loss (binary cross-entropy) using maximum likelihood estimation: J = −(1/n) · Σ [yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ)]
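A minimal log-loss computation matching the formula above; the labels and predicted probabilities are made up, and sklearn.metrics.log_loss is used only as a cross-check:

```python
# Binary cross-entropy (log loss), the cost function derived from maximum likelihood.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y = 1)

manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual, log_loss(y_true, y_prob))        # prints the same value twice
```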
● Assumptions
○ Target is binary
■ Check by simply counting target variable types
○ Observations are independent
■ Check with residual plot analysis
○ There is no multicollinearity among features
■ Check with VIF or other such methods (see the sketch after this list)
○ No extreme outliers
■ Check with Cook's distance and choose how to handle those observations
○ There is a Linear Relationship Between Explanatory Variables and the Logit of
the Response Variable
■ Check with Box-Tidwell test
○ The Sample Size is Sufficiently Large
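As referenced in the multicollinearity assumption above, here is a minimal VIF check with statsmodels on a synthetic feature matrix; the near-collinear feature is constructed on purpose to make the effect visible.

```python
# Variance Inflation Factor (VIF): values well above ~5-10 suggest multicollinearity.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))   # x1 and x2 show inflated values
```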
● Assumptions of Logistic Regression vs. Linear Regression
○ In contrast to linear regression, logistic regression does not require:
■ A linear relationship between the explanatory variable(s) and the
response variable.
■ The residuals of the model to be normally distributed.
■ The residuals to have constant variance, also known as homoscedasticity.
Maximum Likelihood Estimation
● Basics
○ Density estimation is used to estimate the probability distribution for a sample of
observations from a problem domain.
○ MLE is a framework/technique used for solving density estimation.
○ It defines a likelihood function for calculating the conditional probability of
observing the data sample given a probability distribution and distribution
parameters.
○ Likelihood: L(θ) = P(X | θ) = ∏ᵢ P(xᵢ | θ)
○ In practice we maximize the log-likelihood, log L(θ) = Σᵢ log P(xᵢ | θ), and the MLE estimate is θ̂ = argmaxθ log L(θ)
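A minimal MLE sketch, assuming a Gaussian model for synthetic data: maximize the log-likelihood numerically and compare against the closed-form mean and standard deviation.

```python
# Maximum likelihood estimation of Gaussian parameters via numerical optimization.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def negative_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf                      # keep the optimizer in the valid region
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print("Numerical MLE:", result.x)                      # close to [3.0, 2.0]
print("Closed form:  ", [data.mean(), data.std()])     # MLE of sigma uses ddof=0
```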
P-value and Hypothesis Testing
Hypothesis testing is a fundamental concept in statistics and plays a crucial role in data
analysis, decision-making, and scientific research. It enables us to draw conclusions about
populations based on sample data. One important aspect of hypothesis testing is the calculation
and interpretation of p-values.
Interpreting P-values:
● If the p-value is below a predetermined threshold (e.g., 0.05), the result is considered
statistically significant, and we reject the null hypothesis in favor of the alternative
hypothesis.
● If the p-value is above the threshold, we fail to reject the null hypothesis due to
insufficient evidence to support the alternative hypothesis.
The p-value is the probability of observing results at least as extreme as the ones measured, assuming the null hypothesis is true. The significance level is the threshold set in advance: if the p-value is less than the significance level, we reject the null hypothesis; otherwise we fail to reject it. A lower p-value means the sample provides stronger evidence against the null hypothesis.
● Type I Error:
This error occurs when the decision to reject the null hypothesis goes wrong: the alternative hypothesis H₁ is chosen when the null hypothesis H₀ is actually true. This is also called a False Positive.
Type I error is often denoted by alpha (α), i.e., the significance level: α is the threshold for the proportion of Type I errors we are willing to accept.
● Type II Error:
This error occurs when the decision to retain the null hypothesis goes wrong: the null hypothesis H₀ is retained when the alternative hypothesis H₁ is actually true. This is also called a False Negative. Type II error is denoted by beta (β), and statistical power is 1 − β.
However, these errors are always present in the statistical tests and must be kept in mind while
interpreting the results.
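A quick simulation of the Type I error rate: when the null hypothesis is true (both groups drawn from the same distribution), a test at α = 0.05 should reject roughly 5% of the time. All data below are synthetic.

```python
# Simulate A/A tests: the fraction of p-values below alpha estimates the Type I error rate.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
alpha, n_experiments = 0.05, 5000

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=200)
    b = rng.normal(size=200)        # same distribution, so the null is true
    _, p_value = ttest_ind(a, b)
    false_positives += p_value < alpha

print(false_positives / n_experiments)   # close to 0.05 by construction
```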
T-test, Z-test, F-test, Chi-square Test, ANOVA
● Z-score/test: Tells how many standard deviations from the mean a result is. A z-score of 1 means 1σ from the mean, 2 means 2σ, and so on.
○ Simply put, a z-score (also called a standard score) gives you an idea of how far
from the mean a data point is. But more technically it’s a measure of how many
standard deviations below or above the population mean a raw score is
○ It enables us to compare two scores from different samples
○ Within 1σ on each side of the mean = 68% of the population
○ Within 2σ on each side of the mean = 95% of the population
○ Within 3σ on each side of the mean = 99.7% of the population
○ When to use:
■ If standard deviation of the population is known
■ If sample size is above 30
○ z = (x − μ) / σ for a single value; for a sample mean, z = (x̄ − μ) / (σ / √n)
● T-tests:
○ T-tests are very similar to z-tests; the main difference is that instead of the population standard deviation we use the sample standard deviation, and the statistic is compared against the t-distribution rather than the normal distribution.
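A minimal comparison of the two tests on synthetic samples; statsmodels' ztest and scipy's ttest_ind are used here as convenient implementations, not something these notes prescribe.

```python
# Two-sample z-test vs. two-sample t-test on synthetic control/treatment samples.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(3)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)

z_stat, z_p = ztest(treatment, control)        # large samples, so z is appropriate
t_stat, t_p = ttest_ind(treatment, control)    # uses the t-distribution

print(f"z-test: stat={z_stat:.2f}, p={z_p:.4f}")
print(f"t-test: stat={t_stat:.2f}, p={t_p:.4f}")
```

With samples this large the two tests give nearly identical results; the distinction matters most for small samples with unknown population standard deviation.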
Metrics
2. From KDnuggets
It is very important to have an ecosystem to build, test, deploy, and maintain enterprise-grade machine learning models in production environments. ML model development involves acquiring data from multiple trusted sources, processing the data to make it suitable for modeling, choosing an algorithm, building the model, computing performance metrics, and selecting the best-performing model. Model maintenance plays a critical role once the model is deployed into production.
Phases
a. Model Development
● EDA steps
● Choosing the right algorithm
b. Model Operations
ML Model Deployment Best Practices: The recommended model deployment best practices are:
(1) Automate the steps required for ML model development and deployment by leveraging DevOps tooling, which frees up time for model retraining;
(2) Perform continuous model testing, performance monitoring, and retraining after production deployment to keep the model relevant/current as the source data changes and to preserve the desired outcome(s);
(3) Implement logging when exposing ML models as APIs, capturing input features and model output (to track model drift), application context (for debugging production errors), and model version (if multiple retrained models are deployed in production); see the logging sketch below;
(4) Manage all the model metadata in a single repository.
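A minimal sketch of practice (3), assuming a FastAPI service with a hypothetical /predict endpoint and feature schema; the point is only to show what gets logged (inputs, output, model version), not a prescribed implementation.

```python
# Sketch: log inputs, output, and model version when serving a model over an API.
import logging
from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_api")

MODEL_VERSION = "churn-model-v3"     # hypothetical version tag

class Features(BaseModel):           # hypothetical input schema
    tenure_months: float
    monthly_spend: float

app = FastAPI()

@app.post("/predict")
def predict(features: Features):
    score = 0.5                      # placeholder; a real service would call the trained model
    logger.info("model_version=%s inputs=%s output=%.4f", MODEL_VERSION, features, score)
    return {"model_version": MODEL_VERSION, "score": score}
```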
Liquidity Risk and Analytics
● Liquidity risk occurs when a financial institution is unable to meet its short-term debt
obligations
● Particular worry during periods of market stress (changes in risk appetite, interest in certain assets, etc.)
● Liquidity Coverage Ratio (LCR)
○ LCR = HQLA amount / total net cash outflows over a 30-day stress period
○ Structured to ensure that banks possess enough high-quality liquid assets (HQLA) to survive a period of market dislocation and illiquidity lasting 30 calendar days. A 30-day period is deemed to be the minimum necessary to allow the bank's management enough time to take remedial action.
○ LCR currently has to equal or exceed a statutory threshold of 70% according to Pillar I requirements.
● High Quality Liquid Assets (HQLA): unencumbered, high-quality liquid assets held by the firm across entities (see the haircut and LCR sketch after this list)
○ Level 1: most liquid under the LCR Rule and are eligible for inclusion in a firm’s
HQLA amount without a haircut or limit
○ Level 2A: assets are subject to a haircut of 15% of their fair value
○ Level 2B: assets are subject to a haircut of 50% of their fair value
○ In addition, the sum of Level 2A and 2B assets cannot comprise more than 40%
of a firm’s HQLA amount, and Level 2B assets cannot comprise more than 15%
of a firm’s HQLA amount.
● Unsecured and Secured Financing: primary sources of funding are deposits,
collateralized financings, unsecured short- and long-term borrowings, and shareholders’
equity
○ Unsecured Net Cash Outflows: Savings, demand and time deposits, from private
bank clients, consumers, transaction banking clients
○ Secured Net Cash Outflows: repurchase agreements, securities loaned and other
secured financings
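A simplified sketch of the arithmetic described above: apply the Level 2A/2B haircuts, check the 40% and 15% caps, and divide HQLA by 30-day net cash outflows. The balance figures are made up, and the actual LCR Rule applies these caps through more detailed adjustment formulas.

```python
# Simplified HQLA and LCR calculation; all balance figures are illustrative.
level_1, level_2a, level_2b = 800.0, 150.0, 60.0   # fair values (made-up units)
net_cash_outflows_30d = 900.0                      # made-up 30-day stress outflows

hqla_2a = level_2a * (1 - 0.15)     # Level 2A: 15% haircut
hqla_2b = level_2b * (1 - 0.50)     # Level 2B: 50% haircut
hqla = level_1 + hqla_2a + hqla_2b

# Cap checks (the actual rule enforces these through adjustment formulas).
print("Level 2 share: ", (hqla_2a + hqla_2b) / hqla)   # must not exceed 40%
print("Level 2B share:", hqla_2b / hqla)               # must not exceed 15%

lcr = hqla / net_cash_outflows_30d
print(f"LCR = {lcr:.0%}")            # compare against the statutory threshold
```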
1. Be Customer Obsessed
2. Be Courageous with Your Point of View
3. Challenge the Status Quo
4. Act Like an Entrepreneur
5. Have an “It Can Be Done” Attitude
6. Do the Right Thing
7. Be Accountable