Data Science Statistics Mathematics Cheat Sheet
Regressor Metrics. MAE, MSE, RMSE, MSLE, R². What they are, when to use them,
and how to implement them.
Classifier Metrics
Classifier metrics are metrics used to evaluate the performance of machine learning
classifiers — models that put each training example into one of several discrete
categories.
F1 Score combines precision and recall through the harmonic mean. The exact
formula for it is (2 × precision × recall) / (precision + recall) . The harmonic
mean is used because it penalizes extreme values: if either precision or recall is
low, the F1 score drops sharply, whereas the arithmetic mean would naïvely weight
both components the same.
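A minimal sketch of the harmonic-versus-arithmetic mean point, using made-up precision and recall values (the numbers are illustrative, not from the article):

```python
# Made-up illustration values: a model with high precision but low recall.
precision = 0.9
recall = 0.5

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Arithmetic mean for comparison.
arithmetic = (precision + recall) / 2

# The harmonic mean is pulled toward the lower of the two values,
# so f1 (~0.643) comes out below the arithmetic mean (0.7).
```

This is why F1 cannot be rescued by one very good component: both precision and recall must be reasonably high.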
Using F1 Score vs Accuracy: The F1 score should be used when avoiding mistakes
matters most (false positives and false negatives are penalized more heavily),
whereas accuracy should be used when the goal is overall correctness. Which
metric to use depends on context and on the data. Generally, however, the F1
score is better for imbalanced classes (for example, cancer diagnoses, where
there are vastly more negatives than positives), whereas accuracy is better for
more balanced classes.
Sensitivity/Recall: recall_score
Precision: precision_score
F1 Score: f1_score
Accuracy: accuracy_score
Balanced Accuracy (for unevenly distributed classes): balanced_accuracy_score
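The function names above come from sklearn.metrics; a sketch of calling them on toy labels (y_true and y_pred are made-up illustration values):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score)

# Made-up binary labels: 3 true positives, 1 false negative,
# 1 false positive, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
```

Each function takes the true labels first and the predicted labels second.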
Regressor Metrics
Regression metrics measure how well a model performs when it places each
training example on a continuous scale, such as predicting the price of a house.
Mean Absolute Error (MAE) is perhaps the most common and interpretable
regression metric. MAE calculates the difference between each data point’s
predicted y-value and the real y-value (taken as the absolute value of one minus
the other), then averages every difference.
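A hand-rolled MAE next to sklearn's mean_absolute_error, on made-up predictions:

```python
from sklearn.metrics import mean_absolute_error

# Made-up illustration values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Average of the absolute differences: (0.5 + 0 + 1.5 + 1) / 4 = 0.75
mae_manual = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
mae_sklearn = mean_absolute_error(y_true, y_pred)
```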
Median Absolute Error is another way of evaluating the typical error. By
focusing on the middle error value it is robust to outliers, but for the same
reason it ignores the extremely high or low errors that are factored into the
mean absolute error.
Mean Square Error (MSE) is another commonly used regression metric that
‘punishes’ higher errors more. For example, an error (difference) of 2 would be
weighted as 4, whereas an error of 5 would be weighted as 25, meaning that MSE
puts the two errors 21 apart, whereas MAE weights the difference at its face
value, 3. MSE calculates the square of the difference between each data point’s
predicted y-value and real y-value, then averages the squares.
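The weighting comparison in the paragraph above, worked through with its own numbers:

```python
# Two errors of size 2 and 5.
err_small, err_large = 2, 5

# Under MSE each error is squared before averaging, so the gap widens.
mse_gap = err_large**2 - err_small**2  # 25 - 4 = 21
mae_gap = err_large - err_small        # 5 - 2 = 3
```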
Root Mean Square Error (RMSE) is used to give a level of interpretability that
mean square error lacks. By square-rooting the MSE, we get a metric on a scale
similar to MAE’s, while still weighting higher errors more heavily.
R²: r2_score
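The regression metrics above all have sklearn.metrics counterparts; a sketch on the same made-up predictions (RMSE is computed here by square-rooting the MSE, since that is how the article defines it):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, median_absolute_error,
                             r2_score)

# Made-up illustration values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
med_ae = median_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back on the same scale as y, like MAE
msle = mean_squared_log_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```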
Statistical Indicators
Four main statistical measures used in data science.
The correlation coefficient can be accessed using the .corr() function through
Pandas DataFrames. Consider the following two sequences:
seq1 = [0,0.5,0.74,1.5,2.9]
seq2 = [4,4.9,8.2,8.3,12.9]
Covariance measures how two variables vary together. It can be computed in
Python with numpy.cov(a,b)[0][1] , where a and b are the sequences to be
compared.
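Running the two sequences above through pandas’ .corr() and numpy.cov:

```python
import numpy as np
import pandas as pd

seq1 = [0, 0.5, 0.74, 1.5, 2.9]
seq2 = [4, 4.9, 8.2, 8.3, 12.9]

# Pearson correlation coefficient via pandas (.corr() is also available
# on whole DataFrames, where it returns a correlation matrix).
corr = pd.Series(seq1).corr(pd.Series(seq2))

# Sample covariance: the off-diagonal entry of the 2x2 covariance matrix.
cov = np.cov(seq1, seq2)[0][1]
```

Both sequences increase together, so the correlation comes out strongly positive (close to 1) and the covariance is positive.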
Standard Deviation is the square root of the variance, and measures how spread
out a distribution is on the same scale as the data itself. Standard deviation
can be computed with the statistics library via statistics.stdev(list) .
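A short sketch with the standard library, confirming that stdev is the square root of the (sample) variance:

```python
import math
import statistics

# Made-up illustration values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = statistics.variance(data)  # sample variance (divides by n - 1)
sd = statistics.stdev(data)      # sample standard deviation
```

statistics.pvariance and statistics.pstdev are the population (divide-by-n) counterparts.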
Types of Distributions
Knowing your distributions is very important when doing data analysis and
deciding which statistical and machine learning methods to use.
Normal Distribution is a very common distribution that resembles a bell-shaped
curve (one name for the distribution is the ‘Bell Curve’). Besides its common
use in data science, many real-world quantities, like IQ or salary, are
approximately normally distributed. It is characterized by the following
features: it is symmetric about its mean, and its mean, median, and mode all
coincide at the peak of the curve.
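Another characteristic feature is the 68–95–99.7 rule: roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean. A quick numerical check with numpy (the sample size and seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100,000 draws from a standard normal distribution (mean 0, std dev 1).
samples = rng.normal(loc=0, scale=1, size=100_000)

# Fraction of samples within 1 and 2 standard deviations of the mean.
within_1sd = np.mean(np.abs(samples) < 1)  # expect ~0.68
within_2sd = np.mean(np.abs(samples) < 2)  # expect ~0.95
```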
Many machine learning algorithms assume normally distributed data. For example,
linear regression assumes the residuals are normally distributed. This can be
visualized and checked with a residplot() . Information and examples of the
usage of this and other statistical models can be found here.
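A non-visual complement to a residual plot is to test the residuals for normality directly. This sketch fits a line with numpy.polyfit and applies scipy’s Shapiro–Wilk test; the synthetic data and the choice of test are illustration assumptions, not from the article:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
# A linear relationship plus Gaussian noise, so the residuals
# should look normally distributed.
y = 3 * x + 2 + rng.normal(scale=1.0, size=x.size)

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Shapiro-Wilk test: a large p-value means we fail to reject
# the hypothesis that the residuals are normally distributed.
stat, p_value = stats.shapiro(residuals)
```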