Astros
1) Given a set of hitting performance metrics, how do you choose which, if any, of the metrics are useful for evaluating players? What
properties do you think a good measure of player performance should exhibit? Note that we are not asking for specific baseball metrics
you like, but rather a general approach to identifying whether a metric is useful.
EXPLANATION: Given a dataset of hitting performance metrics, we first perform Exploratory Data Analysis (EDA) to find out which metrics/attributes are
useful for the analysis.
Before that, we need to make sure the data is clean and free of erroneous values.
This means handling missing values, detecting outliers, and scaling the attributes so they are comparable.
OBJECTIVE: To showcase various methods to approach feature selection.
The basic idea in evaluating the metrics is to find which ones have a strong covariance or correlation with the desired
objective/output.
Covariance: It shows whether two variables tend to move together (positive) or in opposite directions (negative). Its magnitude depends on
the units of the variables, so on its own it does not measure the strength of the relationship.
Correlation: It is covariance rescaled to the range -1 to 1, so it shows both the strength and the direction of the linear relationship, i.e.,
whether a change in one variable is associated with a positive or negative change in the other.
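The difference between the two measures can be illustrated with a short sketch. It reuses the batting_avg and RBIs values from the sample dataset in this answer; the rescaling to "points" (multiplying by 1000) is only an illustration of a change of units.

```python
import numpy as np

# Values taken from the sample dataset used in this answer.
batting_avg = np.array([0.320, 0.275, 0.305, 0.290, 0.310])
rbis = np.array([90, 70, 85, 80, 88])

# Covariance and correlation in the original units.
cov_raw = np.cov(batting_avg, rbis)[0, 1]
corr_raw = np.corrcoef(batting_avg, rbis)[0, 1]

# Same data with batting average expressed in "points" (x1000).
cov_scaled = np.cov(batting_avg * 1000, rbis)[0, 1]
corr_scaled = np.corrcoef(batting_avg * 1000, rbis)[0, 1]

print(f"covariance:  {cov_raw:.4f} -> {cov_scaled:.2f} after rescaling")
print(f"correlation: {corr_raw:.4f} -> {corr_scaled:.4f} after rescaling")
```

The covariance grows by the same factor as the units, while the correlation is unchanged, which is why correlation is the easier of the two to compare across metrics.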
I have taken a small sample dataset to show this in more detail.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Creating a sample dataset with the following attributes
example = {
'player': ['Player1', 'Player2', 'Player3', 'Player4', 'Player5'],
'batting_avg': [0.320, 0.275, 0.305, 0.290, 0.310],
'on_base_pct': [0.400, 0.350, 0.380, 0.370, 0.390],
'slugging_pct': [0.550, 0.450, 0.500, 0.480, 0.520],
'RBIs': [90, 70, 85, 80, 88],
'runs': [100, 85, 95, 90, 98],
'stolen_bases': [5, 15, 8, 7, 12],
'age': [27,21,20,28,25]
}
# creating the dataframe
df = pd.DataFrame(example)
sns.pairplot(df.drop('player', axis=1))
plt.suptitle('Pairplot of Hitting Metrics', y=1.02)
plt.show()
# Correlation Analysis and visualization of heatmap
correlation_matrix = df.drop('player', axis=1).corr()
print("Correlation Matrix:")
print(correlation_matrix)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Hitting Metrics')
plt.show()
Correlation Matrix:
batting_avg on_base_pct slugging_pct RBIs runs \
batting_avg 1.000000 0.992540 0.984185 0.982647 0.995701
on_base_pct 0.992540 1.000000 0.989813 0.986056 0.991678
slugging_pct 0.984185 0.989813 1.000000 0.953463 0.978235
RBIs 0.982647 0.986056 0.953463 1.000000 0.984983
runs 0.995701 0.991678 0.978235 0.984983 1.000000
stolen_bases -0.648027 -0.663151 -0.650462 -0.668256 -0.590085
age 0.337312 0.444936 0.442146 0.398734 0.337700
stolen_bases age
batting_avg -0.648027 0.337312
on_base_pct -0.663151 0.444936
slugging_pct -0.650462 0.442146
RBIs -0.668256 0.398734
runs -0.590085 0.337700
stolen_bases 1.000000 -0.545600
age -0.545600 1.000000
Compared to the other features, stolen_bases (-0.67) and age (0.40) have a weaker correlation with RBIs, which suggests they are
less relevant to our target variable. With only five players in the sample, this is illustrative rather than conclusive.
Filter Methods:
Filter methods select features based on their scores in statistical tests for their correlation with the outcome variable. Examples
include the chi-squared test, information gain, and correlation coefficient scores. These methods are fast and straightforward but
they ignore the potential combined effect of individual features.
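A minimal sketch of a filter method, assuming scikit-learn's `SelectKBest` with the `f_regression` score. The data is synthetic and the column names are hypothetical; the target depends on the first two columns only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic data (hypothetical): two informative columns and one pure-noise column.
on_base = rng.normal(0.33, 0.03, n)
slugging = rng.normal(0.42, 0.05, n)
noise = rng.normal(0.0, 1.0, n)
X = np.column_stack([on_base, slugging, noise])
y = 300 * on_base + 200 * slugging + rng.normal(0, 3, n)

# Score each feature on its own against the target and keep the best two.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
names = np.array(["on_base", "slugging", "noise"])
selected = names[selector.get_support()]

print(dict(zip(names, selector.scores_.round(1))))
print("selected:", list(selected))
```

Each feature is scored independently, which is what makes the method fast and also why it cannot see interactions between features.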
Wrapper Methods:
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared,
evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score
based on model accuracy. Examples of wrapper methods are recursive feature elimination and forward selection.
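A minimal sketch of a wrapper method, assuming scikit-learn's `RFE` (recursive feature elimination) around a linear regression. The data is synthetic: only columns 0 and 2 carry signal.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): four standard-normal features, two of them informative.
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Repeatedly fit the model and drop the feature with the smallest coefficient.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
kept = np.flatnonzero(rfe.support_)

print("kept columns:", kept.tolist())
print("ranking (1 = kept):", rfe.ranking_.tolist())
```

Because the model is refit at every step, wrapper methods are slower than filter methods but take the combined effect of features into account.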
Embedded Methods:
Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most
common type of embedded feature selection methods are regularization methods. Regularization methods are also called
penalization methods that introduce additional constraints into the optimization of a predictive algorithm (like a regression algorithm)
that bias the model toward lower complexity (fewer coefficients).
# Another example: preparing the data for an embedded method that ranks features by importance
from sklearn.model_selection import train_test_split
X = df.drop(['RBIs', 'player'], axis=1)
y = df['RBIs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
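A minimal sketch of an embedded method, assuming a Lasso regression on standardized features. Five rows are too few to fit anything meaningful, so the sketch uses synthetic data; the feature names and the value of `alpha` are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300
names = ["on_base_pct", "slugging_pct", "stolen_bases", "age"]  # hypothetical labels

# Synthetic data: the target depends on the first two features only.
X = rng.normal(size=(n, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Standardize so the L1 penalty treats every feature on the same scale.
X_std = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.5).fit(X_std, y)

# Features whose coefficient is shrunk to exactly zero are dropped by the model itself.
for name, coef in zip(names, model.coef_):
    print(f"{name:>13}: {coef: .3f}")
kept = [name for name, coef in zip(names, model.coef_) if abs(coef) > 1e-8]
print("kept:", kept)
```

The selection happens while the model is being trained, which is what distinguishes an embedded method from the filter and wrapper approaches.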
Once we know the correlation values, we can select the important features: we drop the irrelevant features and focus on the ones
that matter.
2) When developing a model to predict player performance, what methods do you employ to ensure that your model is useful on novel
data?
EXPLANATION: Predicting player performance is important because it helps improve not just the individual player but also the overall
team performance.
OBJECTIVE: Discuss the methods that help a model work well on novel data.
Once we acquire the historical data of the player, the first step is to clean it: removing outliers, imputing missing values with
the mean or median, and normalizing the values.
Randomly splitting the data into training and test sets lets us check that the model we develop generalizes and
works well on novel data (the test data, or any new data added in the future).
Now, the question is: just because we split the data, does the model we develop work well on novel data? Definitely not!
We need to choose the model that best generalizes when predicting player performance, without bias. This can be done as
follows:
Feature selection using correlation, Lasso regression, or a built-in selector such as scikit-learn's SelectKBest.
Next, we perform model selection to find the model that performs best with the selected features and best fits the
problem being solved. Here we can use a regression model, a random forest, or a KNN regressor to predict player performance and see
which attributes contribute most to it.
We can use evaluation metrics suited to the analysis, such as mean squared error for regression or accuracy for classification.
To optimize the model further, we perform hyperparameter tuning, for example adjusting the learning rate or
similar settings that affect the training of the model.
Cross-validation helps in choosing the best hyperparameters: it repeatedly holds out a validation set from the
training set to learn which parameter values suit the model best.
All of this is done to balance the bias-variance trade-off; otherwise the model may
overfit or underfit and not perform well on novel data.
Lastly, once we have a balanced model, we evaluate it once on the held-out test data and calculate metrics such as mean squared error or accuracy to measure its
performance.
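The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, assuming a Ridge regression inside a scikit-learn pipeline; the model, the alpha grid, and the data are hypothetical choices, not a recommendation for real player data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic data (hypothetical): five features, two of which drive the target.
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=400)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Final, one-time evaluation on the held-out data, compared with a mean-only baseline.
test_mse = mean_squared_error(y_test, search.predict(X_test))
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
print("best alpha:", search.best_params_["ridge__alpha"])
print(f"test MSE: {test_mse:.3f}  baseline MSE: {baseline_mse:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the preprocessing step.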
3) If you had batted ball exit speed data for hitters, with widely varying sample sizes, how would you estimate a hitter's true exit speed
skill level?
EXPLANATION: We are provided with batted ball exit speed data for hitters, with widely varying sample sizes.
OBJECTIVE: To estimate a hitter's true exit speed skill level.
In this case, we need to focus on the Bayesian methods which provide a robust framework for estimating the true skill level. Bayesian
inference allows us to incorporate prior knowledge and update our beliefs with new data, taking into account the varying sample sizes.
This approach not only provides a more accurate estimate but also quantifies the uncertainty associated with the estimate.
To make this concrete, consider a small dataset with illustrative exit speed values.
Let's say
Hitter 1: 10 samples, mean exit speed 90
Hitter 2: 50 samples, mean exit speed 95
Hitter 3: 30 samples, mean exit speed 92
# Posterior variance
posterior_variance = 1 / (n / sample_variance + 1 / sigma_prior**2)
# Posterior mean
posterior_mean = posterior_variance * (sample_mean * n / sample_variance + mu_prior / sigma_prior**2)
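With the above method, we can estimate a hitter's true exit speed skill level despite varying sample sizes: hitters with few batted balls are pulled strongly toward the prior mean, while hitters with many batted balls stay close to their own sample mean.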
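A minimal sketch that applies the posterior formulas to the three hitters listed above. The prior mean (89 mph), prior standard deviation (3 mph), and per-ball variance (14 mph squared) are assumptions made for illustration; in practice they would be estimated from league-wide data.

```python
import numpy as np

def posterior_normal(sample_mean, n, sample_variance, mu_prior, sigma_prior):
    """Normal-normal update with known per-observation variance."""
    posterior_variance = 1.0 / (n / sample_variance + 1.0 / sigma_prior**2)
    posterior_mean = posterior_variance * (
        sample_mean * n / sample_variance + mu_prior / sigma_prior**2
    )
    return posterior_mean, np.sqrt(posterior_variance)

# Assumed values, not taken from real data.
mu_prior, sigma_prior = 89.0, 3.0   # prior belief about true exit speed skill
sample_variance = 14.0**2           # spread of individual batted balls

# (sample size, mean exit speed) for the three hitters in the example.
hitters = {"Hitter 1": (10, 90.0), "Hitter 2": (50, 95.0), "Hitter 3": (30, 92.0)}

results = {}
for name, (n, sample_mean) in hitters.items():
    results[name] = posterior_normal(sample_mean, n, sample_variance, mu_prior, sigma_prior)
    mean, std = results[name]
    print(f"{name}: n={n:>2}, observed={sample_mean:.1f}, estimate={mean:.2f} +/- {std:.2f}")
```

Each estimate lands between the prior mean and the hitter's observed mean, and the posterior standard deviation shrinks as the sample size grows.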
4) Our General Manager has a miraculous encounter with a baseball genie who offers to conjure for the team a player to be our
designated hitter. The genie offers us the choice of one of two players, a. Force-Field Fred (magically guaranteed to walk every time he
comes up to bat), or b. Long-Ball Larry (magically guaranteed to homer in 25% of his plate appearances but to strikeout in the other 75%,
outcomes distributed over a random schedule over the full season).
Which player would you rather have on our team? Why? Are there any circumstances regarding the performance of the rest of our team’s
offense for which you would change your answer?
EXPLANATION: We need to analyze between two players and put forth a statement stating who creates a better impact overall to a
team's success.
OBJECTIVE: We use Bayesian inference to update our belief about the expected runs created by each player, Force-Field Fred and
Long-Ball Larry, in the context of the team's performance.
Taking the prior belief into consideration, I assume a prior mean and observed runs for both players, then perform the analysis to find
the posterior distribution of runs.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Priors for runs created per plate appearance
# Force-Field Fred: Prior belief that reflects an average performance
mu_prior_fred = 0.40 # Example prior mean
sigma_prior_fred = 0.02 # Example prior standard deviation
# Long-Ball Larry: Prior belief considering the 25% home run rate
mu_home_run_larry = 0.40 # Expected runs created per PA due to home runs
mu_prior_larry = 0.25 * mu_home_run_larry # Prior mean, adjusted for 25% home run rate
sigma_prior_larry = 0.10 # Example prior standard deviation
# Observed data: runs created per plate appearance (hypothetical)
observed_runs_fred = np.array([0.11, 0.09, 0.12, 0.10, 0.11])
observed_runs_larry = np.array([0.28, 0.32, 0.30, 0.31, 0.29])
Using Bayes' Theorem to update the prior distribution with the observed runs column to get the posterior distribution, which reflects
our updated belief about the player's impact on run creation.
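def calculate_posterior(mu_prior, sigma_prior, observed_runs):
    """Normal-normal update: return the posterior mean and standard deviation."""
    n = len(observed_runs)
    sample_mean = np.mean(observed_runs)
    sample_variance = np.var(observed_runs, ddof=1)  # sample variance (ddof=1 is an assumption)
    # Posterior variance
    posterior_variance = 1 / (n / sample_variance + 1 / sigma_prior**2)
    posterior_std = np.sqrt(posterior_variance)
    # Posterior mean
    posterior_mean = posterior_variance * (sample_mean * n / sample_variance + mu_prior / sigma_prior**2)
    return posterior_mean, posterior_std
# Calculate posteriors
posterior_mean_fred, posterior_std_fred = calculate_posterior(mu_prior_fred, sigma_prior_fred, observed_runs_fred)
posterior_mean_larry, posterior_std_larry = calculate_posterior(mu_prior_larry, sigma_prior_larry, observed_runs_larry)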
So ultimately I would take Long-Ball Larry over Force-Field Fred because of the runs he can create, as suggested above;
in games you also need aggressive intent and a willingness to take chances.
Circumstantially speaking, we need a mixture of players: a single player can win you matches, but a balanced team
can win you tournaments. Hence the team combination is important when choosing a player, but here Larry edges out
Fred.
5) Please answer the following question without using statistical software (using a calculator is acceptable).
Suppose a batter has a true-talent batting average of .300 (he is expected to record hits in 30% of any sample of his official at-bats).
How probable is it that he could record fewer than 100 hits in 600 at-bats?
a.) Greater than 50%
b.) Between 10% and 50%
c.) Between 1% and 10%
d.) Between .1% and 1%
e.) Less than .1%
In a few sentences, please describe how you reached your conclusion.
EXPLANATION: We need to find the probability that the batter records fewer than 100 hits in 600 at-bats.
Let X be the number of hits the batter records in 600 at-bats.
OBJECTIVE: Each at-bat is either a hit or a miss, so we can model X with a binomial
distribution to arrive at the solution.
Number of at-bats (n) = 600
Probability of getting a hit (p) = 0.3
Mean = np = 600 × 0.3 = 180
Standard deviation = sqrt(np(1-p)) = sqrt(600 × 0.3 × (1 - 0.3)) = sqrt(126) = 11.22
Now we need the cumulative probability P(X<100) = P(X=0) + P(X=1) + P(X=2) + ... + P(X=99), which is tedious to compute by hand,
so we approximate X with a normal distribution and standardize: z = (X - Mean) / Standard deviation
z = (100 - 180) / 11.22 => z = -7.13
From the z table, even z = -3.5 gives P(z < -3.5) ≈ 0.0002, which is already below 0.1%; for z = -7.13 the probability is far smaller still. The answer is therefore (e), less than .1%.
6) 11% of MLB players throw left handed. 32% of MLB players hit left handed. 85% of MLB players who throw left handed hit left handed.
Player X hits left handed. What is the probability that player X throws left handed?
EXPLANATION: Probability that a player throws left-handed (P_T) = 0.11. Probability that a player hits left-handed (P_H) = 0.32. Probability
that a player hits left-handed given they throw left-handed (P_H_given_T) = 0.85.
OBJECTIVE: To find the conditional probability that player X throws left-handed given that he hits left-handed, using Bayes' Theorem: P_T_given_H = P_H_given_T × P_T / P_H = 0.85 × 0.11 / 0.32.
The probability that a player throws left-handed given they hit left-handed is: 0.2922
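The calculation behind this figure is a direct application of Bayes' Theorem. A minimal sketch, with variable names chosen here for readability:

```python
# Inputs given in the question.
p_throw_left = 0.11                  # P(throws left)
p_hit_left = 0.32                    # P(hits left)
p_hit_left_given_throw_left = 0.85   # P(hits left | throws left)

# Bayes' Theorem: P(throws left | hits left)
p_throw_left_given_hit_left = (
    p_hit_left_given_throw_left * p_throw_left / p_hit_left
)

print(
    "The probability that a player throws left-handed given they hit left-handed is: "
    f"{p_throw_left_given_hit_left:.4f}"
)
```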
7) Breakout hitter Lou Rice is off to a tremendous start to the season, with hits in 43 of his first 100 at-bats. Going into the season our
belief about Lou’s true-talent batting average, i.e., what Lou’s hits per at-bat rate would be in the limit, could be described as a Beta
distribution parameterized with α = 58 and β = 142. Given our beliefs at the outset of the season and Lou’s performance through his first
100 at-bats, what should we believe Lou’s chance to be batting over .400 (hits in at least 40% of his at-bats) after he’s accumulated 500
at-bats on the season in total? Describe your reasoning.
EXPLANATION: We are told that the hitter has 43 hits in his first 100 at-bats.
hits_observed = 43; total_at_bats = 100;
We are also given a prior belief in which his batting average is described by a Beta distribution with α = 58, β = 142.
So, Lou Rice's expected batting average before the season is E(X) = α/(α+β) = 58/(58+142) = 0.29
OBJECTIVE: We need to find the probability that Lou's batting average will exceed 0.4 after he has accumulated 500 at-bats in total.
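# Calculating the probability of batting over .400 (hits in at least 40% of at-bats)
desired_batting_average = 0.400
alpha_posterior = 58 + hits_observed                     # prior α plus observed hits
beta_posterior = 142 + (total_at_bats - hits_observed)   # prior β plus observed outs
probability_over_400 = 1 - stats.beta.cdf(desired_batting_average, alpha_posterior, beta_posterior)
print(f"The probability that Lou Rice will bat over .400 after 500 at-bats is: {probability_over_400:.4f}")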
The probability that Lou Rice will bat over .400 after 500 at-bats is: 0.0115
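Two related quantities are easy to conflate here: the posterior probability that Lou's true talent exceeds .400, and the predictive probability that his observed average after 500 at-bats is at least .400, which requires 157 more hits in the remaining 400 at-bats. A minimal sketch of both, assuming a conjugate Beta-Binomial update and SciPy's `betabinom` distribution (available in SciPy 1.4 and later):

```python
from scipy import stats

# Prior and observed data from the question.
alpha_prior, beta_prior = 58, 142
hits, at_bats = 43, 100
target_at_bats = 500
hits_needed_total = 200  # 40% of 500 at-bats

# Conjugate update: add hits to alpha and outs to beta.
alpha_post = alpha_prior + hits
beta_post = beta_prior + (at_bats - hits)

# (a) Posterior probability that the true-talent average exceeds .400.
p_talent_over_400 = stats.beta.sf(0.400, alpha_post, beta_post)

# (b) Predictive probability of reaching 200 hits by 500 at-bats:
#     at least `needed` hits in the remaining at-bats, with talent uncertainty included.
remaining = target_at_bats - at_bats
needed = hits_needed_total - hits
p_season_400 = stats.betabinom.sf(needed - 1, remaining, alpha_post, beta_post)

print(f"P(true talent > .400)               = {p_talent_over_400:.4f}")
print(f"P(season average >= .400 at 500 AB) = {p_season_400:.4f}")
```

Quantity (b) accounts for the hits already banked and for the sampling noise over the remaining at-bats, so it need not equal quantity (a).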
To get a better understanding, we can visualize how the probability changes once our prior belief is taken into account by
plotting the posterior distribution as follows.
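# Generate x values
x = np.linspace(0, 1, 1000)
# Generate CDF values
y = stats.beta.cdf(x, alpha_posterior, beta_posterior)
# Plot the posterior CDF and mark the .400 threshold
plt.plot(x, y, label='Posterior CDF')
plt.axvline(desired_batting_average, linestyle='--', label='.400')
plt.xlabel('True-talent batting average')
plt.ylabel('Cumulative probability')
plt.legend()
# Show plot
plt.grid(True)
plt.show()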
8) Over the last few seasons, there’s been a lot of discussion in and around baseball about the qualities of the ball itself. Many have
claimed that the ball has become ‘juiced’, causing batted balls to travel further in the air than they had in previous seasons and resulting
in notably higher home run rates. A prominent hypothesis as to the manner with which the ball changed concerns its drag coefficient.
Drag is a measure of a projectile’s sensitivity to air resistance opposite the direction it’s traveling. A projectile with a lower drag coefficient
will travel through the air further than a similar projectile with a higher drag coefficient, all else equal. Provided is a random sample of
major league batted ball data from the past five seasons. You’re tasked with analyzing the data to determine whether or not the ball
actually varied over these five seasons. Provide any code you used to generate your conclusion (a markdown or notebook file is
recommended!) and present your argument for when and how the ball varied. Illustrations/data visualizations to help communicate your
findings are encouraged. Please spend no more than 4 hours on this question. Fields:
1. year
2. month
3. pitcher_throws ('L' for left handed pitcher, 'R' for right handed pitcher)
4. bat_side ('L' for left handed batter, 'R' for right handed batter)
5. pitch_type ('FF' for four-seam fastball, 'FT' for two-seam fastball)
6. release_speed (magnitude of velocity of the pitch towards the plate at 50' in mph)
7. plate_speed (magnitude of velocity of pitch as it crosses the front of home plate in mph)
8. hit_exit_speed (magnitude of velocity of the batted ball upon contact in mph)
9. hit_spinrate (rate of rotation of the ball upon contact in rpm)
10. hit_vertical_angle (launch angle; direction of the ball off the bat on the vertical plane -- 0 degrees is parallel to the ground, positive is
up, negative is down, in degrees)
11. hit_bearing (direction from the tip of home plate to the initial landing position of the batted ball on the horizontal plane -- 0 degrees is
directly up the middle, positive is towards the 1st base side, negative is towards the 3rd base side, in degrees)
12. hit_distance (distance between the tip of home plate and the initial landing position of the batted ball in feet)
13. event_result (text field describing the category of batted ball event outcome)
EXPLANATION: We are provided with a dataset to analyze whether or not the ball has varied over the last 5 years.
OBJECTIVE: To compare hit_distance across years, estimate the drag coefficient, and compare it over the years to test the hypothesis
using statistical tests.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 5000 non-null int64
1 month 5000 non-null int64
2 pitcher_throws 5000 non-null object
3 bat_side 5000 non-null object
4 pitch_type 5000 non-null object
5 release_speed 5000 non-null float64
6 plate_speed 5000 non-null float64
7 hit_exit_speed 4522 non-null float64
8 hit_spinrate 2939 non-null float64
9 hit_vertical_angle 4522 non-null float64
10 hit_bearing 4522 non-null float64
11 hit_distance 4522 non-null float64
12 event_result 5000 non-null object
dtypes: float64(7), int64(2), object(4)
memory usage: 507.9+ KB
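data.describe()
year month release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_distance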
count 5000.000000 5000.000000 5000.000000 5000.000000 4522.000000 2939.000000 4522.000000 4522.000000 4522.000000
mean 2017.000000 6.555600 91.882837 84.985879 89.997665 2747.016428 11.998355 -0.310381 178.682747
std 1.414355 1.734217 2.817465 2.649820 13.984109 1287.734854 24.391452 26.953949 137.925585
min 2015.000000 3.000000 79.315483 73.368980 19.051146 414.808960 -73.101883 -179.131378 0.506874
25% 2016.000000 5.000000 90.176728 83.369785 82.775057 1682.917419 -3.684100 -18.847590 26.810730
50% 2017.000000 7.000000 91.971830 85.065407 92.808498 2574.025635 12.943902 -0.594337 188.074951
75% 2018.000000 8.000000 93.789923 86.806101 100.107733 3729.503418 28.040596 18.584233 303.585815
max 2019.000000 10.000000 103.396928 95.852409 117.753525 6855.195312 87.656059 176.065750 468.331482
year 0
month 0
pitcher_throws 0
bat_side 0
pitch_type 0
release_speed 0
plate_speed 0
hit_exit_speed 478
hit_spinrate 2061
hit_vertical_angle 478
hit_bearing 478
hit_distance 478
event_result 0
dtype: int64
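year month release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_distance
year 1.885955 -0.063778 0.311990 0.481316 0.107759 3.125765e+02 1.635863 -0.128138 5.741926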
month -0.063778 3.039575 0.396104 0.519096 0.037471 4.889908e+01 0.637760 0.888197 -0.161634
release_speed 0.311990 0.396104 7.710609 7.016733 -0.957655 1.569088e+02 0.520724 -1.127049 1.233447
plate_speed 0.481316 0.519096 7.016733 6.803553 -0.785150 1.301968e+02 -0.723909 -0.141827 -0.085241
hit_exit_speed 0.107759 0.037471 -0.957655 -0.785150 161.774480 -2.974013e+03 -42.857555 6.045054 505.754846
hit_spinrate 312.576517 48.899077 156.908803 130.196805 -2974.013030 1.658261e+06 13059.063748 3406.279827 21993.413916
hit_vertical_angle 1.635863 0.637760 0.520724 -0.723909 -42.857555 1.305906e+04 247.574851 7.174752 709.790666
hit_bearing -0.128138 0.888197 -1.127049 -0.141827 6.045054 3.406280e+03 7.174752 689.055516 50.606922
hit_distance 5.741926 -0.161634 1.233447 -0.085241 505.754846 2.199341e+04 709.790666 50.606922 11114.233301
From the plot, we can infer that plate_speed and release_speed are very strongly correlated (about 0.97, based on the covariance matrix above).
We can also see that hit_exit_speed and hit_vertical_angle are moderately correlated with hit_distance (roughly 0.38 and 0.43).
The average hit distance increased marginally from 2015 to 2017 and then stayed about the same.
But this summarizes all event outcomes; let's analyze each outcome separately.
X = data[features]
y = data[target]
# Conclusion
print("Mean Hitting Distance per Year:")
print(mean_distance_per_year)
Mean Hitting Distance per Year:
year
2015 373.854976
2016 387.742645
2017 292.457418
2018 358.723862
2019 307.871089
Name: hit_distance, dtype: float64
# Group by year and calculate mean hitting distance for home runs
mean_distance_per_year_home_run = home_run_data.groupby('year')['hit_distance'].mean().reset_index()
There is not much variation in the hit distance for home runs, where the drag coefficient should play a large role because the
projectile's sensitivity to air resistance is significant.
We will still use a statistical test to confirm this, since for a double_play there is a big difference.
# Group by year and calculate mean hitting distance for double plays
mean_distance_per_year_double_play = double_play_data.groupby('year')['hit_distance'].mean().reset_index()
mean_distance_per_year = data.groupby('year')['hit_distance'].mean()
variance_distance_per_year = data.groupby('year')['hit_distance'].var()
# Perform ANOVA test to see if there are significant differences between years
years = data['year'].unique()
grouped_distances = [data[data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print("ANOVA test results: F-value =", f_value, ", P-value =", p_value)
# Conclusion
if p_value < 0.05:
print("There is a significant difference in hitting distances over the years, suggesting that the ball might have
else:
print("There is no significant difference in hitting distances over the years, suggesting that the ball might not
The test concludes there is a significant change overall; we will still check separate event types.
event_type = 'home_run'
event_data = data[data['event_result'] == event_type]
mean_distance_per_year_event = event_data.groupby('year')['hit_distance'].mean().reset_index()
# Perform ANOVA test to see if there are significant differences between years for the specific event
years = event_data['year'].unique()
grouped_distances = [event_data[event_data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print(f"ANOVA test results for {event_type}: F-value =", f_value, ", P-value =", p_value)
# Conclusion
if p_value < 0.05:
print(f"There is a significant difference in hitting distances over the years for {event_type}, suggesting that t
else:
print(f"There is no significant difference in hitting distances over the years for {event_type}, suggesting that
We can infer that for home runs there is no significant variation, which suggests the ball has not varied.
event_type = 'triple'
event_data = data[data['event_result'] == event_type]
mean_distance_per_year_event = event_data.groupby('year')['hit_distance'].mean().reset_index()
# Perform ANOVA test to see if there are significant differences between years for the specific event
years = event_data['year'].unique()
grouped_distances = [event_data[event_data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print(f"ANOVA test results for {event_type}: F-value =", f_value, ", P-value =", p_value)
# Conclusion
if p_value < 0.05:
print(f"There is a significant difference in hitting distances over the years for {event_type}, suggesting that t
else:
print(f"There is no significant difference in hitting distances over the years for {event_type}, suggesting that
For a triple, which needs to be hit with a good vertical angle and enough power to travel through the air, the test also suggests there isn't any
significant change.
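The per-event cells above repeat the same steps, so they can be wrapped in one helper. A minimal sketch, assuming the column names from the field list; `anova_by_year` is a hypothetical helper name, the demo data is synthetic, and missing distances are dropped explicitly because `f_oneway` returns NaN when any group contains NaN.

```python
import numpy as np
import pandas as pd
from scipy import stats

def anova_by_year(df, value_col="hit_distance", event=None):
    """One-way ANOVA of value_col across years, optionally for one event type."""
    subset = df if event is None else df[df["event_result"] == event]
    subset = subset.dropna(subset=[value_col])
    groups = [g[value_col].to_numpy() for _, g in subset.groupby("year")]
    return stats.f_oneway(*groups)

# Synthetic demo data (hypothetical): 2017 home runs carry 15 ft further on average.
rng = np.random.default_rng(4)
years = np.repeat([2015, 2016, 2017], 100)
shift = np.where(years == 2017, 15.0, 0.0)
demo = pd.DataFrame({
    "year": years,
    "event_result": "home_run",
    "hit_distance": rng.normal(400, 20, 300) + shift,
})
demo.loc[demo.index % 25 == 0, "hit_distance"] = np.nan  # simulate missing values

f_value, p_value = anova_by_year(demo, event="home_run")
print(f"ANOVA for home_run: F = {f_value:.2f}, p = {p_value:.2e}")
```

On the real data, the same call could be repeated for each value of event_result to compare the outcomes side by side.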
Let's calculate the drag coefficient and try to check the variation of the ball.
# Constants
g = 32.174 # gravity in ft/s^2
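One way a per-ball drag coefficient could be estimated is to simulate the flight with quadratic drag and solve for the coefficient that reproduces the observed hit_distance. The sketch below is an illustration under stated assumptions, not the calculation used in this notebook: the ball mass, radius, air density, and contact height are assumed values, spin (lift) and wind are ignored, and it only makes sense for balls hit in the air. The names `carry_distance` and `estimate_cd` are hypothetical.

```python
import math
from scipy.optimize import brentq

G = 32.174                     # gravity in ft/s^2 (same constant as above)
MPH_TO_FTS = 5280.0 / 3600.0   # miles per hour to feet per second

# Assumed ball and air properties (not from the dataset).
MASS_KG = 0.145
RADIUS_M = 0.0366
AIR_DENSITY = 1.2              # kg/m^3
# Drag factor per unit Cd, converted from 1/m to 1/ft.
K_PER_CD = 0.5 * AIR_DENSITY * math.pi * RADIUS_M**2 / MASS_KG * 0.3048

def carry_distance(exit_speed_mph, launch_angle_deg, cd, dt=0.005, h0=3.0):
    """Horizontal distance (ft) to first landing, from a simple Euler simulation."""
    v0 = exit_speed_mph * MPH_TO_FTS
    theta = math.radians(launch_angle_deg)
    vx, vy = v0 * math.cos(theta), v0 * math.sin(theta)
    x, y = 0.0, h0
    k = K_PER_CD * cd
    while True:
        speed = math.hypot(vx, vy)
        vx -= k * speed * vx * dt
        vy -= (G + k * speed * vy) * dt
        x_new, y_new = x + vx * dt, y + vy * dt
        if y_new <= 0.0:
            # Interpolate linearly to the point where the ball reaches the ground.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new

def estimate_cd(exit_speed_mph, launch_angle_deg, distance_ft, cd_max=2.0):
    """Drag coefficient that reproduces distance_ft, or NaN if none exists in [0, cd_max]."""
    def miss(cd):
        return carry_distance(exit_speed_mph, launch_angle_deg, cd) - distance_ft
    if miss(0.0) < 0.0 or miss(cd_max) > 0.0:
        return float("nan")
    return brentq(miss, 0.0, cd_max)

# Round-trip check on a made-up fly ball: simulate with a known Cd, then recover it.
true_cd = 0.35
simulated_distance = carry_distance(100.0, 28.0, true_cd)
recovered_cd = estimate_cd(100.0, 28.0, simulated_distance)
print(f"distance = {simulated_distance:.1f} ft, recovered Cd = {recovered_cd:.4f}")
```

Applied row by row to fly balls with non-missing hit_exit_speed, hit_vertical_angle, and hit_distance, this would give a drag_coefficient column that can be grouped by year. Because lift from backspin is ignored, the absolute values would be biased, so only year-to-year differences would be meaningful.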
data.head()
year month pitcher_throws bat_side pitch_type release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_d
import pandas as pd
import numpy as np
from scipy import stats
# Perform ANOVA test to see if there are significant differences between years
years = data['year'].unique()
grouped_drag_coefficients = [data[data['year'] == year]['drag_coefficient'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_drag_coefficients)
print("ANOVA test results: F-value =", f_value, ", P-value =", p_value)
The drag coefficient we calculated also indicates that the ball has not varied over the years.
From the analysis and the statistical tests performed, most cases suggest the ball hasn't varied much, and the cases
showing a change could be the result of a data imbalance, which might cause a bias. Calculating the drag coefficient likewise
indicates that the ball has not varied.