
Houston Astros Questionnaire

1) Given a set of hitting performance metrics, how do you choose which, if any, of the metrics are useful for evaluating players? What
properties do you think a good measure of player performance should exhibit? Note that we are not asking for specific baseball metrics
you like, but rather a general approach to identifying whether a metric is useful.

EXPLANATION: When we are provided with a dataset of hitting performance metrics and need to determine which metrics/attributes are
useful for our analysis, we start with Exploratory Data Analysis (EDA).
Before that, we need to ensure the data is clean and free of erroneous values.
This is done by handling missing values, detecting outliers, and scaling the attributes for uniformity.
OBJECTIVE: To showcase various methods to approach feature selection.
The basic idea in evaluating the metrics is to identify which metrics have high covariance and correlation with the desired
objective/output variable.
Covariance: It measures how closely two variables change together; its sign tells us whether they tend to move in the same or
opposite directions, but its magnitude depends on the units of the variables.
Correlation: It measures both the strength and the direction of the relationship on a standardized scale from -1 to +1, so changes in
one variable are associated either positively or negatively with the other.
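For reference, the standard definitions are:

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

Correlation is simply covariance rescaled by the two standard deviations, which bounds it between −1 and +1 and makes it comparable across metrics measured in different units.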
I have created a small sample dataset to show this in more detail.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Creating a random dataset with attributes as follows
example = {
'player': ['Player1', 'Player2', 'Player3', 'Player4', 'Player5'],
'batting_avg': [0.320, 0.275, 0.305, 0.290, 0.310],
'on_base_pct': [0.400, 0.350, 0.380, 0.370, 0.390],
'slugging_pct': [0.550, 0.450, 0.500, 0.480, 0.520],
'RBIs': [90, 70, 85, 80, 88],
'runs': [100, 85, 95, 90, 98],
'stolen_bases': [5, 15, 8, 7, 12],
'age': [27,21,20,28,25]
}
# creating the dataframe
df = pd.DataFrame(example)
sns.pairplot(df.drop('player', axis=1))
plt.suptitle('Pairplot of Hitting Metrics', y=1.02)
plt.show()
# Correlation Analysis and visualization of heatmap
correlation_matrix = df.drop('player', axis=1).corr()
print("Correlation Matrix:")
print(correlation_matrix)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Hitting Metrics')
plt.show()

Correlation Matrix:
batting_avg on_base_pct slugging_pct RBIs runs \
batting_avg 1.000000 0.992540 0.984185 0.982647 0.995701
on_base_pct 0.992540 1.000000 0.989813 0.986056 0.991678
slugging_pct 0.984185 0.989813 1.000000 0.953463 0.978235
RBIs 0.982647 0.986056 0.953463 1.000000 0.984983
runs 0.995701 0.991678 0.978235 0.984983 1.000000
stolen_bases -0.648027 -0.663151 -0.650462 -0.668256 -0.590085
age 0.337312 0.444936 0.442146 0.398734 0.337700

stolen_bases age
batting_avg -0.648027 0.337312
on_base_pct -0.663151 0.444936
slugging_pct -0.650462 0.442146
RBIs -0.668256 0.398734
runs -0.590085 0.337700
stolen_bases 1.000000 -0.545600
age -0.545600 1.000000
We can clearly see that, compared to the other features, stolen bases and age have much weaker correlation with RBIs, which indicates
they are less relevant to our target variable.

Filter Methods:

Filter methods select features based on their scores in statistical tests for their correlation with the outcome variable. Examples
include the chi-squared test, information gain, and correlation coefficient scores. These methods are fast and straightforward but
they ignore the potential combined effect of individual features.
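As a minimal sketch of a filter method on the toy df above, scikit-learn's SelectKBest with an f_regression score ranks features by their univariate relationship with RBIs (the choice of k = 3 here is arbitrary):

from sklearn.feature_selection import SelectKBest, f_regression

X = df.drop(['RBIs', 'player'], axis=1)
y = df['RBIs']

# Score each feature individually against the target (univariate F-test)
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

# Higher scores indicate a stronger univariate relationship with RBIs
for feature, score in zip(X.columns, selector.scores_):
    print(f"{feature}: {score:.2f}")
print("Selected features:", list(X.columns[selector.get_support()]))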

Wrapper Methods:

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared,
evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score
based on model accuracy. Examples of wrapper methods are recursive feature elimination and forward selection.
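A minimal wrapper-method sketch, again assuming the toy df above, using recursive feature elimination (RFE) around a linear regression:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = df.drop(['RBIs', 'player'], axis=1)
y = df['RBIs']

# RFE repeatedly fits the model and drops the weakest feature until 3 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)

for feature, kept, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{feature}: kept={kept}, rank={rank}")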

Embedded Methods:

Embedded methods learn which features contribute most to the accuracy of the model while the model is being created. The most
common embedded feature selection methods are regularization methods, also called penalization methods, which introduce
additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower
complexity (fewer coefficients).
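The example that follows uses Random Forest feature importances; a sketch of the regularization approach described above could instead use Lasso (assuming the same toy df; the alpha value is arbitrary):

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = df.drop(['RBIs', 'player'], axis=1)
y = df['RBIs']

# Standardize so the L1 penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks weak coefficients to exactly zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)

for feature, coef in zip(X.columns, lasso.coef_):
    print(f"{feature}: {coef:.3f}")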

# Another example where I have used an embedded-style method (Random Forest feature importances) to show feature selection
from sklearn.model_selection import train_test_split
X = df.drop(['RBIs', 'player'], axis=1)
y = df['RBIs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Importance using Random Forest


from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
feature_importances = rf_model.feature_importances_
print("Feature Importances:", feature_importances)
# Visualize Feature Importances
features = X.columns
plt.figure(figsize=(8, 6))
plt.barh(features, feature_importances)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance from Random Forest')
plt.show()

Feature Importances: [0.21061774 0.20053971 0.14464277 0.21270678 0.13329108 0.09820191]

Once we know the correlation values and feature importances, we can extract the important features, which is essentially feature selection:
we remove the irrelevant features and focus on the important ones.

2) When developing a model to predict player performance, what methods do you employ to ensure that your model is useful on novel
data?

EXPLANATION: Predicting player performance is important not only for improving the individual player but also for uplifting overall
team performance.
OBJECTIVE: Discuss the methods to employ so that a model works well on novel data.
Once we acquire the player's historical data, the first step is to clean it: removing outliers, imputing missing values with the mean or
median, and normalizing the values.
Randomly splitting the data into training and testing sets is essential to make sure the model we develop generalizes and works well
on novel data (the test data, or any new data added in the future).
Now, the question is: just because we split the data, will the model work well on novel data? Definitely not!
We need to choose a model that generalizes the task of predicting player performance without bias, and this can be done by the following
steps:

Feature selection using correlation, Lasso regression, or a built-in k-best feature selection function.
Next, we perform model selection, choosing the model that performs best with the selected features and best matches the problem
being solved. Here we can use a linear regression model, a random forest, or a KNN regressor to predict how the attributes contribute
to player performance.
We use evaluation metrics such as mean squared error (or accuracy, for classification), whichever best suits the analysis.
To optimize the model further, we perform hyperparameter tuning, adjusting factors such as the learning rate or other training
settings to improve how the model is trained.
Cross-validation is one technique that helps choose the best hyperparameters by carving a validation set out of the training set and
learning which parameter values best suit the model (a minimal sketch follows this list).
All of this should be done to keep the model balanced in the bias-variance trade-off; otherwise it may become overfitted or
underfitted and will not perform well on novel data.
Lastly, once we have a balanced and better model, we evaluate it on the test data and compute metrics such as accuracy or MSE to
gauge its performance.
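A minimal sketch of this workflow, assuming a feature matrix X and target y from a realistically sized dataset (the 5-row toy example in question 1 is too small for 5-fold cross-validation), with an arbitrary parameter grid:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set that the tuning never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation over a small, arbitrary hyperparameter grid
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [2, 4, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test-set score:", search.score(X_test, y_test))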

3) If you had batted ball exit speed data for hitters, with widely varying sample sizes, how would you estimate a hitter's true exit speed
skill level?

EXPLANATION: We are provided with batted ball exit speed data for hitters, with widely varying sample sizes.
OBJECTIVE: To estimate a hitter's true exit speed skill level.
In this case, we need to focus on the Bayesian methods which provide a robust framework for estimating the true skill level. Bayesian
inference allows us to incorporate prior knowledge and update our beliefs with new data, taking into account the varying sample sizes.
This approach not only provides a more accurate estimate but also quantifies the uncertainty associated with the estimate.
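Concretely, with a normal prior N(μ₀, σ₀²) on a hitter's true mean exit speed and (approximately) normal observations with sample mean x̄ and sample variance s² over n batted balls, the conjugate update used in the code below is:

posterior_variance = 1 / (n / s² + 1 / σ₀²)
posterior_mean = posterior_variance × (n · x̄ / s² + μ₀ / σ₀²)

Hitters with small samples are pulled more strongly toward the league-wide prior (shrinkage), which is exactly the behavior we want when sample sizes vary widely.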
To understand it more precisely, I am going to consider a dataset with random values of exit speed.
Let's say
Hitter 1: 10 samples, mean exit speed 90
Hitter 2: 50 samples, mean exit speed 95
Hitter 3: 30 samples, mean exit speed 92

from scipy.stats import norm


# Below is random data generated with different sample sizes and speed values for 3 different players.
exit_speeds = {
'hitter1':np.random.normal(loc=90, scale=5, size=10),
'hitter2':np.random.normal(loc=95, scale=8, size=50),
'hitter3':np.random.normal(loc=92, scale=6, size=30)
}
# Visualize exit speeds
plt.figure(figsize=(10, 6))
for player, speeds in exit_speeds.items():
    sns.histplot(speeds, kde=True, label=player, stat='density', linewidth=0)
plt.title('Distribution of Exit Speeds')
plt.xlabel('Exit Speed (mph)')
plt.ylabel('Density')
plt.legend()
plt.show()

import scipy.stats as stats


import numpy as np
import matplotlib.pyplot as plt
# Setting up the Prior distribution parameters
mu_prior = 90
sigma_prior = 10
posterior_means = {}
# Function to calculate the posterior mean and variance
def calculate_posterior(data, mu_prior, sigma_prior):
    n = len(data)
    sample_mean = np.mean(data)
    sample_variance = np.var(data, ddof=1)

    # Posterior variance
    posterior_variance = 1 / (n / sample_variance + 1 / sigma_prior**2)

    # Posterior mean
    posterior_mean = posterior_variance * (sample_mean * n / sample_variance + mu_prior / sigma_prior**2)

    return posterior_mean, np.sqrt(posterior_variance)

# Calculate posterior for each hitter


for hitter, speeds in exit_speeds.items():
    posterior_mean, posterior_std = calculate_posterior(speeds, mu_prior, sigma_prior)
    posterior_means[hitter] = posterior_mean
    print(f"{hitter}: Posterior Mean = {posterior_mean:.2f}, Posterior Std = {posterior_std:.2f}")

# Plotting posterior distributions


x = np.linspace(80, 105, 1000)
for hitter, speeds in exit_speeds.items():
    posterior_mean, posterior_std = calculate_posterior(speeds, mu_prior, sigma_prior)
    y = stats.norm.pdf(x, posterior_mean, posterior_std)
    plt.plot(x, y, label=f'{hitter}: {posterior_mean:.2f} ± {posterior_std:.2f}')

# Labels and title


plt.xlabel('Exit Speed')
plt.ylabel('Density')
plt.title('Posterior Distributions of True Exit Speed Skill Levels')
plt.legend()
plt.grid(True)
plt.show()

hitter1: Posterior Mean = 89.53, Posterior Std = 2.29


hitter2: Posterior Mean = 95.94, Posterior Std = 1.08
hitter3: Posterior Mean = 92.64, Posterior Std = 1.02

With the above method, we will be able to approximate a hitter's true exit speed skill level with varying sample sizes.

4) Our General Manager has a miraculous encounter with a baseball genie who offers to conjure for the team a player to be our
designated hitter. The genie offers us the choice of one of two players, a. Force-Field Fred (magically guaranteed to walk every time he
comes up to bat), or b. Long-Ball Larry (magically guaranteed to homer in 25% of his plate appearances but to strikeout in the other 75%,
outcomes distributed over a random schedule over the full season).

Which player would you rather have on our team? Why? Are there any circumstances regarding the performance of the rest of our team’s
offense for which you would change your answer?

EXPLANATION: We need to compare the two players and state which one creates a better overall impact on the team's success.
OBJECTIVE: We use Bayesian inference to update our belief about the expected runs created per plate appearance by each player,
Force-Field Fred and Long-Ball Larry, in the context of the team's performance.

Taking the prior belief into consideration, I assume prior means and hypothetical observed runs-created values for both players, then
compute the posterior distribution of runs created.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Priors for runs created per plate appearance
# Force-Field Fred: Prior belief that reflects an average performance
mu_prior_fred = 0.40 # Example prior mean
sigma_prior_fred = 0.02 # Example prior standard deviation
# Long-Ball Larry: Prior belief considering the 25% home run rate
mu_home_run_larry = 0.40 # Expected runs created per PA due to home runs
mu_prior_larry = 0.25 * mu_home_run_larry # Prior mean, adjusted for 25% home run rate
sigma_prior_larry = 0.10 # Example prior standard deviation
# Observed data: runs created per plate appearance (hypothetical)
observed_runs_fred = np.array([0.11, 0.09, 0.12, 0.10, 0.11])
observed_runs_larry = np.array([0.28, 0.32, 0.30, 0.31, 0.29])

We use Bayes' theorem to update the prior distribution with the observed runs data to get the posterior distribution, which reflects
our updated belief about each player's impact on run creation.

# Posterior parameters calculation function


def calculate_posterior(mu_prior, sigma_prior, observed_data):
    n = len(observed_data)
    sample_mean = np.mean(observed_data)
    sample_variance = np.var(observed_data, ddof=1)

    # Posterior variance
    posterior_variance = 1 / (n / sample_variance + 1 / sigma_prior**2)
    posterior_std = np.sqrt(posterior_variance)

    # Posterior mean
    posterior_mean = posterior_variance * (sample_mean * n / sample_variance + mu_prior / sigma_prior**2)

    return posterior_mean, posterior_std

# Calculate posteriors
posterior_mean_fred, posterior_std_fred = calculate_posterior(mu_prior_fred, sigma_prior_fred, observed_runs_fred)
posterior_mean_larry, posterior_std_larry = calculate_posterior(mu_prior_larry, sigma_prior_larry, observed_runs_larry)

print(f"Force-Field Fred: Posterior Mean = {posterior_mean_fred:.4f}, Posterior Std = {posterior_std_fred:.4f}")


print(f"Long-Ball Larry: Posterior Mean = {posterior_mean_larry:.4f}, Posterior Std = {posterior_std_larry:.4f}"

Force-Field Fred: Posterior Mean = 0.1239, Posterior Std = 0.0049


Long-Ball Larry: Posterior Mean = 0.2990, Posterior Std = 0.0071

# Plotting posterior distributions


x = np.linspace(0, 0.5, 1000)
posterior_fred = stats.norm.pdf(x, posterior_mean_fred, posterior_std_fred)
posterior_larry = stats.norm.pdf(x, posterior_mean_larry, posterior_std_larry)

plt.plot(x, posterior_fred, label=f'Fred: {posterior_mean_fred:.4f} ± {posterior_std_fred:.4f}')


plt.plot(x, posterior_larry, label=f'Larry: {posterior_mean_larry:.4f} ± {posterior_std_larry:.4f}')

plt.xlabel('Runs Created per Plate Appearance')


plt.ylabel('Density')
plt.title('Posterior Distributions of Runs Created')
plt.legend()
plt.grid(True)
plt.show()

So ultimately I would take Long-Ball Larry, because of the runs he can create, rather than Force-Field Fred, as suggested by the analysis
above; in games you also need aggressive intent and a willingness to take chances.
Circumstantially speaking, we need a mixture of players: a single player can win you matches, but a balanced team wins you
tournaments. Team composition therefore matters when choosing a player; if the hitters around him were exceptional at driving in
baserunners, Fred's guaranteed on-base ability would become more valuable and could change the answer, but here Larry edges out Fred.

5) Please answer the following question without using statistical software (using a calculator is acceptable).
Suppose a batter has a true-talent batting average of .300 (he is expected to record hits in 30% of any sample of his official at-bats).
How probable is it that he could record fewer than 100 hits in 600 at-bats?
a.) Greater than 50%
b.) Between 10% and 50%
c.) Between 1% and 10%
d.) Between .1% and 1%
e.) Less than .1%
In a few sentences, please describe how you reached your conclusion.

EXPLANATION: We need to find the probability of the batter recording fewer than 100 hits in 600 at-bats.
Let X be the number of hits the batter records in 600 at-bats.

OBJECTIVE: We can treat each at-bat as a hit or a miss, since those are the only outcomes considered, so the binomial
distribution applies.

Number of at-bats (n) = 600
Probability of getting a hit (p) = 0.3

Mean = np = 600 × 0.3 = 180
Standard deviation = sqrt(np(1-p)) = sqrt(600 × 0.3 × 0.7) ≈ 11.22
Now, we need the cumulative probability P(X < 100) = P(X=0) + P(X=1) + ... + P(X=99), which is tedious to compute by hand,
hence we approximate X with z-standardization, where z = (X − mean) / standard deviation
z = (100 − 180) / 11.22 ≈ −7.13

From the z-table, even for values as low as z = −3.5 we have P(Z < −3.5) ≈ 0.0002, so for z ≈ −7.13 the probability is far smaller still.

ANSWER:e.) Less than .1%

6) 11% of MLB players throw left handed. 32% of MLB players hit left handed. 85% of MLB players who throw left handed hit left handed.
Player X hits left handed. What is the probability that player X throws left handed?

EXPLANATION: Probability that a player throws left-handed, P(T) = 0.11. Probability that a player hits left-handed, P(H) = 0.32. Probability
that a player hits left-handed given that he throws left-handed, P(H|T) = 0.85.
OBJECTIVE: To find the conditional probability that player X throws left-handed given that he hits left-handed, using Bayes' theorem:
P(T|H) = P(H|T) × P(T) / P(H) = 0.85 × 0.11 / 0.32 ≈ 0.2922.

# Defining the known probabilities and applying Bayes' theorem
P_T = 0.11           # P(throws left-handed)
P_H = 0.32           # P(hits left-handed)
P_H_given_T = 0.85   # P(hits left-handed | throws left-handed)

P_T_given_H = (P_H_given_T * P_T) / P_H

# Display the result


print(f"The probability that a player throws left-handed given they hit left-handed is: {P_T_given_H:.4f}")

# Visualization: Plotting the probabilities


labels = ['Throws Left-Handed and Hits Left-Handed', 'Other Cases']
sizes = [P_T_given_H, 1 - P_T_given_H]
colors = ['skyblue', 'lightcoral']
explode = (0.1, 0) # explode the first slice
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.2f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Probability of Throwing Left-Handed Given Hitting Left-Handed')
plt.show()

The probability that a player throws left-handed given they hit left-handed is: 0.2922

7) Breakout hitter Lou Rice is off to a tremendous start to the season, with hits in 43 of his first 100 at-bats. Going into the season our
belief about Lou’s true-talent batting average, i.e., what Lou’s hits per at-bat rate would be in the limit, could be described as a Beta
distribution parameterized with α = 58 and β = 142. Given our beliefs at the outset of the season and Lou’s performance through his first
100 at-bats, what should we believe Lou’s chance to be batting over .400 (hits in at least 40% of his at-bats) after he’s accumulated 500
at-bats on the season in total? Describe your reasoning.

EXPLANATION: We are provided with the information that the hitter has 43 hits in his first 100 at-bats.
hits_observed = 43; total_at_bats = 100;
We are also provided with a prior belief describing his true-talent batting average as a Beta distribution with α = 58 and β = 142.
So, Lou Rice's expected prior batting average is E(X) = α/(α+β) = 58/(58+142) = 0.29.
OBJECTIVE:We need to find the probability that Lou's batting average will exceed over 0.4 after he has accumulated 500 at-bats in total.

import scipy.stats as stats

# Prior belief (before the season)


alpha_prior = 58
beta_prior = 142

# Observation (first 100 at-bats)


hits_observed = 43
total_at_bats = 100

# Updating Beta distribution parameters considering the belief probability as well.


alpha_posterior = alpha_prior + hits_observed
beta_posterior = beta_prior + (total_at_bats - hits_observed)

# Calculating the probability of batting over .400 (hits in at least 40% of at-bats)
desired_batting_average = 0.400

# Computing the probability using the Beta distribution CDF,P(X>0.4)


probability_over_400 = 1 - stats.beta.cdf(desired_batting_average, alpha_posterior, beta_posterior)

print(f"The probability that Lou Rice will bat over .400 after 500 at-bats is: {probability_over_400:.4f}")

The probability that Lou Rice will bat over .400 after 500 at-bats is: 0.0115

For a better understanding, we can visualize the posterior distribution, after incorporating our prior belief, and see where the 0.400
threshold falls on its CDF.

# Calculating CDF value at 0.400


cdf_at_desired_ba = stats.beta.cdf(desired_batting_average, alpha_posterior, beta_posterior)
probability_over_400 = 1 - cdf_at_desired_ba

# Generate x values
x = np.linspace(0, 1, 1000)
# Generate CDF values
y = stats.beta.cdf(x, alpha_posterior, beta_posterior)

# Plot the CDF


plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Beta CDF', color='blue')

# Highlight the point at x = 0.400


plt.axvline(desired_batting_average, color='red', linestyle='dashed', linewidth=1, label=f'x = {desired_batting_average}')
plt.axhline(cdf_at_desired_ba, color='green', linestyle='dashed', linewidth=1, label=f'CDF at x = {desired_batting_average}')

# Add text annotations


plt.text(desired_batting_average + 0.02, 0.5, f'CDF(0.400) = {cdf_at_desired_ba:.4f}', color='green')
plt.text(0.5, cdf_at_desired_ba + 0.02, f'1 - CDF(0.400) = {probability_over_400:.4f}', color='red')

# Labels and title


plt.xlabel('Batting Average')
plt.ylabel('CDF')
plt.title('Cumulative Distribution Function (CDF) for Lou Rice\'s Batting Average')
plt.legend()

# Show plot
plt.grid(True)
plt.show()
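One caveat: the value above is the probability that Lou's true-talent average exceeds .400. His observed average after 500 total at-bats also depends on binomial variation over his remaining 400 at-bats; a minimal sketch of that calculation using the beta-binomial posterior predictive (to finish at .400 he needs at least 200 total hits, i.e., at least 157 more):

from scipy.stats import betabinom

# Posterior over true talent after the first 100 at-bats (from above)
alpha_posterior = 58 + 43
beta_posterior = 142 + 57

# Hits needed in the remaining 400 at-bats to finish at .400 or better over 500 at-bats
hits_needed = 200 - 43

# Posterior predictive: P(at least 157 hits in the next 400 at-bats)
prob_over_400_after_500 = betabinom.sf(hits_needed - 1, 400, alpha_posterior, beta_posterior)
print(f"P(batting .400 or better after 500 at-bats) = {prob_over_400_after_500:.4f}")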
8) Over the last few seasons, there’s been a lot of discussion in and around baseball about the qualities of the ball itself. Many have
claimed that the ball has become ‘juiced’, causing batted balls to travel further in the air than they had in previous seasons and resulting
in notably higher home run rates. A prominent hypothesis as to the manner in which the ball changed concerns its drag coefficient.
Drag is a measure of a projectile’s sensitivity to air resistance opposite the direction it’s traveling. A projectile with a lower drag coefficient
will travel through the air further than a similar projectile with a higher drag coefficient, all else equal. Provided is a random sample of
major league batted ball data from the past five seasons. You’re tasked with analyzing the data to determine whether or not the ball
actually varied over these five seasons. Provide any code you used to generate your conclusion (a markdown or notebook file is
recommended!) and present your argument for when and how the ball varied. Illustrations/data visualizations to help communicate your
findings are encouraged. Please spend no more than 4 hours on this question. Fields:

1. year
2. month
3. pitcher_throws ('L' for left handed pitcher, 'R' for right handed pitcher)
4. bat_side ('L' for left handed batter, 'R' for right handed batter)
5. pitch_type ('FF' for four-seam fastball, 'FT' for two-seam fastball)
6. release_speed (magnitude of velocity of the pitch towards the plate at 50' in mph)
7. plate_speed (magnitude of velocity of pitch as it crosses the front of home plate in mph)
8. hit_exit_speed (magnitude of velocity of the batted ball upon contact in mph)
9. hit_spinrate (rate of rotation of the ball upon contact in rpm)
10. hit_vertical_angle (launch angle; direction of the ball off the bat on the vertical plane -- 0 degrees is parallel to the ground, positive is
up, negative is down, in degrees)
11. hit_bearing (direction from the tip of home plate to the initial landing position of the batted ball on the horizontal plane -- 0 degrees is
directly up the middle, positive is towards the 1st base side, negative is towards the 3rd base side, in degrees)

12. hit_distance (distance between the tip of home plate and the initial landing position of the batted ball in feet)

13. event_result (text field describing the category of batted ball event outcome)

EXPLANATION: We are provided with a dataset to analyze whether the ball has varied over the last 5 seasons.
OBJECTIVE: To compare hit_distance across years, estimate a drag-related quantity, and check for year-to-year variation to evaluate
the hypothesis using exploratory analysis and statistical hypothesis tests (ANOVA).

#importing the necessary libraries


import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

#loading the dataset to a dataframe


data=pd.read_csv(r'C:\Users\karth\Downloads\data_sample.csv')

#getting an overview about the attributes and their count


data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 5000 non-null int64
1 month 5000 non-null int64
2 pitcher_throws 5000 non-null object
3 bat_side 5000 non-null object
4 pitch_type 5000 non-null object
5 release_speed 5000 non-null float64
6 plate_speed 5000 non-null float64
7 hit_exit_speed 4522 non-null float64
8 hit_spinrate 2939 non-null float64
9 hit_vertical_angle 4522 non-null float64
10 hit_bearing 4522 non-null float64
11 hit_distance 4522 non-null float64
12 event_result 5000 non-null object
dtypes: float64(7), int64(2), object(4)
memory usage: 507.9+ KB

data.describe()

year month release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_distance

count 5000.000000 5000.000000 5000.000000 5000.000000 4522.000000 2939.000000 4522.000000 4522.000000 4522.000000

mean 2017.000000 6.555600 91.882837 84.985879 89.997665 2747.016428 11.998355 -0.310381 178.682747

std 1.414355 1.734217 2.817465 2.649820 13.984109 1287.734854 24.391452 26.953949 137.925585

min 2015.000000 3.000000 79.315483 73.368980 19.051146 414.808960 -73.101883 -179.131378 0.506874

25% 2016.000000 5.000000 90.176728 83.369785 82.775057 1682.917419 -3.684100 -18.847590 26.810730

50% 2017.000000 7.000000 91.971830 85.065407 92.808498 2574.025635 12.943902 -0.594337 188.074951

75% 2018.000000 8.000000 93.789923 86.806101 100.107733 3729.503418 28.040596 18.584233 303.585815

max 2019.000000 10.000000 103.396928 95.852409 117.753525 6855.195312 87.656059 176.065750 468.331482

#checking the null values


data.isnull().sum()

year 0
month 0
pitcher_throws 0
bat_side 0
pitch_type 0
release_speed 0
plate_speed 0
hit_exit_speed 478
hit_spinrate 2061
hit_vertical_angle 478
hit_bearing 478
hit_distance 478
event_result 0
dtype: int64

#dropping the null values


data=data.dropna()

#finding the covariance to find which features have an impact on hit_distance


data.cov()

C:\Users\karth\AppData\Local\Temp\ipykernel_40188\360834189.py:2: FutureWarning: The default value of numeric_only in DataFrame.cov is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data.cov()
year month release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_distance

year 1.885955 -0.063778 0.311990 0.481316 0.107759 3.125765e+02 1.635863 -0.128138 5.741926

month -0.063778 3.039575 0.396104 0.519096 0.037471 4.889908e+01 0.637760 0.888197 -0.161634

release_speed 0.311990 0.396104 7.710609 7.016733 -0.957655 1.569088e+02 0.520724 -1.127049 1.233447

plate_speed 0.481316 0.519096 7.016733 6.803553 -0.785150 1.301968e+02 -0.723909 -0.141827 -0.085241

hit_exit_speed 0.107759 0.037471 -0.957655 -0.785150 161.774480 -2.974013e+03 -42.857555 6.045054 505.754846

hit_spinrate 312.576517 48.899077 156.908803 130.196805 -2974.013030 1.658261e+06 13059.063748 3406.279827 21993.413916

hit_vertical_angle 1.635863 0.637760 0.520724 -0.723909 -42.857555 1.305906e+04 247.574851 7.174752 709.790666

hit_bearing -0.128138 0.888197 -1.127049 -0.141827 6.045054 3.406280e+03 7.174752 689.055516 50.606922

hit_distance 5.741926 -0.161634 1.233447 -0.085241 505.754846 2.199341e+04 709.790666 50.606922 11114.233301

#defining the correlation matrix


correlation_matrix = data.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Player Performance Metrics')
plt.show()

C:\Users\karth\AppData\Local\Temp\ipykernel_40188\1870844869.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  correlation_matrix = data.corr()

features = ['hit_exit_speed', 'hit_vertical_angle', 'hit_distance', 'hit_spinrate', 'plate_speed', 'release_speed']


sns.set(style="ticks")

#pairplot to visualize dependencies if any


sns.pairplot(data[features])
plt.suptitle('Pairplot of Baseball Batted Ball Variables', y=1.02)
plt.show()
# Plot release speed vs plate speed colored by year
plt.figure(figsize=(10, 6))
sns.scatterplot(x='release_speed', y='plate_speed', hue='year', data=data)
plt.title('Release Speed vs Plate Speed')
plt.xlabel('Release Speed (mph)')
plt.ylabel('Plate Speed (mph)')
plt.show()
# Plot hit exit speed vs hit distance colored by hit spinrate
plt.figure(figsize=(10, 6))
sns.scatterplot(x='hit_exit_speed', y='hit_distance', hue='hit_spinrate', data=data)
plt.title('Hit Exit Speed vs Hit Distance')
plt.xlabel('Hit Exit Speed (mph)')
plt.ylabel('Hit Distance (feet)')
plt.show()

We can infer from the plots that there is a very strong correlation between plate_speed and release_speed.
We can also see that hit_exit_speed and hit_vertical_angle have significant correlations with hit_distance.

# Plot hit distance over years


plt.figure(figsize=(10, 6))
sns.boxplot(x='year', y='hit_distance', data=data)
plt.title('Hit Distance Over Years')
plt.xlabel('Year')
plt.ylabel('Hit Distance')
plt.show()

It can be seen that the average hit distance marginally increased from 2015 to 2017 and then stayed roughly the same.
But this aggregates all event outcomes, so let's analyze each outcome separately.

from sklearn.preprocessing import StandardScaler


from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Step 3: Feature selection
# Select relevant features
features = ['release_speed', 'plate_speed', 'hit_exit_speed', 'hit_spinrate', 'hit_vertical_angle', 'hit_bearing']
target = 'hit_distance'

X = data[features]
y = data[target]

# Standardize the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Feature importance using Random Forest


rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_scaled, y)

# Get feature importances


importances = rf_model.feature_importances_
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances}).sort_values(by='Importance', ascending=False)

# Visualize feature importances


plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances)
plt.title('Feature Importances for Hitting Distance')
plt.show()
# Step 5: Analyze hitting distance over the years
data['year'] = pd.to_datetime(data['year'], format='%Y')
data['year'] = data['year'].dt.year

# Group by year and calculate mean hitting distance


mean_distance_per_year_event = data.groupby(['year', 'event_result'])['hit_distance'].mean().reset_index()

# Visualize hitting distance over the years


plt.figure(figsize=(12, 6))
sns.lineplot(data=mean_distance_per_year_event, x='year', y='hit_distance', hue='event_result', marker='o')
plt.title('Average Hitting Distance Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Hitting Distance')
plt.grid(True)
plt.show()

# Conclusion
mean_distance_per_year = data.groupby('year')['hit_distance'].mean()
print("Mean Hitting Distance per Year:")
print(mean_distance_per_year)
Mean Hitting Distance per Year:
year
2015 373.854976
2016 387.742645
2017 292.457418
2018 358.723862
2019 307.871089
Name: hit_distance, dtype: float64

#visualizing for home_runs


home_run_data = data[data['event_result'] == 'home_run']

# Group by year and calculate mean hitting distance for home runs
mean_distance_per_year_home_run = home_run_data.groupby('year')['hit_distance'].mean().reset_index()

# Visualize hitting distance over the years for home runs


plt.figure(figsize=(14, 8))
sns.lineplot(data=mean_distance_per_year_home_run, x='year', y='hit_distance', marker='o')
plt.title('Average Hitting Distance Over the Years for Home Runs')
plt.xlabel('Year')
plt.ylabel('Average Hitting Distance')
plt.grid(True)
plt.show()

There is not much variation in hitting distance for home runs, where the drag coefficient would matter most, since the ball spends the
longest time in the air and its sensitivity to air resistance is greatest.
Still, we will use a statistical test to confirm this, because for double plays there is a big difference.

#visualizing for double plays


double_play_data = data[data['event_result'] == 'double_play']

# Group by year and calculate mean hitting distance for double plays
mean_distance_per_year_double_play = double_play_data.groupby('year')['hit_distance'].mean().reset_index()

# Visualize hitting distance over the years for double plays


plt.figure(figsize=(14, 8))
sns.lineplot(data=mean_distance_per_year_double_play, x='year', y='hit_distance', marker='o')
plt.title('Average Hitting Distance Over the Years for Double_Play')
plt.xlabel('Year')
plt.ylabel('Average Hitting Distance')
plt.grid(True)
plt.show()
We can see a large difference here, which might be caused by data imbalance (relatively few double plays per year), hence we turn to a statistical test.

mean_distance_per_year = data.groupby('year')['hit_distance'].mean()
variance_distance_per_year = data.groupby('year')['hit_distance'].var()

# Perform ANOVA test to see if there are significant differences between years
years = data['year'].unique()
grouped_distances = [data[data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print("ANOVA test results: F-value =", f_value, ", P-value =", p_value)

# Conclusion
if p_value < 0.05:
    print("There is a significant difference in hitting distances over the years, suggesting that the ball might have changed.")
else:
    print("There is no significant difference in hitting distances over the years, suggesting that the ball might not have changed.")

print("\nMean Hitting Distance per Year for Home Runs:")


print(mean_distance_per_year_home_run)

ANOVA test results: F-value = 3.1858058293520846 , P-value = 0.012728353516836105


There is a significant difference in hitting distances over the years, suggesting that the ball might have changed.

Mean Hitting Distance per Year for Home Runs:


year hit_distance
0 2015 408.342609
1 2016 394.806732
2 2017 405.576999
3 2018 397.460663
4 2019 401.684965

We can see that the test concludes there is a significant change overall, yet we will still check the separate event types.

event_type = 'home_run'
event_data = data[data['event_result'] == event_type]
mean_distance_per_year_event = event_data.groupby('year')['hit_distance'].mean().reset_index()

# Statistical Analysis: Hypothesis Testing


# Calculate mean and variance of hitting distances per year for the specific event
mean_distance_per_year = event_data.groupby('year')['hit_distance'].mean()
variance_distance_per_year = event_data.groupby('year')['hit_distance'].var()

# Perform ANOVA test to see if there are significant differences between years for the specific event
years = event_data['year'].unique()
grouped_distances = [event_data[event_data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print(f"ANOVA test results for {event_type}: F-value =", f_value, ", P-value =", p_value)

# Conclusion
if p_value < 0.05:
    print(f"There is a significant difference in hitting distances over the years for {event_type}, suggesting that the ball might have changed.")
else:
    print(f"There is no significant difference in hitting distances over the years for {event_type}, suggesting that the ball might not have changed.")

ANOVA test results for home_run: F-value = 1.8153742795351 , P-value = 0.12679220333038124


There is no significant difference in hitting distances over the years for home_run, suggesting that the ball might not have changed.

We can infer that for home runs there is no significant variation, which suggests the ball has not varied.

event_type = 'triple'
event_data = data[data['event_result'] == event_type]
mean_distance_per_year_event = event_data.groupby('year')['hit_distance'].mean().reset_index()

# Statistical Analysis: Hypothesis Testing


# Calculate mean and variance of hitting distances per year for the specific event
mean_distance_per_year = event_data.groupby('year')['hit_distance'].mean()
variance_distance_per_year = event_data.groupby('year')['hit_distance'].var()

# Perform ANOVA test to see if there are significant differences between years for the specific event
years = event_data['year'].unique()
grouped_distances = [event_data[event_data['year'] == year]['hit_distance'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_distances)
print(f"ANOVA test results for {event_type}: F-value =", f_value, ", P-value =", p_value)

# Conclusion
if p_value < 0.05:
    print(f"There is a significant difference in hitting distances over the years for {event_type}, suggesting that the ball might have changed.")
else:
    print(f"There is no significant difference in hitting distances over the years for {event_type}, suggesting that the ball might not have changed.")

ANOVA test results for triple: F-value = 2.370830064743944 , P-value = 0.07853673370434733


There is no significant difference in hitting distances over the years for triple, suggesting that the ball might not have changed.

For triples, which need to be hit with a good vertical angle and enough power to carry through the air, the test also suggests there isn't any
significant change.

Let's calculate the drag coefficient and try to check the variation of the ball.
# Constants
g = 32.174 # gravity in ft/s^2

# Function to calculate drag coefficient with additional variables


def calculate_drag_coefficient(v_exit, theta, distance, spinrate, release_speed, plate_speed):
    return (2 * g * (release_speed - plate_speed) * (1 + 0.5 * spinrate) / v_exit**2) * (distance / np.sin(2 * np.radians(theta)))

# Calculate drag coefficient for each row


data['drag_coefficient'] = calculate_drag_coefficient(data['hit_exit_speed'], data['hit_vertical_angle'], data['hit_distance'],
                                                      data['hit_spinrate'], data['release_speed'], data['plate_speed'])

data.head()

year month pitcher_throws bat_side pitch_type release_speed plate_speed hit_exit_speed hit_spinrate hit_vertical_angle hit_bearing hit_d

0 2016 7 R R FF 93.433688 85.791840 101.387283 1954.304443 25.563499 -22.539516

1 2016 5 L R FT 89.341958 82.691620 94.986938 5588.018066 60.409538 -46.960789

2 2016 4 R L FF 91.367354 84.554413 80.617020 2264.892334 30.243307 39.408298

4 2016 7 R R FF 91.033388 84.686417 104.878571 1015.863892 12.043263 1.585894

7 2016 9 R L FF 89.689889 82.036316 90.263031 4674.958008 44.270741 -4.610770


import pandas as pd
import numpy as np
from scipy import stats

# Assuming 'data' is your DataFrame with columns 'year' and 'drag_coefficient'


# Calculate mean and variance of drag coefficient per year
mean_drag_coefficient_per_year = data.groupby('year')['drag_coefficient'].mean()
variance_drag_coefficient_per_year = data.groupby('year')['drag_coefficient'].var()

# Perform ANOVA test to see if there are significant differences between years
years = data['year'].unique()
grouped_drag_coefficients = [data[data['year'] == year]['drag_coefficient'].values for year in years]
f_value, p_value = stats.f_oneway(*grouped_drag_coefficients)
print("ANOVA test results: F-value =", f_value, ", P-value =", p_value)

# Conclusion based on P-value


alpha = 0.05
if p_value < alpha:
print("There is a significant difference in drag coefficients over the years.")
else:
print("There is no significant difference in drag coefficients over the years.")

# Print mean drag coefficient per year (optional)


print("\nMean Drag Coefficient per Year:")
print(mean_drag_coefficient_per_year)

ANOVA test results: F-value = 0.68306598982621 , P-value = 0.6036401459643094


There is no significant difference in drag coefficients over the years.

Mean Drag Coefficient per Year:


year
2015 34679.031673
2016 34814.657313
2017 25612.040079
2018 35963.735929
2019 49905.246035
Name: drag_coefficient, dtype: float64

The drag coefficient proxy we calculated also suggests that the ball has not varied over the years.
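As a cross-check on the ad hoc formula above, a simpler and more physically direct proxy for drag is the fraction of speed the pitch loses between the release measurement (taken at 50 feet) and home plate: a ball with more drag loses a larger fraction of its speed. A minimal sketch running the same ANOVA on that proxy:

# Fractional pitch speed loss between release (measured at 50 ft) and home plate
data['speed_loss_frac'] = (data['release_speed'] - data['plate_speed']) / data['release_speed']

# Mean speed loss per year
print(data.groupby('year')['speed_loss_frac'].mean())

# ANOVA across years on the speed-loss proxy
grouped_loss = [data[data['year'] == year]['speed_loss_frac'].values for year in data['year'].unique()]
f_value, p_value = stats.f_oneway(*grouped_loss)
print("ANOVA on speed loss: F-value =", f_value, ", P-value =", p_value)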

From the analysis and the statistical tests performed, most cases suggest that the ball hasn't varied much, and the cases that do show a
change could be driven by data imbalance, which might introduce bias. Calculating the drag coefficient likewise indicates
that the ball has not varied.
