Code Book

This code book contains four Python scripts that perform linear regression analysis on different datasets. The first script analyzes the relationship between percentage in grade 10 and MBA salary. The second examines the relationship between body weight and hospital treatment costs. The third performs multiple linear regression on an IPL dataset to predict player sold prices: it encodes categorical variables, builds the model on a training set, and analyzes residuals. The fourth sets up multiple linear regression on a car prices dataset to predict prices from vehicle features.

Uploaded by

ABHISHEK DAS
Copyright
© All Rights Reserved

CODE BOOK

Linear Statistical Models and Regression Analysis LAB (ED212)

Priyansh Pratap Singh


Roll No: IED/10016/22

CODE 1: Linear Regression Analysis of MBA Salary Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
mba_salary_df = pd.read_csv(r"C:\Users\Abhishek\Downloads\MBA Salary.csv")
print(mba_salary_df.head(10))
print(mba_salary_df.info())
plt.scatter(mba_salary_df['Percentage in Grade 10'] , mba_salary_df['Salary']);
# Defining x and y features
x = sm.add_constant(mba_salary_df['Percentage in Grade 10'])
print(x.head())
y = mba_salary_df['Salary']
print(y.head())
x_train , x_test , y_train, y_test = train_test_split(x , y , test_size=0.2 , random_state = 32)
print(x_train , "\n" , x_test)
#fitting OLS
mba_salary_lm = sm.OLS(y_train , x_train).fit()
print(mba_salary_lm.params)
print(mba_salary_lm.summary2())

OUTPUTS
Explanation:
1. Data Import and Exploration:
- The code first imports the necessary libraries (NumPy, Pandas, Matplotlib, Statsmodels, and scikit-learn) and reads a CSV file containing MBA salary data.
- It prints the first 10 rows of the dataset and summary information about it, including the data types and non-null counts.
2. Data Visualization:
- It creates a scatter plot to visualize the relationship between "Percentage in Grade 10" and "Salary."
3. Defining Features (X) and Target (Y):
- It prepares the data for regression by defining the feature variable (X) as "Percentage in Grade 10" and adding a constant
term.
- The target variable (Y) is set as "Salary."
4. **Training and Testing Data Split:**
- The code splits the data into training and testing sets using the train_test_split function from scikit-learn. 80% of the
data is used for training, and 20% is used for testing. A random seed (random_state) is set to ensure reproducibility.

5. **Fitting Ordinary Least Squares (OLS) Regression:**


- It fits a linear regression model (Ordinary Least Squares) using the training data, where "Percentage in Grade 10" is used
to predict "Salary."
- The model parameters and summary statistics are printed.
6. **Interpretation of the Regression Summary:**
- The regression summary provides information about the model's performance and statistical significance:
- R-squared: This is a measure of how well the independent variable ("Percentage in Grade 10") explains the variation in
the dependent variable ("Salary"). An R-squared of 0.092 indicates that only 9.2% of the variance in salary can be explained
by the percentage in Grade 10.
- Coefficients: The coefficients show the estimated impact of "Percentage in Grade 10" on "Salary." The const coefficient
represents the intercept, and the "Percentage in Grade 10" coefficient indicates the change in salary for a one-unit change
in the grade percentage.
- P-values: P-values indicate the statistical significance of the coefficients. The p-value of 0.0577 for "Percentage in Grade 10" is slightly above 0.05, so the coefficient is not statistically significant at the 0.05 level, although it would be at the 0.10 level.
- Confidence Intervals: The confidence intervals provide a range within which the true population parameters are likely
to fall.
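
The quantities discussed in this list can be reproduced by hand on a small example. The following is a minimal sketch using NumPy only, on synthetic data with made-up numbers (it is not the MBA salary dataset, and the variable names are illustrative). It builds a design matrix with a column of ones, which is the role sm.add_constant plays, estimates the intercept and slope by least squares, and computes R-squared as 1 - SS_residual / SS_total.

```python
import numpy as np

# Synthetic data with hypothetical values (NOT the MBA salary dataset).
rng = np.random.default_rng(0)
x = rng.uniform(40, 90, size=50)                  # stand-in for a percentage-style predictor
y = 100.0 + 2.0 * x + rng.normal(0, 10, size=50)  # linear trend plus noise

# Design matrix with a leading column of ones (the intercept term).
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimates of the intercept and the slope.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# R-squared = 1 - SS_residual / SS_total.
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept={intercept:.2f}, slope={slope:.2f}, R^2={r_squared:.3f}")
```

With a single predictor, the R-squared computed this way equals the squared correlation between x and y, which is a quick sanity check on the calculation.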
In summary, the analysis suggests that "Percentage in Grade 10" has limited explanatory power in predicting "Salary." The low R-squared value and a p-value of 0.0577, which narrowly misses the 0.05 threshold, indicate that there may be other factors influencing salary that are not captured by this model.
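
In the code above, x_test and y_test are created by the split but are not used afterwards. One possible use for the held-out 20% is to measure prediction error on data the model has not seen. The sketch below illustrates the idea with NumPy on synthetic data; the hand-rolled shuffle is only a stand-in for train_test_split, and all values and names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data with hypothetical values (NOT the MBA salary dataset).
rng = np.random.default_rng(32)
n = 100
x = rng.uniform(40, 90, size=n)
y = 100.0 + 2.0 * x + rng.normal(0, 10, size=n)
X = np.column_stack([np.ones(n), x])  # intercept column plus predictor

# Seeded shuffle followed by an 80/20 split (stand-in for train_test_split).
idx = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Evaluate on the held-out rows with root mean squared error (RMSE).
pred_test = X[test_idx] @ beta
rmse_test = np.sqrt(np.mean((y[test_idx] - pred_test) ** 2))

print(f"train rows={len(train_idx)}, test rows={len(test_idx)}, test RMSE={rmse_test:.2f}")
```

With a fitted statsmodels result, the analogous step would be calling its predict method on x_test and comparing the predictions with y_test.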
CODE 2: Linear Regression Analysis of Hospital Treatment Costs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
cost_treatment_df = pd.read_excel(r"C:\Users\Abhishek\Downloads\DAD Hospital DATA.xlsx")
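# Data overview and scatter plot (steps 1 and 2 of the explanation)
print(cost_treatment_df.head(10))
print(cost_treatment_df.info())
plt.scatter(cost_treatment_df['BodyWeight'] , cost_treatment_df['CostofTreatment']);
# Defining x and y features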
x = sm.add_constant(cost_treatment_df['BodyWeight'])
print(x.head())
y = cost_treatment_df['CostofTreatment']
print(y.head())
x_train , x_test , y_train, y_test = train_test_split(x , y , test_size = 0.2 , random_state = 42)
print(x_train , "\n" , x_test)
#fitting OLS
cost_treatment_lm = sm.OLS(y_train , x_train).fit()
print(cost_treatment_lm.params)
print(cost_treatment_lm.summary2())

OUTPUTS
Explanation:
1. **Data Overview:**

- The code begins by importing the necessary libraries and reading a dataset from an Excel file named "DAD Hospital
DATA.xlsx."

- It prints the first 10 rows of the dataset and provides information about the dataset, including the data types and non-null counts.

2. **Data Visualization:**

- A scatter plot is created to visualize the relationship between "BodyWeight" (independent variable) and
"CostofTreatment" (dependent variable).

3. **Defining Features (X) and Target (Y):**

- The code prepares the data for regression by defining the feature variable (X) as "BodyWeight" and adding a constant
term.

- The target variable (Y) is set as "CostofTreatment."

4. **Training and Testing Data Split:**

- The dataset is split into training and testing sets using the train_test_split function from scikit-learn. Because test_size is 0.2, 80% of the data is used for training and 20% for testing. A random seed (random_state) is set for reproducibility.

5. **Fitting Ordinary Least Squares (OLS) Regression:**


- A linear regression model (Ordinary Least Squares) is fitted using the training data, where "BodyWeight" is used to
predict "CostofTreatment."

- The model parameters, including the intercept and coefficient for "BodyWeight," are printed.

6. **Interpretation of the Regression Summary:**

- The regression summary provides information about the model's performance and statistical significance:

- R-squared: This is a measure of how well the independent variable ("BodyWeight") explains the variation in the
dependent variable ("CostofTreatment"). An R-squared of 0.048 suggests that only 4.8% of the variance in the cost of
treatment can be explained by body weight.

- Coefficients: The coefficients show the estimated impact of "BodyWeight" on "CostofTreatment." The const coefficient
represents the intercept, and the "BodyWeight" coefficient indicates the change in the cost of treatment for a one-unit
change in body weight.

- P-values: P-values indicate the statistical significance of the coefficients. A p-value of 0.0228 for "BodyWeight" suggests
that it is statistically significant at a 0.05 significance level.
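
The interpretation of the slope as "the change in cost for a one-unit change in body weight" can be checked with a few lines of arithmetic. The sketch below uses hypothetical parameter values chosen only for illustration (they are not the values from the regression output), and predict_cost is an illustrative helper name.

```python
# Hypothetical fitted parameters, chosen for illustration only
# (they are NOT the values from the lab's regression output).
intercept = 500.0
slope = 12.5

def predict_cost(body_weight):
    """Predicted cost from a fitted simple linear regression line."""
    return intercept + slope * body_weight

# Two patients who differ by exactly one unit of body weight.
difference = predict_cost(61.0) - predict_cost(60.0)
print(difference)  # 12.5, the same as the slope
```

The intercept plays the matching role at the other end: it is the predicted cost when body weight is zero, which is why it is often not meaningful on its own.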

In summary, the analysis suggests that "BodyWeight" has a limited explanatory power in predicting the "CostofTreatment."
The low R-squared value indicates that body weight explains only a small portion of the variation in treatment costs. The
coefficient for "BodyWeight" is statistically significant, but the overall model fit is not very strong.
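
A coefficient can be statistically significant while R-squared stays low, because significance also depends on the sample size. The following sketch shows this on synthetic data with NumPy. The weak true effect, the large sample, and all names are assumptions made for illustration. The standard error is computed from the usual OLS formula, the square root of the diagonal of sigma^2 (X'X)^-1.

```python
import numpy as np

# Synthetic data: a weak true effect buried in noise (hypothetical values).
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(0, 1, size=n)
y = 0.3 * x + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# R-squared: share of the variance in y explained by the model.
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Standard errors from sigma^2 (X'X)^-1, with sigma^2 = SS_residual / (n - p).
sigma2 = (resid @ resid) / (n - X.shape[1])
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))

# t-statistic of the slope: estimate divided by its standard error.
t_slope = beta[1] / se[1]

print(f"R^2={r_squared:.3f}, slope t-statistic={t_slope:.1f}")
```

A large t-statistic says the slope is reliably different from zero. It does not say the predictor explains much of the variation, which is what R-squared measures.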

CODE 3: IPL Data


# Multiple Linear Regression through IPL
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
ipl_df = pd.read_csv(r"C:\Users\Abhishek\Python\Regression Analysis\IPL IMB381IPL2013.csv")
plt.scatter(ipl_df['AVE'] , ipl_df['SOLD PRICE'])
# Defining x and y features
ipl_df.columns # To display column names
x_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'T-RUNS', 'T-WKTS', 'ODI-RUNS-S',
              'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S', 'HS',
              'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL', 'ECON', 'SR-BL']
# Encoding categorical variables
ipl_df['PLAYING ROLE'].unique() # For displaying the levels
pd.get_dummies(ipl_df['PLAYING ROLE'])[0:5] # For displaying dummies of the variable 'PLAYING ROLE'
# Defining the set of categorical variables
categ = ['AGE' , 'COUNTRY' , 'PLAYING ROLE' , 'CAPTAINCY EXP']
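# Dummies to be included in a new dataframe
# dtype = float keeps the dummies numeric; recent pandas versions otherwise return booleans, which sm.OLS may reject
ipl_encoded_df = pd.get_dummies(ipl_df[x_features], columns = categ, drop_first = True, dtype = float)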
ipl_encoded_df.columns # For displaying columns of new encoded dataframe
x_features = ipl_encoded_df.columns
# Adding a constant to the encoded features, defining x and y for regression
x = sm.add_constant(ipl_encoded_df)
y = ipl_df['SOLD PRICE']
# Training-testing split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state = 42)
print(x_train)
# Building the model on the training dataset
ipl_lm = sm.OLS(y_train,x_train).fit()
print(ipl_lm.summary2())
# Regression with significant variables
signi_var = ['HS' , 'AVE' , 'AGE_2' , 'AGE_3' , 'COUNTRY_BAN' , 'COUNTRY_ENG' ,
'COUNTRY_IND' ,
'COUNTRY_NZ' , 'COUNTRY_PAK' , 'COUNTRY_SA' , 'COUNTRY_SL' ,
'COUNTRY_WI' , 'COUNTRY_ZIM']
x_train = x_train[signi_var]
ipl_model2 = sm.OLS(y_train,x_train).fit()
print(ipl_model2.summary2())
# Residual Analysis: Checking normality assumption and homoscedasticity
# Q-Q plot: Quantile-quantile plot comparing the theoretical quantiles of a normal distribution
# with the quantiles of the residuals.
sm.qqplot(ipl_model2.resid , line = 's');
import seaborn as sns
sns.histplot(ipl_model2.resid)
# Residual Plot for Homoscedasticity
# The residual plot shows standardized fitted values (X-axis) against standardized residuals (Y-axis).
def get_standardized_values(vals):
    return (vals - vals.mean()) / vals.std()
plt.scatter(get_standardized_values(ipl_model2.fittedvalues), get_standardized_values(ipl_model2.resid))
OUTPUTS
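
The residual plot in Code 3 is read by eye: a funnel shape suggests that the spread of the residuals changes with the fitted values (heteroscedasticity). As a rough numeric companion, the sketch below standardizes residuals in the same way as get_standardized_values and compares two synthetic datasets, one with constant noise and one where the noise grows with the predictor. The data, the helper fit_and_residuals, and the correlation check are assumptions made for illustration; they are not part of the original analysis and not a formal test.

```python
import numpy as np

def get_standardized_values(vals):
    # Same idea as the helper in Code 3; ddof=1 mirrors the pandas default for .std().
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.mean()) / vals.std(ddof=1)

def fit_and_residuals(x, y):
    # Simple OLS with an intercept; returns fitted values and residuals.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=500)
noise = rng.normal(0, 1, size=500)

# Hypothetical data: constant noise spread vs. spread that grows with x.
fitted_homo, resid_homo = fit_and_residuals(x, 2 + 3 * x + noise)
fitted_hetero, resid_hetero = fit_and_residuals(x, 2 + 3 * x + noise * x)

# Crude check: does the size of the residuals trend with the fitted values?
corr_homo = np.corrcoef(fitted_homo, np.abs(resid_homo))[0, 1]
corr_hetero = np.corrcoef(fitted_hetero, np.abs(resid_hetero))[0, 1]

print(f"constant spread: {corr_homo:.2f}, growing spread: {corr_hetero:.2f}")
```

A correlation near zero is consistent with constant spread, while a clearly positive value matches the funnel shape one would see in the scatter plot.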
CODE 4: CARS PRICES
# Multiple Linear Regression with Car Price
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
cprice_df = pd.read_csv(r"C:\Users\Abhishek\Python\Regression Analysis\CAR PRICES\CarPrice_Assignment.csv")
print(cprice_df.info())
# Defining x and y features
cprice_df.columns # To display column names
x_features = ['symboling', 'fueltype' , 'aspiration' , 'doornumber' , 'carbody' ,
'drivewheel' , 'enginelocation' , 'wheelbase' , 'carlength' , 'carwidth' , 'carheight' ,
'curbweight' ,
'enginetype' , 'cylindernumber' , 'enginesize' , 'fuelsystem' , 'boreratio' , 'stroke' ,
'compressionratio' , 'horsepower' , 'peakrpm' , 'citympg' , 'highwaympg']
# Defining the set of categorical variables
categ = ['fueltype' , 'aspiration' , 'doornumber' , 'carbody' , 'drivewheel' ,
'enginelocation' , 'enginetype' , 'cylindernumber' , 'fuelsystem']
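# Dummies to be included in a new dataframe
# dtype = float keeps the dummies numeric; recent pandas versions otherwise return booleans, which sm.OLS may reject
cprice_encoded_df = pd.get_dummies(cprice_df[x_features] , columns = categ , drop_first = True , dtype = float)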
cprice_encoded_df.columns # For displaying columns of new encoded dataframe
# Reassigning x_features
x_features = cprice_encoded_df.columns

# Adding a constant to the encoded features, defining x and y for regression
x = sm.add_constant(cprice_encoded_df)
y = cprice_df['price']
# Training-testing split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.2 , random_state = 42)
# Building the model on the training dataset
cprice_lm = sm.OLS(y_train,x_train).fit()
print(cprice_lm.summary2())
OUTPUTS
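
Code 4 passes drop_first = True to pd.get_dummies. A likely reason is the dummy variable trap: when a constant is added, keeping every level of a categorical variable makes the columns linearly dependent. The sketch below demonstrates this on a few made-up rows. The column names follow Code 4, but the values are hypothetical, design_matrix is an illustrative helper, and dtype=float is an assumption added so that the dummies are numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow Code 4, values are made up.
cars = pd.DataFrame({
    "horsepower": [111, 102, 115, 110, 95, 88],
    "fueltype": ["gas", "gas", "diesel", "gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "wagon", "wagon", "sedan", "hatchback"],
})
categ = ["fueltype", "carbody"]

def design_matrix(encoded):
    # Prepend a column of ones, the role sm.add_constant plays in Code 4.
    values = encoded.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(values)), values])

# Encode with every level kept, and with the first level of each variable dropped.
full = pd.get_dummies(cars, columns=categ, drop_first=False, dtype=float)
reduced = pd.get_dummies(cars, columns=categ, drop_first=True, dtype=float)

X_full = design_matrix(full)
X_reduced = design_matrix(reduced)

# A rank below the number of columns means the columns are linearly dependent.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(list(reduced.columns))
print("all levels kept:  ", X_full.shape[1], "columns, rank", rank_full)
print("first level dropped:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When all levels are kept, the dummies of each variable sum to the constant column, so the rank falls short of the column count. Dropping one level per variable removes that redundancy, and the dropped level becomes the baseline against which the remaining coefficients are read.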