Code Book
Linear Statistical Models and Regression Analysis LAB (ED212)
OUTPUTS
Explanation:
1. Data Import and Exploration:
- The code first imports necessary libraries (NumPy, Pandas, Matplotlib, Statsmodels, and scikit-learn) and reads an Excel
file containing MBA salary data.
- It prints the first 10 rows of the dataset and provides information about the dataset, including the data types and non-
null counts.
2. Data Visualization:
- It creates a scatter plot to visualize the relationship between "Percentage in Grade 10" and "Salary."
3. Defining Features (X) and Target (Y):
- It prepares the data for regression by defining the feature variable (X) as "Percentage in Grade 10" and adding a constant
term, so that the fitted model includes an intercept (statsmodels' OLS does not add one automatically).
- The target variable (Y) is set as "Salary."
4. Training and Testing Data Split:
- The code splits the data into training and testing sets using the train_test_split function from scikit-learn. 80% of the
data is used for training, and 20% is used for testing. A random seed (random_state) is set to ensure reproducibility.
OUTPUTS
Explanation:
1. **Data Overview:**
- The code begins by importing the necessary libraries and reading a dataset from the Excel file "DAD Hospital
DATA.xlsx".
- It prints the first 10 rows of the dataset and provides information about the dataset, including the data types and non-
null counts.
2. **Data Visualization:**
- A scatter plot is created to visualize the relationship between "BodyWeight" (independent variable) and
"CostofTreatment" (dependent variable).
3. **Defining Features (X) and Target (Y):**
- The code prepares the data for regression by defining the feature variable (X) as "BodyWeight" and adding a constant
term. The target variable (Y) is "CostofTreatment."
4. **Training and Testing Data Split:**
- The dataset is split into training and testing sets using the train_test_split function from scikit-learn. 90% of the data is
used for training, and 10% is used for testing. A random seed (random_state) is set for reproducibility.
5. **Model Parameters and Regression Summary:**
- The model parameters, including the intercept and coefficient for "BodyWeight," are printed.
- The regression summary reports the model's fit and the statistical significance of its coefficients:
  - R-squared: The proportion of the variance in the dependent variable ("CostofTreatment") that is explained by the
  independent variable ("BodyWeight"). An R-squared of 0.048 means that only 4.8% of the variance in the cost of
  treatment is explained by body weight.
  - Coefficients: The coefficients show the estimated effect of "BodyWeight" on "CostofTreatment." The const coefficient
  is the intercept, and the "BodyWeight" coefficient is the expected change in the cost of treatment for a one-unit
  change in body weight.
  - P-values: P-values indicate the statistical significance of the coefficients. The p-value of 0.0228 for "BodyWeight" is
  below 0.05, so the coefficient is statistically significant at the 0.05 significance level.
In summary, the analysis suggests that "BodyWeight" has limited explanatory power in predicting "CostofTreatment."
The low R-squared value (0.048) indicates that body weight explains only a small portion of the variation in treatment costs,
leaving roughly 95% unexplained. The coefficient for "BodyWeight" is statistically significant, but the overall model fit is weak.