PROJECT:
MICROECONOMETRICS
I. Introduction:
In modern manufacturing, predicting defects is vital for reducing waste and costs and for enhancing
customer satisfaction. This project uses a dataset of 34,516 observations with 14 variables to
develop a predictive model for defect occurrence on production lines. By applying Logit and
Probit models, the aim is to identify significant variables and optimize the model for better
accuracy and performance, ultimately helping businesses improve product quality and efficiency.
Evaluation is based on metrics such as accuracy, the True Positive Rate (TPR), and the False
Positive Rate (FPR).
Statistical Summary: The dataset is summarized (mean, median, standard deviation, etc.) for each
explanatory variable (X1 to X13) and for the dependent variable (Y); see the data summary table
below.
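A minimal sketch of how such a summary can be produced in R, assuming the data frame is named data (as in the glm call below) with numeric columns Y and X1 to X13:

summary(data)      # min, 1st quartile, median, mean, 3rd quartile, max for each column
sapply(data, sd)   # standard deviation of each column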
Call:
glm(formula = Y ~ ., family = binomial, data = data)

Coefficients:
       Estimate  Std. Error  z value  Pr(>|z|)
...
X5   -4.987e-01   5.229e-01   -0.954  0.340216
X6   -9.620e-01   1.283e+00   -0.750  0.453416
X7    4.207e+00   4.344e+00    0.968  0.332819
X8   -1.212e+02   7.132e+01   -1.699  0.089394 .
X9    4.804e-01   2.521e+00    0.191  0.848888
X10   1.598e-01   4.239e-02    3.770  0.000163 ***
X11   6.175e-03   1.271e-02    0.486  0.627213
X12   1.039e-02   1.390e-02    0.747  0.454980
X13  -7.231e+00   2.369e+00   -3.052  0.002275 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output above summarizes a logistic regression model fitted to the data: I am predicting the
binary outcome Y from all the explanatory variables (X1 to X13) in the dataset.
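For reference, the logit specification models the log-odds of a defect as a linear function of the predictors, while the probit specification (used later for comparison) replaces the logistic link with the standard normal CDF Φ:

log( P(Y = 1 | X) / (1 - P(Y = 1 | X)) ) = β0 + β1·X1 + ... + β13·X13   (logit)
P(Y = 1 | X) = Φ(β0 + β1·X1 + ... + β13·X13)                            (probit)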
Intercept: the log-odds of Y = 1 when all predictors are zero. Its estimate is very large but not
significant at the usual levels (p-value = 0.079).
Predictors X1, X2, X3, X4, X5, X6, X7, X8, X9, X11, and X12 have p-values greater than
0.05, suggesting they are not statistically significant in the model; they do not provide strong
evidence against the null hypothesis that their coefficients are zero.
X10: a significant predictor (p-value = 0.000163) with a positive effect on the log-odds
of Y, indicating that a higher value of X10 increases the likelihood of Y = 1.
X13: also significant (p-value = 0.002275), with a large negative effect on the log-odds
of Y, suggesting that higher values of X13 decrease the likelihood of Y = 1.
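Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios. A sketch, assuming the fitted model object is named fit (a hypothetical name):

exp(coef(fit)[c("X10", "X13")])
# X10: exp(0.1598)  ≈ 1.17   -> each one-unit increase in X10 multiplies the odds of a defect by about 1.17
# X13: exp(-7.231)  ≈ 0.0007 -> a one-unit increase in X13 reduces the odds by more than 99%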
Model Fit Statistics
Degrees of Freedom: 15,887 (null model) and 15,874 (residual model); the difference of
13 matches the 13 predictors used.
Deviance Reduction: the drop from the null deviance to the residual deviance suggests
that the model with predictors fits better than the null model; the size of the drop can be tested
against a chi-square distribution with 13 degrees of freedom (see the sketch after this list).
AIC (Akaike Information Criterion): 2063.3. AIC penalizes model complexity, so lower
values indicate a better fit per parameter; it is mainly useful for comparing candidate models
fitted to the same data.
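The chi-square comparison can be carried out directly from the fitted object; a sketch, again assuming the model object is named fit:

lr_stat <- fit$null.deviance - fit$deviance        # deviance reduction
df_diff <- fit$df.null - fit$df.residual           # 15887 - 15874 = 13
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # p-value of the likelihood-ratio test
AIC(fit)                                           # 2063.3 for the model above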
Significance Codes:
‘***’: Highly significant (p < 0.001)
‘**’: Significant (0.001 < p < 0.01)
‘*’: Moderately significant (0.01 < p < 0.05)
‘.’: Marginally significant (0.05 < p < 0.1)
‘ ’: Not significant (p > 0.1)
Graphical Analysis:
1. Check for missing values: if the data contain missing values, they need to be
addressed (by removing rows, imputing values, etc.) before building the model, as they can bias
the analysis results.
2. Check for outliers: outliers can negatively affect the model, especially regression
models; detecting and handling them can improve the model's accuracy. A sketch of both checks
follows.
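A minimal sketch of both checks in R, using the 1.5×IQR rule and X10 purely as an illustration:

colSums(is.na(data))      # count of missing values per column
data <- na.omit(data)     # simplest remedy: drop incomplete rows (one option among several)

q   <- quantile(data$X10, c(0.25, 0.75))
iqr <- q[2] - q[1]
outlier <- data$X10 < q[1] - 1.5 * iqr | data$X10 > q[2] + 1.5 * iqr
sum(outlier)              # number of flagged observations
boxplot(data$X10)         # visual check of the same rule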
The correlation analysis shows no strong linear relationship between Y and the
independent variables (X1 - X13). X1 and X2 are highly correlated with each other, potentially
causing multicollinearity issues, while X3 and X4 are moderately correlated. Variables X5 - X8
have low correlations, reducing the multicollinearity risk. Variable selection methods such as
backward elimination or the AIC/BIC criteria are recommended for model refinement, as
sketched below.
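One way to apply this in R is the built-in step() function, which performs AIC-based backward elimination; a sketch with hypothetical object names:

fit_full <- glm(Y ~ ., family = binomial, data = data)
fit_back <- step(fit_full, direction = "backward")  # drop predictors while AIC improves
summary(fit_back)                                   # the refined model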
Correlation heatmap
The correlation heatmap shows the relationships between the dependent variable Y and the
independent variables (X1 - X13). The color scale ranges from 0 to 1, with lighter colors
representing higher correlations. The heatmap indicates that none of the variables is strongly
correlated with Y, as the values are generally low. However, some variables, such as X1 and X2,
are more strongly correlated with each other, which could indicate multicollinearity. This
suggests that further investigation is needed to refine the model.
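A heatmap like the one described can be reproduced with base R; a minimal sketch, assuming all columns of data are numeric (the exact color scale may differ from the figure):

cor_mat <- cor(data)                                 # correlations among Y and X1..X13
round(cor_mat, 2)                                    # numeric view of the same matrix
heatmap(cor_mat, symm = TRUE, Rowv = NA, Colv = NA)  # keep original variable order, no dendrograms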
2. Accuracy, TPR, and FPR for Logit Model
Confusion Matrix:
                 Predicted FALSE   Predicted TRUE
Actual Y = 0              25,575            8,635
Actual Y = 1                 147              158
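From this matrix the three metrics follow directly. A sketch, assuming the fitted logit model is fit and that the classification cutoff (0.5 here) matches the one behind the matrix above:

pred <- fitted(fit) > 0.5                        # classify at an assumed 0.5 cutoff
cm   <- table(Actual = data$Y, Predicted = pred)
TN <- cm["0", "FALSE"]; FP <- cm["0", "TRUE"]
FN <- cm["1", "FALSE"]; TP <- cm["1", "TRUE"]
(TP + TN) / sum(cm)  # accuracy: (25575 + 158) / 34515 ≈ 0.745
TP / (TP + FN)       # TPR:      158 / 305             ≈ 0.518
FP / (FP + TN)       # FPR:      8635 / 34210          ≈ 0.252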
- ROC (Receiver Operating Characteristic) curves are used to evaluate the performance of
classification models by plotting the True Positive Rate (TPR) against the False Positive Rate
(FPR) at various threshold levels.
- Purpose in this context: The ROC curve helps compare the performance of two models: Logit
and Probit. The closer the curve is to the top-left corner, the better the model's performance in
distinguishing between classes.
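A sketch of how both curves can be produced, using the pROC package (an assumption; any ROC utility would do) and hypothetical model names fit_logit and fit_probit:

library(pROC)

fit_logit  <- glm(Y ~ ., family = binomial(link = "logit"),  data = data)
fit_probit <- glm(Y ~ ., family = binomial(link = "probit"), data = data)

roc_logit  <- roc(data$Y, fitted(fit_logit))
roc_probit <- roc(data$Y, fitted(fit_probit))

plot(roc_logit, col = "blue")                      # TPR vs. FPR across thresholds
lines(roc_probit, col = "red")
legend("bottomright", legend = c("Logit", "Probit"),
       col = c("blue", "red"), lwd = 1)
auc(roc_logit); auc(roc_probit)                    # area under each curve, higher is better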
V. Conclusion:
Choosing the Best Model: Based on the ROC curves, the Probit model performs slightly better
than the Logit model on this dataset: its curve is consistently closer to the top-left corner,
indicating a higher True Positive Rate (TPR) with fewer false positives across the thresholds
considered. This suggests the Probit model is better at correctly classifying positive cases while
keeping errors low. In addition, the Probit model may offer a better fit when the latent variable
driving the outcome is normally distributed, further supporting its predictive accuracy. The
difference is not large, and both models perform relatively well, but the Probit model's slight
edge in classification performance makes it the preferred choice for this dataset.
---THE END---