
THE JOINT BACHELOR PROGRAM IN APPLIED FINANCE

University of Economics Ho Chi Minh City & University of Rennes

PROJECT:
MICROECONOMETRICS

I. Introduction:
In modern manufacturing, predicting defects is vital to reducing waste and costs and to
enhancing customer satisfaction. This project uses a dataset of 34,516 observations with 14
variables to develop a predictive model for defect occurrence on production lines. By applying
Logit and Probit models, the aim is to identify significant variables and to optimize the model
for accuracy and performance, ultimately helping businesses improve product quality and
efficiency. Evaluation is based on classification accuracy, the True Positive Rate (TPR), and the
False Positive Rate (FPR).

II. Analyze the Data Statistically and Graphically:


Objective: Understand the distribution and relationships of the data.

Statistical Summary: Summarize the dataset (minimum, quartiles, median, mean, maximum, and
missing-value counts) for each explanatory variable (X1 to X13) and the dependent variable (Y).
See the data summary table below:

Statistic Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13


Min. 0.000000 101.8 0.0 82.0 99.99 0.00 12.03 0.240 11.97 5.670 0.00 6.30 0.00 5.740
1st Qu. 0.000000 148.7 149.2 149.4 111.04 11.85 12.21 0.380 11.97 6.410 22.28 13.50 94.31 6.420
Median 0.000000 158.0 156.2 158.7 113.16 12.04 12.26 0.390 11.97 6.610 23.88 16.40 98.50 6.610
Mean 0.008837 159.9 156.9 159.6 113.35 11.97 12.26 0.388 11.97 6.548 23.63 17.88 97.70 6.551
3rd Qu. 0.000000 169.3 164.4 168.9 115.38 12.08 12.30 0.410 11.97 6.620 25.29 20.20 102.23 6.610
Max. 1.000000 198.3 196.9 198.1 177.95 12.19 12.50 0.420 11.99 6.670 43.41 84.60 127.30 6.670
NA's - - - - - - - 18627 - - - - - -
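The summary table above can be reproduced in R along the following lines; this is a minimal
sketch, and the file name defects.csv is a hypothetical placeholder (the report does not state
the data source):

# Load the data; "defects.csv" is an assumed file name
data <- read.csv("defects.csv")

# Minimum, quartiles, median, mean, maximum, and NA counts per variable,
# as in the table above
summary(data)

# Missing values per column; only X7 has NA's (18,627 of them)
colSums(is.na(data))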

See the R output for the fitted Logit model below:

Call:
glm(formula = Y ~ ., family = binomial, data = data)

Coefficients:

Estimate Std. Error z value Pr(>|z|)


(Intercept) 1.500e+03 8.538e+02 1.757 0.078992 .
X1 3.093e-03 4.825e-03 0.641 0.521559
X2 9.998e-03 1.003e-02 0.997 0.318859
X3 -7.476e-03 4.988e-03 -1.499 0.133948
X4 2.404e-03 2.069e-02 0.116 0.907494

X5 -4.987e-01 5.229e-01 -0.954 0.340216
X6 -9.620e-01 1.283e+00 -0.750 0.453416
X7 4.207e+00 4.344e+00 0.968 0.332819
X8 -1.212e+02 7.132e+01 -1.699 0.089394 .
X9 4.804e-01 2.521e+00 0.191 0.848888
X10 1.598e-01 4.239e-02 3.770 0.000163 ***
X11 6.175e-03 1.271e-02 0.486 0.627213
X12 1.039e-02 1.390e-02 0.747 0.454980
X13 -7.231e+00 2.369e+00 -3.052 0.002275 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2103.7 on 15887 degrees of freedom


Residual deviance: 2035.3 on 15874 degrees of freedom
(18627 observations deleted due to missingness)
AIC: 2063.3

Number of Fisher Scoring iterations: 8

Interpret the R results:

The output above summarizes a logistic regression (Logit) model fitted to the data: the binary
outcome Y is predicted from all thirteen explanatory variables (X1 to X13).

Intercept: the log odds of Y = 1 when all predictors are zero. The estimate is very large but
not significant at conventional levels (p-value = 0.079).

Predictors X1, X2, X3, X4, X5, X6, X7, X8, X9, X11, and X12 have p-values greater than
0.05, suggesting they are not statistically significant in the model; they do not provide strong
evidence against the null hypothesis that their coefficients are zero.
X10: a significant predictor (p-value = 0.000163). It has a positive effect on the log odds
of Y, indicating that higher values of X10 increase the likelihood of Y = 1.
X13: also significant (p-value = 0.002275). It has a large negative effect on the log odds
of Y, suggesting that higher values of X13 decrease the likelihood of Y = 1.
Model Fit Statistics

Null Deviance: 2103.7 (deviance of a model with only the intercept).

Residual Deviance: 2035.3 (deviance of the model with predictors).

Degrees of Freedom: 15887 (null model) and 15874 (residual model), indicating 13
predictors were used.

Deviance Reduction: the drop from null deviance to residual deviance (2103.7 − 2035.3 = 68.4
on 13 degrees of freedom) indicates that the model with predictors fits better than the null
model; how much better can be tested against the chi-square distribution, as shown below.
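As a worked check (a sketch using base R), the deviance reduction can be treated as a
likelihood-ratio statistic and compared to the chi-square distribution:

# Likelihood-ratio test: fitted model vs. intercept-only model
dev_reduction <- 2103.7 - 2035.3   # 68.4
df_diff <- 15887 - 15874           # 13 predictors
pchisq(dev_reduction, df = df_diff, lower.tail = FALSE)

The statistic of 68.4 far exceeds the 5% critical value for 13 degrees of freedom (about 22.4),
so the predictors jointly improve on the null model.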

AIC (Akaike Information Criterion): 2063.3. Lower AIC values indicate a better trade-off
between goodness of fit and model complexity; here the AIC is used to compare competing models.

Significance Codes:
‘***’: Highly significant (p < 0.001)
‘**’: Significant (0.001 < p < 0.01)
‘*’: Moderately significant (0.01 < p < 0.05)
‘.’: Marginally significant (0.05 < p < 0.1)
‘ ’: Not significant (p > 0.1)

Graphical Analysis:

1. Check for missing values: If the data contains missing values, they need to be
addressed (by removing, filling in values, etc.) before building the model, as they can bias the
analysis results.

2. Check for outliers: Outliers can negatively impact the model, especially regression
models. Detecting and handling outliers can help improve the model's accuracy.

3. Check for multicollinearity: Use a correlation matrix to identify multicollinearity
between independent variables. If variables have strong correlations (>0.8), consider combining
or removing them. A minimal R sketch of all three checks follows this list.
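A minimal sketch of these three checks in R, assuming the columns are named Y and X1 to X13 as
in the summary table:

# 1. Missing values per column (here, only X7 contains NA's)
colSums(is.na(data))

# 2. Outliers: boxplots give a quick visual screen of each variable
boxplot(data[, paste0("X", 1:13)], main = "Outlier check")

# 3. Multicollinearity: correlation matrix on complete cases
corr_matrix <- cor(data, use = "complete.obs")
round(corr_matrix, 2)

# Visualize as a correlation plot / heatmap (requires the corrplot package)
library(corrplot)
corrplot(corr_matrix, method = "color")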
[Figure: Correlation plot]

The correlation analysis shows no strong linear relationship between Y and the
independent variables (X1 - X13). X1 and X2 have a high correlation, potentially causing
multicollinearity issues, while X3 and X4 have a moderate correlation. Variables X5 - X8 have
low correlations, reducing multicollinearity risk. Variable selection methods like backward
elimination or AIC/BIC criteria are recommended for model refinement.

[Figure: Correlation heatmap]

The correlation heatmap shows the relationships between the dependent variable Y and the
independent variables (X1 - X13). The color scale ranges from 0 to 1, with lighter colors
representing higher correlations. The heatmap indicates that none of the variables have a strong
correlation with Y, as the values are generally low. However, some variables, like X1 and X2,
show higher correlations with each other, which could indicate potential multicollinearity. This
suggests that further investigation is needed to refine the model.

III. Interpretation of Accuracy, TPR, and FPR


1. Accuracy, TPR, and FPR for Probit Model

Probit Model Accuracy = 0.7359

This means that the Probit model correctly classified approximately 73.59% of the observations.

Probit Model TPR = 0.5213


This indicates that the Probit model correctly identified 52.13% of the actual positive cases.

Probit Model FPR = 0.2622


This indicates that 26.22% of the actual negative cases were incorrectly classified as positive by
the Probit model.
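For reference, the Probit model behind these figures can be fitted in R as follows; this is a
minimal sketch, and the 0.5 cutoff is an assumption (the report does not state the
classification threshold used):

# Probit model: same specification as the Logit, with a probit link
probit_model <- glm(Y ~ ., family = binomial(link = "probit"), data = data)
summary(probit_model)

# Classify fitted probabilities; 0.5 is an assumed cutoff
probit_pred <- predict(probit_model, type = "response") > 0.5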

2. Accuracy, TPR, and FPR for Logit Model

Confusion Matrix (rows: actual class; columns: predicted class):

              Predicted FALSE   Predicted TRUE
  Actual 0             25,575            8,635
  Actual 1                147              158

Logit Model Accuracy = 0.7456


This indicates that the Logit model correctly classified approximately 74.56% of the
observations. This is slightly higher than the Probit model's accuracy.
Logit Model TPR = 0.518

This indicates that the Logit model correctly identified 51.8% of the actual positive cases.
This is slightly lower than the Probit model's TPR.

Logit Model FPR = 0.2524


This indicates that 25.24% of the actual negative cases were incorrectly classified as positive by
the Logit model. This is lower than the Probit model's FPR.
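All three figures follow directly from the confusion matrix above. A worked R calculation for
the Logit model, using the counts from the table:

# Confusion matrix counts for the Logit model
TN <- 25575; FP <- 8635   # actual class 0
FN <- 147;   TP <- 158    # actual class 1

accuracy <- (TP + TN) / (TP + TN + FP + FN)   # = 0.7456
tpr      <- TP / (TP + FN)                    # = 0.518  (sensitivity)
fpr      <- FP / (FP + TN)                    # = 0.2524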

IV. Summary of the ROC Curves (Logit vs. Probit):

Purpose of the ROC Curve:

- ROC (Receiver Operating Characteristic) curves are used to evaluate the performance of
classification models by plotting the True Positive Rate (TPR) against the False Positive Rate
(FPR) at various threshold levels.

- Purpose in this context: The ROC curve helps compare the performance of two models: Logit
and Probit. The closer the curve is to the top-left corner, the better the model's performance in
distinguishing between classes.

Analysis of the ROC Curves:


- Logit vs. Probit Performance:
- Both curves for Logit and Probit are plotted on the same graph.
- Probit Curve: The curve appears slightly closer to the top-left corner compared to the Logit
curve, indicating marginally better performance.
- Logit Curve: While the Logit curve also shows good performance, it slightly underperforms
compared to the Probit model.
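A comparison like this can be drawn in R with the pROC package; this is a minimal sketch, and
the object names logit_model, probit_model, and y_obs (the observed outcomes on the complete
cases) are assumptions:

# ROC curves for both models (requires the pROC package)
library(pROC)

roc_logit  <- roc(y_obs, predict(logit_model,  type = "response"))
roc_probit <- roc(y_obs, predict(probit_model, type = "response"))

plot(roc_logit, col = "blue")
plot(roc_probit, col = "red", add = TRUE)
legend("bottomright", legend = c("Logit", "Probit"),
       col = c("blue", "red"), lwd = 2)

# Area under the curve as a single-number summary of each model
auc(roc_logit); auc(roc_probit)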

V. Conclusion:
Choosing the Best Model: Based on the ROC curves, the Probit model performs slightly better
than the Logit model on this dataset: its curve is consistently closer to the top-left corner,
indicating a higher True Positive Rate (TPR) with fewer false positives across the range of
thresholds. The difference is not large, however, and both models perform relatively well.
The Probit model is therefore preferred in this analysis. Its edge in the ROC comparison
suggests a slightly better ability to classify positive cases correctly while keeping errors
low, and the Probit specification may also fit better when the latent variable underlying Y is
normally distributed. Although both models perform well, this slight advantage makes the Probit
model the preferred choice for this dataset.

---THE END---
