python_TUM
For data handling and feature engineering, pandas and NumPy were
employed. Their main uses were data loading, cleaning, manipulation,
and feature engineering. One main advantage of pandas is that it can
analyze and manipulate data very efficiently, which makes it ideal
for handling structured data.[1] NumPy, on the other hand, is a
fundamental package for scientific computing with Python. It provides
support for large, multi-dimensional arrays and matrices. The integration
of these libraries enabled efficient data processing workflows and feature
extraction.[2]
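The workflow described above can be sketched as follows; the column names and transformations are illustrative assumptions, not taken from the project's actual dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a missing value (column names are assumptions).
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="h"),
    "value": [1.0, 2.5, np.nan, 4.0, 5.5, 7.0],
})

# Cleaning: impute missing values with the column mean.
df["value"] = df["value"].fillna(df["value"].mean())

# Feature engineering: a NumPy vectorized log transform and a rolling mean.
df["log_value"] = np.log1p(df["value"])
df["rolling_mean"] = df["value"].rolling(window=3, min_periods=1).mean()
```

Because pandas columns are backed by NumPy arrays, vectorized functions such as `np.log1p` apply element-wise without explicit Python loops, which is what makes this combination efficient for feature extraction.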
For model building and training, the package scikit-learn was used.
Specifically, the ‘RandomForestClassifier’ was chosen for its robustness
and effectiveness in handling high-dimensional datasets. Another
important feature is its built-in support for feature importance
evaluation. The Random Forest algorithm, as implemented in scikit-learn,
builds multiple decision trees during training and outputs the mode of
their predicted classes for classification tasks. This ensemble method
helps reduce overfitting and improves the generalization capability of
the model.[3]
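A minimal sketch of this setup is shown below; it uses a synthetic dataset from scikit-learn in place of the project's actual data, and the hyperparameters are illustrative defaults, not the values used in the report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional stand-in for the real dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ensemble of decision trees; predictions are the mode of the trees' votes.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
# Built-in feature-importance evaluation, normalized to sum to 1.
importances = clf.feature_importances_
```

The `feature_importances_` attribute is what provides the feature importance evaluation mentioned above, without any additional tooling.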
3. Model Evaluation:
[2] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen,
P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with
NumPy. Nature, 585(7825), 357-362.
[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of
Statistical Learning: Data Mining, Inference, and Prediction (2nd
ed.). Springer. pp. 587 ff.