python_TUM

The model development process involved several methods, leveraging well-established machine learning and data processing libraries. The primary libraries utilized include pandas, numpy, matplotlib, and scikit-learn.

1. Data Handling and Feature Engineering:

For data handling and feature engineering, pandas and numpy were employed. They were mainly used for data loading, cleaning, manipulation, and feature engineering. A main advantage of pandas is that it can analyze and manipulate tabular data very efficiently, which makes it ideal for handling structured data. [1] NumPy, in turn, is the fundamental package for scientific computing in Python; it provides support for large, multi-dimensional arrays and matrices. Together, these libraries enabled efficient data processing workflows and feature extraction. [2]

Feature engineering was a crucial step in the model development process. Raw accelerometer data, given as three-dimensional vectors (x, y, z), was transformed into meaningful features. This transformation included statistical metrics such as the mean and standard deviation computed over sliding windows, as well as the magnitude of the acceleration vectors. This process captured the temporal dynamics and variability of the motion data, which was pivotal for distinguishing between different states or activities.

2. Model Building and Training:

For model building and training, the package scikit-learn was used. Specifically, the RandomForestClassifier was chosen for its robustness and effectiveness on high-dimensional datasets, and for its built-in support for feature importance evaluation. The Random Forest algorithm, as implemented in scikit-learn, builds multiple decision trees during training and outputs the mode of their class predictions for classification tasks. This ensemble method helps reduce overfitting and improves the generalization capability of the model. [3]
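A minimal sketch of this step is shown below; the synthetic data stands in for the windowed accelerometer features, and the hyperparameter values are illustrative rather than the exact configuration used:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the windowed accelerometer features
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative hyperparameters; the actual values may differ
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on held-out data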

3. Model Evaluation:

Cross-validation was utilized to evaluate the model's performance: the data was split into multiple folds, and the model was iteratively trained and tested on these folds. This provided a reliable estimate of the model's accuracy and minimized the risk of overfitting. The use of cross_val_score from scikit-learn facilitated this process by automating the partitioning of the data, model training, and performance scoring. [4]
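In scikit-learn this amounts to a single call; a minimal, self-contained sketch (the fold count of 5 is an assumption):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: each fold serves once as the test set
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")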

Challenges and Model Building Process


Several challenges were encountered during the development of this
model, which were systematically addressed to enhance model
performance and reliability.

1. Data Quality and Preprocessing

Challenge: 1) The accelerometer and TAC (Transdermal Alcohol Content) data were collected from different sensors and had varying timestamps and formats. Additionally, there were inconsistencies such as missing data, noise, and non-uniform sampling rates.
2) The dataset was too large to compute the model on all of the data: with more than 14 million records, we were not able to consider every record.
Solution: 1) To address these challenges, data loading functions were developed to handle errors such as missing files, empty files, or parsing issues. Data alignment was a critical step: timestamps were converted to a uniform format and the data was synchronized based on the nearest timestamp (see the sketch after this list). This ensured that the features extracted from the accelerometer data corresponded accurately to the TAC readings.
2) We implemented a method that draws a sample of the dataset, which made the data small enough to build the model. We compared samples of 1 million, 100 thousand, and 10 thousand records in a plot and observed a similar trend in all three options.
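A minimal sketch of the nearest-timestamp alignment and the down-sampling with pandas could look as follows; the column names and sample sizes are assumptions, and tiny inline frames stand in for the real sensor files:

import pandas as pd

# Hypothetical stand-ins for the accelerometer and TAC data
acc = pd.DataFrame({
    "time": pd.to_datetime([0, 25, 50, 75], unit="ms"),
    "x": [0.1, 0.2, 0.0, -0.1],
    "y": [9.8, 9.7, 9.9, 9.8],
    "z": [0.0, 0.1, -0.1, 0.0],
})
tac = pd.DataFrame({
    "timestamp": pd.to_datetime([10, 60], unit="ms"),
    "tac_reading": [0.02, 0.03],
})

# merge_asof matches each accelerometer row to the nearest TAC reading in time
aligned = pd.merge_asof(
    acc.sort_values("time"),
    tac.sort_values("timestamp"),
    left_on="time",
    right_on="timestamp",
    direction="nearest",
)

# Down-sampling: a random sample keeps the model computationally tractable
sample = aligned.sample(n=2, random_state=0)  # e.g. n=1_000_000 on the real data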

2. Feature Extraction and Dimensionality Reduction:

Challenge: Extracting relevant features from the raw accelerometer data was essential for building a predictive model. However, working directly with raw signals is often infeasible due to noise and high dimensionality.
Solution: Feature engineering was performed to extract meaningful statistical metrics over defined windows, capturing the behavior of the sensor data over time. Furthermore, feature importance was evaluated using the RandomForestClassifier's feature_importances_ attribute, which helped identify and retain the most significant features for model training. This was crucial for reducing the dimensionality of the dataset, enhancing the model's efficiency, and avoiding the curse of dimensionality.
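A minimal sketch of ranking features by importance with this attribute (the feature names are placeholders, not the project's actual features):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort features from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for i in order:
    print(names[i], round(clf.feature_importances_[i], 3))

The least important features at the bottom of this ranking are the candidates for removal.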

3. Model Evaluation and Tuning:

Challenge: Ensuring the model's generalizability to unseen data is always a challenge in machine learning.
Solution: To tackle this, cross-validation was performed across different configurations and hyperparameters of the RandomForestClassifier. An "Accuracy vs. Fraction of Most Important Features Kept" plot visualized how model accuracy changes as different proportions of the most important features are used, enabling fine-tuning of the feature set to optimize model performance.
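A sketch of how such a curve could be computed, combining the importance ranking with cross-validation; the fractions, fold count, and estimator settings are assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Rank features once using a forest fitted on the full feature set
base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(base.feature_importances_)[::-1]

# Cross-validated accuracy for growing fractions of the top features
for frac in [0.25, 0.5, 0.75, 1.0]:
    kept = order[: max(1, int(frac * X.shape[1]))]
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X[:, kept], y, cv=5,
    ).mean()
    print(f"{frac:.2f} of features kept -> accuracy {score:.3f}")

Plotting accuracy against the kept fraction (e.g. with matplotlib) then shows where additional features stop improving the model.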
[1] McKinney, W. (2010). Data structures for statistical computing in
Python. Proceedings of the 9th Python in Science Conference, 51-56.

[2] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen,
P., Cournapeau, D., ... & Oliphant, T. E. (2020). Array programming with
NumPy. Nature, 585(7825), 357-362.

[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer, pp. 587 ff.

[4] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, 1137-1143.
