codeppsjf
In [ ]:
df.head()
In [ ]:
df.info()
In [ ]:
df.describe(include='all').T
Data Preprocessing
Check that the dataset contains no missing values.
In [ ]:
# Checking for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values[missing_values > 0])
Make sure that no missing values are concealed as question marks by replacing every "?" with NaN.
In [ ]:
df.replace("?", np.nan, inplace=True)
To facilitate subsequent processing, convert every column that can be parsed as numeric to the float data type. Columns that cannot be converted are left unchanged.
In [ ]:
for column in df.columns:
    try:
        df[column] = df[column].astype(float)
    except ValueError:
        # Non-numeric columns are left as they are
        pass
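The effect of the two preprocessing steps can be checked on a small toy frame. This is only a sketch: the values below are made up for illustration, and only the `replace` and `astype(float)` steps are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one numeric column stored as text with a
# "?" placeholder, and one genuinely categorical column.
toy = pd.DataFrame({
    "Torque [Nm]": ["40.5", "?", "38.2"],
    "Type": ["L", "M", "H"],
})

# Step 1: expose hidden missing values.
toy = toy.replace("?", np.nan)

# Step 2: convert each column to float where possible, skip the rest.
for column in toy.columns:
    try:
        toy[column] = toy[column].astype(float)
    except ValueError:
        pass  # a non-numeric column such as "Type" stays as it is

print(toy.dtypes)
print(toy.isnull().sum())
```

After the loop, the torque column is numeric with one missing value, while the categorical column is untouched.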
Check the summary statistics of the numeric features for missing values and apparent outliers.
In [ ]:
# Selecting numeric columns
numeric_df = df.select_dtypes(include=[np.number])
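One possible way to make the check for missing values and apparent outliers explicit is sketched below. The helper `summarize_numeric` and the 1.5 * IQR rule are assumptions chosen for illustration, not something the notebook itself uses, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def summarize_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: missing-value count and values outside 1.5 * IQR."""
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values are not counted
    # as outliers.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame({
        "missing": numeric.isnull().sum(),
        "iqr_outliers": outside.sum(),
    })

# Hypothetical data using two of the notebook's column names.
toy = pd.DataFrame({
    "Torque [Nm]": [40.0, 41.0, 39.0, 40.5, 95.0],
    "Tool wear [min]": [0.0, 50.0, np.nan, 150.0, 200.0],
    "Type": ["L", "M", "H", "L", "M"],
})
report = summarize_numeric(toy)
print(report)
```

In the notebook, the same helper could be applied to `df` directly, since it selects the numeric columns itself.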
Feature Engineering
These calculations create new features from existing columns in the DataFrame, which may provide additional signal for analysis or modeling.
1. Power = Rotational speed [rpm] * Torque [Nm]: the product of the rotational speed (in revolutions per minute) and the torque (in newton meters). This value is proportional to mechanical power; a value in watts would additionally require converting rpm to rad/s.
2. Power wear = Power * Tool wear [min]: the product of Power and the tool wear (in minutes).
3. Temperature difference = Process temperature [K] - Air temperature [K]: the process temperature minus the air temperature (both in kelvin).
4. Temperature power = Temperature difference / Power: the temperature difference divided by Power.
In [ ]:
# Calculating new features based on the existing columns
df['Power'] = df['Rotational speed [rpm]'] * df['Torque [Nm]']
df['Power wear'] = df['Power'] * df['Tool wear [min]']
df['Temperature difference'] = df['Process temperature [K]'] - df['Air temperature [K]']
df['Temperature power'] = df['Temperature difference'] / df['Power']
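As a worked example, the four formulas can be evaluated for a single row. The input values are hypothetical. The extra `Power [W]` column is an assumption that is not part of the notebook: it shows the rpm-to-rad/s conversion that would be needed to express the product in watts.

```python
import numpy as np
import pandas as pd

# One hypothetical row using the notebook's column names.
row = pd.DataFrame({
    "Rotational speed [rpm]": [1500.0],
    "Torque [Nm]": [40.0],
    "Tool wear [min]": [100.0],
    "Air temperature [K]": [298.0],
    "Process temperature [K]": [308.5],
})

# Same formulas as in the notebook.
row["Power"] = row["Rotational speed [rpm]"] * row["Torque [Nm]"]
row["Power wear"] = row["Power"] * row["Tool wear [min]"]
row["Temperature difference"] = (
    row["Process temperature [K]"] - row["Air temperature [K]"]
)
row["Temperature power"] = row["Temperature difference"] / row["Power"]

# Assumption (not in the notebook): power in watts uses the angular speed in
# rad/s, i.e. rpm * 2 * pi / 60.
row["Power [W]"] = row["Power"] * 2 * np.pi / 60

print(row.T)
```

For this row, Power is 1500 * 40 = 60000, Power wear is 60000 * 100 = 6000000, the temperature difference is 10.5 K, and the power in watts is roughly 6283.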
In [ ]:
df.describe(include='all').T
Data Visualizations
In [ ]:
# Counting the occurrences of each machine failure mode
failure_counts = df['Machine failure'].value_counts()
In [ ]:
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df, hue='Machine failure')
plt.suptitle('Relationship between Features with Respect to Machine Failure', y=1.02)
plt.tight_layout()
plt.show()
In [ ]:
# Calculating the correlation matrix
# numeric_only=True skips columns that could not be converted to float
corr_matrix = df.corr(numeric_only=True).round(2)
Standardization
Standardizing the features rescales each one to a mean of 0 and a standard deviation of 1. This suits algorithms that are sensitive to feature scale, such as PCA and regularized logistic regression. It does not change the shape of a feature's distribution. The step matters here because the features have vastly different value ranges, and without rescaling the features with the largest ranges would dominate.
In [ ]:
# Separating features and target variable
X = df.drop('Machine failure', axis=1)
y = df['Machine failure']
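The scaling step itself is not shown in this section. A minimal sketch, assuming scikit-learn's `StandardScaler` is applied to numeric feature columns, could look as follows. The frame `X_toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature frame with very different value ranges.
X_toy = pd.DataFrame({
    "Rotational speed [rpm]": [1400.0, 1500.0, 1600.0, 1700.0],
    "Torque [Nm]": [45.0, 40.0, 38.0, 35.0],
})

scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_toy),
    columns=X_toy.columns,
    index=X_toy.index,
)

# Each column now has mean 0 and (population) standard deviation 1.
print(X_scaled.mean().round(6))
print(X_scaled.std(ddof=0).round(6))
```

`StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when verifying the result with pandas.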
Feature Selection
In this variant, we train the machine learning models exclusively on the four engineered features: Power, Power wear, Temperature difference, and Temperature power. Each was derived to capture information that is relevant to the failure modes observed in the manufacturing process.
The dataset distinguishes five failure modes: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), and random failures (RNF). TWF occurs when the tool reaches a randomly selected wear time between 200 and 240 minutes, at which point it is either replaced or fails. HDF occurs when the difference between process and air temperature falls below 8.6 K while the rotational speed is below 1380 rpm. PWF occurs when the power required for the process, computed as the product of torque and rotational speed, falls outside the range of 3500 W to 9000 W. OSF is triggered when the product of tool wear and torque exceeds a threshold that depends on the product variant. Finally, a small percentage of random failures (RNF) occur independently of the process parameters.
In [ ]:
selected_features = ['Power', 'Power wear', 'Temperature difference',
'Temperature power']
X_fs = df[selected_features]
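To illustrate how the engineered quantities connect to the failure rules described above, the HDF and PWF conditions can be written as boolean masks. This is a sketch on invented rows. It assumes that the 3500 W and 9000 W limits refer to power in watts, so rpm is converted to rad/s first, and the column names `hdf_rule` and `pwf_rule` are hypothetical.

```python
import numpy as np
import pandas as pd

# Three hypothetical processes using the notebook's column names.
toy = pd.DataFrame({
    "Air temperature [K]": [300.0, 302.0, 298.0],
    "Process temperature [K]": [310.0, 310.0, 308.0],
    "Rotational speed [rpm]": [1500.0, 1350.0, 2800.0],
    "Torque [Nm]": [40.0, 50.0, 10.0],
})

temp_diff = toy["Process temperature [K]"] - toy["Air temperature [K]"]
# Assumption: power in watts = torque [Nm] * angular speed [rad/s].
power_w = toy["Torque [Nm]"] * toy["Rotational speed [rpm]"] * 2 * np.pi / 60

# HDF: temperature difference below 8.6 K and rotational speed below 1380 rpm.
toy["hdf_rule"] = (temp_diff < 8.6) & (toy["Rotational speed [rpm]"] < 1380)
# PWF: power outside the range of 3500 W to 9000 W.
toy["pwf_rule"] = (power_w < 3500) | (power_w > 9000)

print(toy[["hdf_rule", "pwf_rule"]])
```

The second row satisfies the HDF condition (8.0 K difference at 1350 rpm), and the third row satisfies the PWF condition (about 2932 W).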
Data Splitting
The dataset is split into training and test sets in an 80:20 ratio.
In [ ]:
# Splitting the dataset into train and test sets (80% train, 20% test) for PCA
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.2, random_state=42)
In [ ]:
# Splitting the dataset into train and test sets (80% train, 20% test) for LDA
X_train_lda, X_test_lda, y_train_lda, y_test_lda = train_test_split(
    X, y, test_size=0.2, random_state=42)
In [ ]:
# Splitting the dataset into train and test sets (80% train, 20% test) for Feature Selection
X_train_fs, X_test_fs, y_train_fs, y_test_fs = train_test_split(
    X_fs, y, test_size=0.2, random_state=42)
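Because machine failures are rare, a plain random split can leave the test set with a different share of failures than the training set. A possible refinement, which the notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic data; `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 30 positives out of 1000 (made-up data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.zeros(1000, dtype=int)
y_demo[:30] = 1

# stratify=y_demo keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratification, both parts keep the 3% failure rate of the full synthetic set.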
In [ ]:
# Applying RandomOverSampler to address data imbalance only on the training data of PCA
X_train_resampled_pca, y_train_resampled_pca = ros.fit_resample(
    X_train_pca, y_train_pca)
In [ ]:
# Applying RandomOverSampler to address data imbalance only on the training data of LDA
X_train_resampled_lda, y_train_resampled_lda = ros.fit_resample(
    X_train_lda, y_train_lda)
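The idea behind random oversampling can be sketched with NumPy alone. The helper `random_oversample` below is hypothetical and is not the notebook's `ros` object; it only mimics the general technique of duplicating minority-class rows at random until the classes are balanced.

```python
import numpy as np

def random_oversample(X, y, random_state=42):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # Draw the missing rows with replacement from this class.
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Made-up training data: 8 negatives, 2 positives.
X_train_demo = np.arange(20, dtype=float).reshape(10, 2)
y_train_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_res, y_res = random_oversample(X_train_demo, y_train_demo)
print(np.bincount(y_res))
```

Later cells refer to `X_train_resampled_fs` and `y_train_resampled_fs`, whose creation is not shown in this section. Presumably they come from the analogous call `ros.fit_resample(X_train_fs, y_train_fs)`. As in the cells above, only the training data should be resampled, so the test set keeps its original class distribution.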
In [ ]:
# Applying Linear Discriminant Analysis (LDA)
lda = LinearDiscriminantAnalysis()
X_train_resampled_lda = lda.fit_transform(X_train_resampled_lda,
y_train_resampled_lda)
X_test_lda = lda.transform(X_test_lda)
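With a binary target such as `Machine failure`, LDA can produce at most one discriminant component (number of classes minus one), so the transformed training and test sets each have a single column. The sketch below shows this on synthetic two-class data; `X_demo`, `y_demo`, and `lda_demo` are illustrative names.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic two-class data with 5 features (made up for illustration).
X_demo = np.vstack([
    rng.normal(loc=0.0, size=(50, 5)),
    rng.normal(loc=1.0, size=(50, 5)),
])
y_demo = np.array([0] * 50 + [1] * 50)

lda_demo = LinearDiscriminantAnalysis()
X_proj = lda_demo.fit_transform(X_demo, y_demo)

# Five input features are projected onto a single discriminant axis.
print(X_proj.shape)
```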
Logistic Regression
In [ ]:
# Logistic Regression model with GridSearchCV for hyperparameter tuning with PCA
logistic_regression = LogisticRegression(max_iter=1000)
param_grid_lr = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search_lr = GridSearchCV(estimator=logistic_regression,
param_grid=param_grid_lr, cv=5, scoring='f1_weighted')
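The `f1_weighted` scoring averages the per-class F1 scores weighted by class frequency, so on imbalanced data it is dominated by the majority class. The made-up labels below illustrate how this differs from the macro average and from the F1 score of the positive class alone.

```python
from sklearn.metrics import f1_score

# Made-up labels: 95 negatives and 5 positives; only 1 positive is detected.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4

f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_positive = f1_score(y_true, y_pred, average="binary")

print(round(f1_weighted, 3), round(f1_macro, 3), round(f1_positive, 3))
```

Here the weighted score stays high even though four of the five failures are missed, which is worth keeping in mind when comparing models on a rare-failure target.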
In [ ]:
# Fitting GridSearchCV to the resampled PCA training data
grid_search_lr.fit(X_train_resampled_pca, y_train_resampled_pca)
In [ ]:
# Fitting GridSearchCV to the resampled LDA training data
# (each call to fit replaces the results of the previous search)
grid_search_lr.fit(X_train_resampled_lda, y_train_resampled_lda)
In [ ]:
# Fitting GridSearchCV to the resampled feature-selection training data
grid_search_lr.fit(X_train_resampled_fs, y_train_resampled_fs)
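Since the same search object is fitted three times, results and predictions have to be collected before each refit. One way to keep all three searches side by side is to clone the search for every feature set. The sketch below uses synthetic data from `make_classification` rather than the notebook's data, with a reduced grid; `feature_sets`, `fitted`, and `predictions` are hypothetical names.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-ins for the three feature sets (not the notebook's data).
feature_sets = {}
for name, n_features in [("pca", 3), ("lda", 2), ("fs", 4)]:
    X_demo, y_demo = make_classification(
        n_samples=300, n_features=n_features, n_informative=2,
        n_redundant=0, random_state=42,
    )
    feature_sets[name] = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=42
    )

base_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1, 100]},
    cv=5,
    scoring="f1_weighted",
)

fitted, predictions = {}, {}
for name, (X_tr, X_te, y_tr, y_te) in feature_sets.items():
    search = clone(base_search)  # fresh, unfitted copy per feature set
    search.fit(X_tr, y_tr)
    fitted[name] = search
    # predict uses the best estimator refitted on the full training part
    predictions[name] = search.predict(X_te)

for name, search in fitted.items():
    print(name, search.best_params_)
```

The same pattern would apply to the Random Forest, Decision Tree, and XGBoost searches.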
Random Forest
In [ ]:
# Random Forest Classifier model with GridSearchCV for hyperparameter tuning
rf_classifier = RandomForestClassifier(random_state=42)
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_rf = GridSearchCV(estimator=rf_classifier,
                              param_grid=param_grid_rf, cv=5, scoring='f1_weighted')
In [ ]:
# Fitting GridSearchCV to the resampled PCA training data
grid_search_rf.fit(X_train_resampled_pca, y_train_resampled_pca)
In [ ]:
# Fitting GridSearchCV to the resampled LDA training data
grid_search_rf.fit(X_train_resampled_lda, y_train_resampled_lda)
In [ ]:
# Fitting GridSearchCV to the resampled feature-selection training data
grid_search_rf.fit(X_train_resampled_fs, y_train_resampled_fs)
Decision Tree
In [ ]:
# Decision Tree Classifier model with GridSearchCV for hyperparameter tuning
dt_classifier = DecisionTreeClassifier(random_state=42)
param_grid_dt = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_dt = GridSearchCV(estimator=dt_classifier,
                              param_grid=param_grid_dt, cv=5, scoring='f1_weighted')
In [ ]:
# Fitting GridSearchCV to the resampled PCA training data
grid_search_dt.fit(X_train_resampled_pca, y_train_resampled_pca)
In [ ]:
# Fitting GridSearchCV to the resampled LDA training data
grid_search_dt.fit(X_train_resampled_lda, y_train_resampled_lda)
In [ ]:
# Fitting GridSearchCV to the resampled feature-selection training data
grid_search_dt.fit(X_train_resampled_fs, y_train_resampled_fs)
XGBoost
In [ ]:
# XGBoost Classifier model with GridSearchCV for hyperparameter tuning
xgb_classifier = XGBClassifier(random_state=42)
param_grid_xgb = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2]
}
grid_search_xgb = GridSearchCV(estimator=xgb_classifier,
                               param_grid=param_grid_xgb, cv=5, scoring='f1_weighted')
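The cost of this exhaustive search can be estimated before running it. The sketch below counts the candidate combinations with scikit-learn's `ParameterGrid`, using a copy of the grid above; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the XGBoost grid from the notebook.
param_grid_xgb = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2]
}

n_candidates = len(ParameterGrid(param_grid_xgb))  # 3 ** 6 combinations
n_fits = n_candidates * 5                          # one fit per CV fold (cv=5)

print(n_candidates, n_fits)
```

Six parameters with three values each give 729 candidates, or 3645 cross-validation fits per feature set, plus one final refit. If that proves too slow, a randomized search over the same grid would be one alternative to consider.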
In [ ]:
# Fitting GridSearchCV to the resampled PCA training data
grid_search_xgb.fit(X_train_resampled_pca, y_train_resampled_pca)
In [ ]:
# Fitting GridSearchCV to the resampled LDA training data
grid_search_xgb.fit(X_train_resampled_lda, y_train_resampled_lda)
In [ ]:
# Fitting GridSearchCV to the resampled feature-selection training data
grid_search_xgb.fit(X_train_resampled_fs, y_train_resampled_fs)
In [ ]:
# F1-Score weighted average for each model
f1_scores_pca = {
'Logistic Regression': f1_score(y_test_pca, y_pred_lr_pca,
average='weighted'),
'Random Forest': f1_score(y_test_pca, y_pred_rf_pca,
average='weighted'),
'Decision Tree': f1_score(y_test_pca, y_pred_dt_pca,
average='weighted'),
'XGBoost': f1_score(y_test_pca, y_pred_xgb_pca, average='weighted')
}
f1_scores_lda = {
'Logistic Regression': f1_score(y_test_lda, y_pred_lr_lda,
average='weighted'),
'Random Forest': f1_score(y_test_lda, y_pred_rf_lda,
average='weighted'),
'Decision Tree': f1_score(y_test_lda, y_pred_dt_lda,
average='weighted'),
'XGBoost': f1_score(y_test_lda, y_pred_xgb_lda, average='weighted')
}
f1_scores_fs = {
'Logistic Regression': f1_score(y_test_fs, y_pred_lr_fs,
average='weighted'),
'Random Forest': f1_score(y_test_fs, y_pred_rf_fs, average='weighted'),
'Decision Tree': f1_score(y_test_fs, y_pred_dt_fs, average='weighted'),
'XGBoost': f1_score(y_test_fs, y_pred_xgb_fs, average='weighted')
}
# Adding labels
plt.xlabel('Models')
plt.ylabel('F1-Score Weighted Average')
plt.xticks([r + bar_width for r in range(len(models))], models)
plt.ylim(0, 1)
plt.title('F1-Score Weighted Average for Different Models (PCA vs LDA vs Feature Selection)')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
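The plotting cell relies on `models` and `bar_width`, whose definitions are not shown in this section. A minimal sketch of how such a grouped bar chart might be assembled is given below. The scores are placeholders and NOT results from the notebook, and `placeholder_scores` and `positions` are hypothetical names.

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["Logistic Regression", "Random Forest", "Decision Tree", "XGBoost"]

# Placeholder numbers only, NOT results from the notebook. In the notebook,
# these lists would come from f1_scores_pca, f1_scores_lda, and f1_scores_fs.
placeholder_scores = {
    "PCA": [0.5, 0.5, 0.5, 0.5],
    "LDA": [0.6, 0.6, 0.6, 0.6],
    "Feature Selection": [0.7, 0.7, 0.7, 0.7],
}

bar_width = 0.25
positions = np.arange(len(models))

fig, ax = plt.subplots(figsize=(10, 6))
# One group of three bars per model, shifted by one bar width per variant.
for offset, (label, scores) in enumerate(placeholder_scores.items()):
    ax.bar(positions + offset * bar_width, scores, width=bar_width, label=label)

# Center the tick labels under the middle bar of each group.
ax.set_xticks(positions + bar_width)
ax.set_xticklabels(models)
ax.set_xlabel("Models")
ax.set_ylabel("F1-Score Weighted Average")
ax.set_ylim(0, 1)
ax.legend()
ax.grid(axis="y", linestyle="--", alpha=0.7)
```

In a notebook, the figure is displayed at the end of the cell; elsewhere, `plt.show()` can be called afterwards.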