S 10
16.01.2025
Ticket 3
1. Feature Engineering: You are developing a model to predict car prices based on features such as
mileage, age, and condition. Explain how you would create new features from this data to improve
model performance.
2. Model Selection: You are tasked with predicting heart disease based on patient health records. How would you decide between using logistic regression and decision trees for this classification problem?
3. Cross-Validation: You have developed a machine learning model for housing price prediction and
want to test its robustness. How would you implement k-fold cross-validation to ensure your model
generalizes well to unseen data?
4. Hyperparameter Tuning: You are using XGBoost for a classification task, but your model is overfitting. How would you adjust the learning rate and other hyperparameters to address this issue?
5. Evaluation Metrics: In a binary classification problem with a highly imbalanced dataset, what evaluation metric would you prioritize to avoid the problem of misleading accuracy?
Question 1:
Feature Engineering (Car Price Prediction): This question tests your ability to create meaningful features
from raw data to enhance model performance. It focuses on transforming basic car attributes into more
informative predictors.
1. Time-Based Features:
2. Usage-Based Features:
o mileage_per_year = total_mileage / car_age (normalizes mileage across cars of different ages)
o service_frequency = number_of_services / car_age (captures maintenance history relative to age)
These usage- and condition-based features are sketched in code after this list.
3. Condition-Related Features:
Convert categorical condition ratings (excellent, good, fair) into numerical scores (5,4,3)
Create binary flags for:
o accident_history (0/1)
o original_paint (0/1)
o single_owner (0/1)
4. Market-Based Features:
5. Technical Specifications:
6. Location-Based Features:
7. Advanced Transformations:
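A minimal pandas sketch of the usage- and condition-based features from items 2 and 3 above; the column names (total_mileage, car_age, number_of_services, condition, and the flag columns) are assumptions made for illustration:
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Usage-based features (clip age to avoid division by zero for brand-new cars)
    age = df['car_age'].clip(lower=1)
    df['mileage_per_year'] = df['total_mileage'] / age
    df['service_frequency'] = df['number_of_services'] / age
    # Condition-related features: map categorical ratings to numerical scores
    condition_map = {'excellent': 5, 'good': 4, 'fair': 3}
    df['condition_score'] = df['condition'].map(condition_map)
    # Binary flags (assumed to be stored as 'yes'/'no' strings)
    for flag in ['accident_history', 'original_paint', 'single_owner']:
        df[flag] = (df[flag] == 'yes').astype(int)
    return df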
Question 2:
Model Selection (Heart Disease Prediction): This question assesses your understanding of different model characteristics and how to choose between algorithms (logistic regression vs. decision trees) based on specific problem requirements in healthcare.
We need to choose between logistic regression and decision trees for heart disease prediction; the key strengths of each are listed below.
Logistic Regression advantages:
1. Interpretability
o Provides clear coefficient weights for each feature
o Doctors can understand how each medical factor contributes to the prediction
o Enables transparent communication of risk factors to patients
2. Probabilistic Output
o Gives probability scores for risk assessment
o Allows setting different probability thresholds based on clinical needs
o Useful for risk stratification in patient care
3. Statistical Properties
o Works well with smaller medical datasets
o Provides confidence intervals for predictions
o Less prone to overfitting compared to complex trees
o Good with linearly separable data
Decision Tree advantages:
1. Nonlinear Relationships
o Can capture complex interactions between symptoms
o No assumption about feature relationships
o Handles both numerical and categorical data naturally
2. Missing Data Handling
o Can work with incomplete patient records
o Built-in mechanisms for handling missing values
o No need for extensive data preprocessing
3. Feature Importance
o Provides clear visualization of decision paths
o Can identify critical diagnostic factors
o Easy to convert into clinical decision rules
Final Recommendation: For heart disease prediction, I would choose Logistic Regression because of its interpretability for clinicians, its probability outputs for risk stratification, and its lower risk of overfitting on the smaller, structured datasets typical of patient records.
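In practice the choice can also be checked empirically. A minimal sketch comparing the two candidates with cross-validated ROC-AUC; X and y stand for hypothetical patient features and heart-disease labels and are assumed to be loaded already:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}
for name, clf in candidates.items():
    # 5-fold cross-validated ROC-AUC for each candidate model
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")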
Question 3:
Cross-Validation: This question evaluates your knowledge of model validation techniques, specifically how
to implement k-fold cross-validation to ensure your housing price prediction model performs consistently
across different subsets of data.
A possible implementation with scikit-learn's KFold:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

def implement_kfold_cv(X, y, model, k=5):
    # Initialize cross-validation
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    mse_scores, r2_scores, fold_predictions = [], [], []
    for train_index, val_index in kf.split(X):
        # Split data
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]
        # Train model
        model.fit(X_train, y_train)
        # Make predictions
        predictions = model.predict(X_val)
        # Calculate metrics
        mse = mean_squared_error(y_val, predictions)
        r2 = r2_score(y_val, predictions)
        # Store results
        mse_scores.append(mse)
        r2_scores.append(r2)
        fold_predictions.append((val_index, predictions))
    # Aggregate results across folds
    return {
        'avg_mse': np.mean(mse_scores),
        'std_mse': np.std(mse_scores),
        'avg_r2': np.mean(r2_scores),
        'fold_predictions': fold_predictions
    }
1. Data Splitting:
o Use k=5 folds (standard practice)
o Shuffle data before splitting (random_state for reproducibility)
o Maintain data order tracking for later analysis
2. Performance Metrics:
o MSE (Mean Squared Error) for error magnitude
o R² score for explained variance
o Standard deviation of scores for robustness assessment
3. Validation Process:
o Each data point appears in the validation set exactly once
o Model trained k times on different train/validation splits
o Results averaged across all folds
Additional Considerations:
# Example usage
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# Run cross-validation
results = implement_kfold_cv(X_housing, y_housing, model)
print(f"Average MSE: {results['avg_mse']:.2f} (+/- {results['std_mse']:.2f})")
print(f"Average R²: {results['avg_r2']:.2f}")
4. Best Practices:
5. Error Analysis:
def analyze_cv_results(model, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        # Split data, fit, and predict for each fold
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        # Calculate residuals for this fold
        residuals = y_val - y_pred
        fold_errors.append(residuals)
    return fold_errors
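A possible usage, reusing the hypothetical X_housing and y_housing from the earlier example, to inspect per-fold residuals:
errors = analyze_cv_results(LinearRegression(), X_housing, y_housing)
for i, residuals in enumerate(errors):
    # Large shifts in mean or spread between folds point to unstable generalization
    print(f"Fold {i}: mean residual {residuals.mean():.2f}, std {residuals.std():.2f}")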
Question 4:
Hyperparameter Tuning (XGBoost): This question tests your ability to fine-tune a gradient boosting model
by adjusting various parameters to combat overfitting, requiring understanding of how different
hyperparameters affect model behavior.
# Build an XGBoost model tuned against overfitting
# (helper name is illustrative; XGBClassifier matches the stated classification task)
from xgboost import XGBClassifier

def build_regularized_model():
    params = {
        # Slow down learning
        'learning_rate': 0.05,    # Decrease from default 0.3; compensate with more boosting rounds
        'n_estimators': 500,
        # Reduce model complexity
        'max_depth': 4,           # Decrease from default 6
        'min_child_weight': 5,    # Increase to require more observations per leaf
        # Add randomness
        'subsample': 0.8,         # Use 80% of rows per tree
        'colsample_bytree': 0.8,  # Use 80% of features per tree
        # Increase regularization
        'reg_alpha': 0.1,         # L1 regularization
        'reg_lambda': 2.0         # L2 regularization
    }
    # Initialize model
    model = XGBClassifier(**params)
    return model
# Alternatively, search the hyperparameter space with cross-validation
# (illustrative wrapper; param_grid holds candidate values for the parameters above)
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(X, y, param_grid):
    model = XGBClassifier()
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=5,
        scoring='roc_auc',   # classification metric in place of MSE
        n_jobs=-1
    )
    grid_search.fit(X, y)
    return grid_search.best_params_
# Validation metrics: compare train vs. validation scores to confirm the overfitting gap has narrowed
val_pred = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_pred)   # from sklearn.metrics import roc_auc_score
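Early stopping against a validation set is another common lever against overfitting in XGBoost. A minimal sketch, assuming an xgboost version (1.6 or newer) where early_stopping_rounds is accepted in the constructor, and assuming X, y are the full training data:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = XGBClassifier(
    learning_rate=0.05,
    max_depth=4,
    n_estimators=1000,          # Upper bound; early stopping picks the actual number of rounds
    early_stopping_rounds=50,   # Stop when validation loss has not improved for 50 rounds
    eval_metric='logloss'
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration:", model.best_iteration)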
Question 5:
Evaluation Metrics (Imbalanced Classification): This question assesses your knowledge of appropriate metrics for evaluating binary classification models when dealing with imbalanced datasets, where traditional accuracy might be misleading.
I will explain the evaluation metrics for imbalanced binary classification datasets:
2. Supporting Metrics:
a) F1-Score: the harmonic mean of precision and recall, balancing false positives and false negatives for the minority class
5. Additional Considerations: rather than using the default 0.5 cut-off, tune the decision threshold on a validation set, for example by maximizing F1:
def find_best_threshold(y_true, y_proba):
    best_f1, best_threshold = 0.0, 0.5   # assumes numpy as np and sklearn.metrics.f1_score are imported
    for threshold in np.arange(0.05, 0.95, 0.01):
        f1 = f1_score(y_true, (y_proba >= threshold).astype(int))
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    return best_threshold
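To illustrate why plain accuracy misleads, a minimal sketch with a hypothetical 1%-positive dataset where the model predicts the majority class for every sample:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced labels: 1% positives, and a model that always predicts "negative"
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.99, looks great but is useless
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall   :", recall_score(y_true, y_pred))                      # 0.0, no positives are found
print("F1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0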
This comprehensive approach ensures proper evaluation of model performance on imbalanced datasets,
focusing on metrics that matter for the minority class while maintaining overall model quality.