
E27-24

16.01.2025

Ticket 3

1. Feature Engineering: You are developing a model to predict car prices based on features such as
mileage, age, and condition. Explain how you would create new features from this data to improve
model performance.

2. Model Selection: You are tasked with predicting heart disease based on patient health records. How
would you decide between using logistic regression and decision trees for this classification problem?

3. Cross-Validation: You have developed a machine learning model for housing price prediction and
want to test its robustness. How would you implement k-fold cross-validation to ensure your model
generalizes well to unseen data?

4. Hyperparameter Tuning: You are using XGBoost for a classification task, but your model is overfitting.
How would you adjust the learning rate and other hyperparameters to address this issue?

5. Evaluation Metrics: In a binary classification problem with a highly imbalanced dataset, what evaluation
metric would you prioritize to avoid the problem of misleading accuracy?

Question 1:

Feature Engineering (Car Price Prediction): This question tests your ability to create meaningful features
from raw data to enhance model performance. It focuses on transforming basic car attributes into more
informative predictors.

Feature Engineering for Car Price Prediction:

Here's how I would create new features to improve model performance (a short code sketch follows the list):

1. Time-Based Features:

 Create car_age = current_year - manufacture_year
 Calculate months_since_last_service
 Generate quarter_of_manufacture to capture seasonal effects

2. Usage-Based Features:

 mileage_per_year = total_mileage / car_age (normalizes mileage across cars of different ages)
 service_frequency = number_of_services / car_age (captures maintenance history relative to age)

3. Condition-Related Features:

 Convert categorical condition ratings (excellent, good, fair) into numerical scores (5, 4, 3)
 Create binary flags for:
o accident_history (0/1)
o original_paint (0/1)
o single_owner (0/1)

4. Market-Based Features:

 brand_premium = average_brand_price / average_market_price
 model_popularity = sales_volume / total_market_volume
 depreciation_rate = (original_price - current_price) / original_price

5. Technical Specifications:

 power_to_weight_ratio = engine_power / vehicle_weight
 fuel_efficiency_score = miles_per_gallon * engine_size
 Create interaction terms between engine size and fuel type

6. Location-Based Features:

 Create demographic features based on registration location
 Generate climate-based wear indicators based on geographic location

7. Advanced Transformations:

 Apply log transformation to mileage and price to handle skewness
 Create polynomial features for age and mileage (age², mileage²)
 Generate interaction terms between key features, such as age*mileage
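A minimal pandas sketch of a few of these transformations; the column names (manufacture_year, mileage, condition) are assumed placeholders rather than a fixed schema:

import numpy as np
import pandas as pd

def add_car_features(df: pd.DataFrame, current_year: int = 2025) -> pd.DataFrame:
    # Illustrative engineered features for car price prediction
    out = df.copy()

    # Time- and usage-based features
    out['car_age'] = current_year - out['manufacture_year']
    out['mileage_per_year'] = out['mileage'] / out['car_age'].clip(lower=1)

    # Ordinal score for the condition rating
    condition_map = {'excellent': 5, 'good': 4, 'fair': 3}
    out['condition_score'] = out['condition'].map(condition_map)

    # Skew handling and interaction terms
    out['log_mileage'] = np.log1p(out['mileage'])
    out['age_squared'] = out['car_age'] ** 2
    out['age_x_mileage'] = out['car_age'] * out['mileage']

    return out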

Question 2:

Model Selection (Heart Disease Prediction): This question assesses your understanding of different model
characteristics and how to choose between algorithms (logistic regression vs. decision trees) based on
specific problem requirements in healthcare.

The task is to choose between logistic regression and decision trees for heart disease prediction.

Model Selection for Heart Disease Prediction:

Let's analyze both options systematically:

Logistic Regression Benefits:

1. Interpretability
o Provides clear coefficient weights for each feature
o Doctors can understand how each medical factor contributes to the prediction
o Enables transparent communication of risk factors to patients
2. Probabilistic Output
o Gives probability scores for risk assessment
o Allows setting different probability thresholds based on clinical needs
o Useful for risk stratification in patient care
3. Statistical Properties
o Works well with smaller medical datasets
o Provides confidence intervals for predictions
o Less prone to overfitting compared to complex trees
o Good with linearly separable data

Decision Trees Benefits:

1. Nonlinear Relationships
o Can capture complex interactions between symptoms
o No assumption about feature relationships
o Handles both numerical and categorical data naturally
2. Missing Data Handling
o Can work with incomplete patient records
o Built-in mechanisms for handling missing values
o No need for extensive data preprocessing
3. Feature Importance
o Provides clear visualization of decision paths
o Can identify critical diagnostic factors
o Easy to convert into clinical decision rules

Final Recommendation: For heart disease prediction, I would choose Logistic Regression because:

1. Medical decisions require interpretable models
2. Feature relationships in medical data are often well studied and relatively linear
3. Probability scores are needed for risk assessment
4. Smaller medical datasets benefit from simpler models
5. Regulatory compliance often favors interpretable models
6. The model is easy to update and maintain as new data arrives

A small comparison of the two candidates is sketched below.
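To support this choice, the two candidates can also be compared empirically; a minimal sketch, assuming a prepared feature matrix X and binary labels y (the cross-validation settings and hyperparameters are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_candidates(X, y):
    # Logistic regression benefits from scaled inputs; trees do not need scaling
    candidates = {
        'logistic_regression': make_pipeline(StandardScaler(),
                                             LogisticRegression(max_iter=1000)),
        'decision_tree': DecisionTreeClassifier(max_depth=4, random_state=42),
    }
    for name, model in candidates.items():
        # ROC AUC uses the probability outputs available from both models
        scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
        print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")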

Question 3:

Cross-Validation: This question evaluates your knowledge of model validation techniques, specifically how
to implement k-fold cross-validation to ensure your housing price prediction model performs consistently
across different subsets of data.

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

def implement_kfold_cv(X, y, model, k=5):
    # Initialize K-Fold cross-validator
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    # Lists to store performance metrics
    mse_scores = []
    r2_scores = []
    fold_predictions = []

    # Iterate through each fold
    for fold, (train_index, val_index) in enumerate(kf.split(X)):
        # Split data into training and validation sets
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        predictions = model.predict(X_val)

        # Calculate performance metrics
        mse = mean_squared_error(y_val, predictions)
        r2 = r2_score(y_val, predictions)

        # Store results
        mse_scores.append(mse)
        r2_scores.append(r2)
        fold_predictions.append((val_index, predictions))

        print(f"Fold {fold+1}: MSE = {mse:.2f}, R2 = {r2:.2f}")

    # Calculate aggregate metrics
    avg_mse = np.mean(mse_scores)
    std_mse = np.std(mse_scores)
    avg_r2 = np.mean(r2_scores)

    return {
        'avg_mse': avg_mse,
        'std_mse': std_mse,
        'avg_r2': avg_r2,
        'fold_predictions': fold_predictions
    }

Key Implementation Details:

1. Data Splitting:
o Use k=5 folds (standard practice)
o Shuffle data before splitting (random_state for reproducibility)
o Maintain data order tracking for later analysis
2. Performance Metrics:
o MSE (Mean Squared Error) for error magnitude
o R² score for explained variance
o Standard deviation of scores for robustness assessment
3. Validation Process:
o Each data point appears in test set exactly once
o Model trained k times on different train/test splits
o Results averaged across all folds

Additional Considerations:

1. Stratification for price ranges (if needed)
2. Preprocessing within each fold to prevent data leakage
3. Feature scaling/normalization inside the cross-validation loop
4. Handling of outliers consistently across folds

# Example usage
from sklearn.linear_model import LinearRegression

# Initialize your model
model = LinearRegression()

# Run cross-validation (X_housing, y_housing are the prepared features and target)
results = implement_kfold_cv(X_housing, y_housing, model)
print(f"Average MSE: {results['avg_mse']:.2f} (+/- {results['std_mse']:.2f})")
print(f"Average R²: {results['avg_r2']:.2f}")

We can also present this as a step-by-step outline:

1. Basic Implementation:

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

def implement_kfold_cv(X, y, model, k=5):
    # Initialize KFold
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    # List to store scores
    mse_scores = []

    # Perform k-fold CV
    for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
        # Split data
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Train model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_val)

        # Calculate metrics
        mse = mean_squared_error(y_val, y_pred)
        mse_scores.append(mse)

        print(f"Fold {fold+1} MSE: {mse:.2f}")

    # Calculate average scores
    avg_mse = np.mean(mse_scores)
    std_mse = np.std(mse_scores)

    return avg_mse, std_mse


2. Key Components to Consider:

 Stratification: for maintaining the distribution of the target variable
 Shuffling: to ensure a random distribution of data
 Number of folds: typically 5 or 10, depending on dataset size

3. Enhanced Version with Data Preprocessing:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

def enhanced_cv(X, y, model, k=5):
    # Create pipeline so scaling is fit only on each fold's training data
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Initialize cross-validation
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    scores = []
    for train_idx, val_idx in kf.split(X):
        # Split data
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Fit pipeline and predict
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_val)

        # Calculate error
        mse = mean_squared_error(y_val, y_pred)
        scores.append(mse)

    return np.mean(scores), np.std(scores)

4. Best Practices:

 Always shuffle data before splitting
 Use stratification for imbalanced datasets or skewed price ranges (a sketch follows this list)
 Scale features within each fold, not before splitting
 Save model performance metrics for each fold
 Consider time-based splitting for time series data
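A minimal sketch of stratification for a continuous target such as price, assuming X and y are NumPy arrays; the quantile binning is only one possible scheme:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

def stratified_regression_cv(X, y, model, k=5, n_bins=5):
    # Bin the continuous target so every fold sees a similar price distribution
    bin_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_binned = np.digitize(y, bin_edges)

    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y_binned):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        scores.append(mean_squared_error(y[val_idx], preds))

    return np.mean(scores), np.std(scores)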

5. Error Analysis:
def analyze_cv_results(model, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_errors = []

    for train_idx, val_idx in kf.split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)

        # Calculate residuals for each fold
        residuals = y_val - y_pred
        fold_errors.append(residuals)

    return fold_errors

Question 4:

Hyperparameter Tuning (XGBoost): This question tests your ability to fine-tune a gradient boosting model
by adjusting various parameters to combat overfitting, requiring understanding of how different
hyperparameters affect model behavior.

The goal is to tune the XGBoost hyperparameters so that the model stops overfitting.

XGBoost Hyperparameter Tuning Strategy:

1. Key Parameters to Address Overfitting:

params = {
    # Reduce model complexity
    'max_depth': 4,             # decrease from the default of 6
    'min_child_weight': 5,      # increase to require more observations per leaf

    # Slow down learning
    'learning_rate': 0.01,      # decrease from the default (0.3 in recent XGBoost)
    'n_estimators': 1000,       # increase to compensate for the lower learning rate

    # Add randomness
    'subsample': 0.8,           # use 80% of rows per tree
    'colsample_bytree': 0.8,    # use 80% of features per tree

    # Increase regularization
    'reg_alpha': 0.1,           # L1 regularization
    'reg_lambda': 2.0           # L2 regularization
}

2. Implementation with Early Stopping:


from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

def train_xgb_with_early_stopping(X, y):
    # Split data
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Initialize model (recent XGBoost versions expect early-stopping
    # settings in the constructor rather than in fit())
    model = XGBRegressor(
        **params,
        early_stopping_rounds=50,
        eval_metric='rmse',
    )

    # Train with early stopping on the validation set
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)],
        verbose=100
    )

    return model

3. Grid Search for Optimal Parameters:

from sklearn.model_selection import GridSearchCV

def tune_xgb_parameters(X, y):
    param_grid = {
        'max_depth': [3, 4, 5],
        'min_child_weight': [3, 5, 7],
        'learning_rate': [0.01, 0.05],
        'subsample': [0.7, 0.8, 0.9],
        'colsample_bytree': [0.7, 0.8, 0.9]
    }

    model = XGBRegressor()
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )

    grid_search.fit(X, y)
    return grid_search.best_params_

4. Monitoring and Validation:


import numpy as np
from sklearn.metrics import mean_squared_error

def validate_xgb_model(model, X_train, X_val, y_train, y_val):
    # Training metrics
    train_pred = model.predict(X_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))

    # Validation metrics
    val_pred = model.predict(X_val)
    val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))

    print(f"Training RMSE: {train_rmse:.4f}")
    print(f"Validation RMSE: {val_rmse:.4f}")

    # Check for overfitting: training error far below validation error
    if train_rmse / val_rmse < 0.8:
        print("Warning: Model might be overfitting")

5. Progressive Tuning Approach:

 Start with a high learning rate (0.1) to identify the other parameters
 Tune tree-specific parameters (max_depth, min_child_weight)
 Tune regularization parameters (reg_alpha, reg_lambda)
 Lower the learning rate and increase n_estimators (a sketch of this step follows the list)
 Fine-tune the subsample and colsample parameters
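A brief sketch of that final step, assuming best_params came from the grid search above; the learning rate, tree budget, and early-stopping settings are illustrative:

from xgboost import XGBRegressor

def refit_with_low_learning_rate(X_train, y_train, X_val, y_val, best_params):
    # Override the tuned learning rate with a lower one and raise the tree
    # budget; early stopping then picks the effective number of estimators
    final_params = {**best_params, 'learning_rate': 0.01, 'n_estimators': 2000}
    final_model = XGBRegressor(
        **final_params,
        early_stopping_rounds=50,   # constructor argument in recent XGBoost versions
        eval_metric='rmse',
    )
    final_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return final_model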

Question 5:

Evaluation Metrics (Imbalanced Classification): This question assesses your knowledge of appropriate
metrics for evaluating binary classification models when dealing with imbalanced datasets, where
traditional accuracy might be misleading.

I will explain the evaluation metrics for imbalanced binary classification datasets:

For imbalanced datasets, I would prioritize these metrics:

1. Area Under Precision-Recall Curve (AUPRC):

from sklearn.metrics import precision_recall_curve, auc

def calculate_auprc(y_true, y_pred_proba):
    # Compute precision-recall pairs and integrate under the curve
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    auprc = auc(recall, precision)
    return auprc

Why AUPRC is preferred (a small comparison sketch follows this list):

 Not affected by the large number of true negatives
 More sensitive to improvements on the minority class
 Better represents model performance when classes are imbalanced
 Focuses on positive-class prediction quality
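A small illustration of the gap between ROC AUC and AUPRC on a synthetic imbalanced dataset; the dataset parameters are for demonstration only, and average_precision_score is used as a summary of the precision-recall curve:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Roughly 2% positive class to mimic a heavily imbalanced problem
X, y = make_classification(n_samples=20000, weights=[0.98], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC AUC tends to look flattering thanks to the abundant negatives,
# while AUPRC stays focused on minority-class prediction quality
print(f"ROC AUC: {roc_auc_score(y_te, proba):.3f}")
print(f"AUPRC:   {average_precision_score(y_te, proba):.3f}")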

2. Supporting Metrics:

a) F1-Score:

from sklearn.metrics import f1_score

def calculate_f1(y_true, y_pred):
    return f1_score(y_true, y_pred)

 Harmonic mean of precision and recall
 Balances false positives and false negatives

b) Matthews Correlation Coefficient (MCC):

from sklearn.metrics import matthews_corrcoef

def calculate_mcc(y_true, y_pred):
    return matthews_corrcoef(y_true, y_pred)

 Provides a balanced measure even with varying class sizes
 Takes into account all four confusion matrix categories

3. Comprehensive Evaluation Function:

from sklearn.metrics import precision_score, recall_score

def evaluate_imbalanced_classifier(y_true, y_pred, y_pred_proba):
    metrics = {
        'auprc': calculate_auprc(y_true, y_pred_proba),
        'f1': calculate_f1(y_true, y_pred),
        'mcc': calculate_mcc(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred)
    }
    return metrics

4. Why Not Accuracy?

 Can be misleading with imbalanced data
 Example: 98% accuracy on a dataset with a 98% majority class (see the check below)
 Doesn't reflect the model's ability to identify the minority class
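A tiny numeric check of the 98% example above, using a classifier that simply predicts the majority class everywhere:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 98 negatives and 2 positives; predict the majority class for every sample
y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")                 # 0.98, looks great
print(f"Recall:   {recall_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00, misses every positive
print(f"F1-score: {f1_score(y_true, y_pred, zero_division=0):.2f}")      # 0.00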

5. Additional Considerations:

 Use stratified sampling in cross-validation
 Consider different probability thresholds
 Monitor both classes' performance separately
 Use the confusion matrix for detailed analysis

Example implementation with threshold optimization:

import numpy as np
from sklearn.metrics import f1_score

def optimize_threshold(y_true, y_pred_proba):
    # Sweep candidate thresholds and keep the one with the best F1-score
    thresholds = np.arange(0.1, 0.9, 0.1)
    best_f1 = 0
    best_threshold = 0.5

    for threshold in thresholds:
        y_pred = (y_pred_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred)

        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    return best_threshold

This comprehensive approach ensures proper evaluation of model performance on imbalanced datasets,
focusing on metrics that matter for the minority class while maintaining overall model quality.
