Classification
# Explanation:
# - pandas, numpy: Data handling
# - matplotlib, seaborn: Visualization
# - sklearn: Machine Learning tools
# - xgboost: Gradient-boosted decision tree library
# Explanation:
# - Read the dataset into a DataFrame.
# - Inspect the first few rows to understand the structure.
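# Sketch of the load-and-inspect step described above. The dataset path is
# not shown in this script, so the example reads a small in-memory CSV
# instead. The column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file (assumed columns and values).
csv_text = io.StringIO(
    "age,marital,duration,y\n"
    "30,married,120,no\n"
    "45,single,610,yes\n"
    "52,divorced,95,no\n"
)

# With a real dataset, pass the file path to pd.read_csv instead.
df_demo = pd.read_csv(csv_text)

print(df_demo.head())    # first rows: a quick look at the structure
print(df_demo.dtypes)    # shows which columns are numeric and which are text
print(df_demo.shape)     # (rows, columns)
```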
# Encode every categorical column and keep the fitted encoder for each one.
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le
# Explanation:
# - LabelEncoder maps each text category to an integer (e.g., 'married' -> 1).
# - Codes follow the sorted order of the category names, not any real ranking.
# - We store each fitted encoder so the codes can be inverse-transformed later.
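# A minimal sketch of the encode / inverse-transform round trip mentioned
# above. The category values are assumed for illustration; the real columns
# may hold different labels.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed example values for one categorical column.
marital = ["married", "single", "divorced", "married"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(marital)

# classes_ holds the categories in sorted order; the index is the code.
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original text labels from the codes.
restored = le_demo.inverse_transform(codes)
print(list(restored))
```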
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
# Explanation:
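# - StandardScaler rescales each feature to mean 0 and standard deviation 1.
# - Helps models that are sensitive to feature scale (e.g., Logistic Regression).
# - Note: fitting the scaler on all rows before the train/test split lets
#   test-set statistics leak into training; fitting on training rows only avoids this.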
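# A small sketch of what standardization does, on made-up numbers. Unlike the
# snippet above, it fits the scaler on the training rows only and reuses those
# statistics for the test rows. That ordering is a suggested variation, not
# what this script does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values (two numeric columns).
X_train_demo = np.array([[20.0, 100.0], [30.0, 200.0], [40.0, 300.0]])
X_test_demo = np.array([[50.0, 400.0]])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learn mean/std here
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuse them here

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)                # scaled with the training mean and std
```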
# Explanation:
# - 80% data for training, 20% for testing.
# - stratify=y ensures the same proportion of classes in train and test sets.
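# Sketch of an 80/20 stratified split on synthetic labels, to show that the
# class proportion is preserved. The data and the random_state value are
# assumptions; the split call itself is not shown in this script.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 20 positives and 80 negatives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Both parts keep the same share of positives as the full data.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```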
# Explanation:
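# - Logistic Regression: Simple, interpretable linear baseline.
# - Random Forest: Ensemble of decision trees trained on bootstrap samples.
# - XGBoost: Gradient boosting; builds trees sequentially, each correcting the
#   errors of the previous ones. Often strong on tabular data.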
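# Sketch of a train-and-compare loop over several models. The data is
# synthetic, the hyperparameters are assumptions, and scikit-learn's
# GradientBoostingClassifier stands in for XGBoost so the example depends on
# scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the sketch runs on its own.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

models_demo = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBoost: scikit-learn's own gradient boosting.
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model on the training part and score it on the held-out part.
scores_demo = {}
for name_demo, model in models_demo.items():
    model.fit(X_tr, y_tr)
    y_pred_demo = model.predict(X_te)
    scores_demo[name_demo] = accuracy_score(y_te, y_pred_demo)

for name_demo, acc in scores_demo.items():
    print(f"{name_demo}: accuracy = {acc:.3f}")
```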
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'{name} - Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Explanation:
# - classification_report shows precision, recall, f1-score, and support per class.
# - confusion_matrix counts actual vs predicted classes (rows = actual,
#   columns = predicted); the heatmap visualizes those counts.
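# A worked sketch of how precision, recall, and F1 follow from the confusion
# matrix counts, checked against scikit-learn's own functions. The labels are
# made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up binary labels (1 = positive class).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # binary layout: [[TN, FP], [FN, TP]]

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(cm_demo)
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```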
"""
1. What is the difference between Logistic Regression and Linear Regression?
2. Why do we need to scale features before training certain models?
3. What is Stratified Sampling? Why do we use it in classification?
4. What are Precision, Recall, and F1-score?
5. What is the importance of a Confusion Matrix?
6. What is Overfitting and how can you prevent it?
7. Why would you choose Random Forest over a simple Decision Tree?
8. What is Gradient Boosting? How is it different from Random Forest?
9. How does XGBoost improve model performance?
10. How would you handle an imbalanced dataset?
11. What metrics would you monitor for a classification model?
12. Explain why feature encoding is needed.
13. What is Label Encoding vs One Hot Encoding?
14. Why would longer call duration affect subscription likelihood?
15. How would you improve the performance of this classification model?
"""