Ticket 2
Ülvi
1. Evaluation Metrics: You are building a machine learning model to classify defective parts
in a manufacturing process. The cost of missing a defective part is high. Which
evaluation metric would you prioritize, and why?
2. Regularization: You are working on a model for predicting customer churn, but you find
that some features are causing overfitting. How would you use Lasso regularization to
simplify the model while maintaining performance?
4. Ensemble Methods: You are working on a model for classifying images of plants.
Describe how you would use ensemble methods such as Random Forest or Gradient
Boosting to improve model performance.
5. Data Preprocessing: You have a text dataset containing product reviews in multiple
languages. How would you preprocess this data to train a sentiment analysis model?
Answer 1
When building a machine learning model to classify defective parts in a manufacturing process,
you need to prioritize metrics that address the high cost of missing defective parts (False
Negatives). The key metric to focus on is Recall = TP / (TP + FN), the fraction of truly defective
parts the model actually catches, supported by Precision, the F1-score, and the confusion matrix.
Practical Implementation
Here's how you can calculate these metrics in Python:
from sklearn.metrics import recall_score, precision_score, f1_score, confusion_matrix

# Calculate metrics (y_true = actual labels, y_pred = model predictions)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1-Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
Answer 2
Problem: Overfitting occurs when the model performs well on the training data but poorly on
the test data, often due to irrelevant or redundant features.
To address this, Lasso Regularization (L1 Regularization) can be used. Lasso not only reduces
overfitting by penalizing large coefficients but also performs feature selection by driving some
coefficients to exactly zero, effectively removing less important features from the model.
Lasso adds an L1 penalty to the usual loss: it minimizes RSS + λ * Σ|βj|, where λ (called alpha in
scikit-learn) controls the strength of the penalty and the βj are the model coefficients.
Before fitting, standardize the features so the penalty treats them on a common scale:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Lasso can be applied using a linear regression model (Lasso) or logistic regression for
classification (LogisticRegression with L1 penalty).
For regression:
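The grid_search object used below is not defined earlier in the answer; a minimal sketch of how it could be set up (the alpha grid is illustrative):
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Cross-validate over a small grid of regularization strengths
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0]}
grid_search = GridSearchCV(Lasso(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)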
# Best model
best_lasso = grid_search.best_estimator_
print("Best alpha:", grid_search.best_params_['alpha'])
For classification:
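The y_pred evaluated below comes from an L1-penalized classifier; a minimal sketch of fitting one (the liblinear solver is one of the solvers that supports the L1 penalty), together with the metric imports used afterwards:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# L1-penalized logistic regression for the churn classifier
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)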
# Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
5. Inspect Feature Importance
Lasso shrinks less important feature coefficients to zero. You can inspect which features are
retained:
import pandas as pd

# Feature importance
coefficients = best_lasso.coef_
feature_importance = pd.DataFrame({
    'Feature': feature_names,  # e.g. X_train.columns if X_train is a DataFrame
    'Coefficient': coefficients
})
# Keep only features with non-zero coefficients
important_features = feature_importance[feature_importance['Coefficient'] != 0]
print(important_features)
1. Choosing λ (alpha):
o Use cross-validation to find the optimal value.
o Larger λ simplifies the model but risks underfitting.
o Smaller λ retains more features but risks overfitting.
2. Impact on Features:
o Features with coefficients reduced to zero are effectively removed.
o Helps simplify the model by retaining only the most relevant predictors for
customer churn.
3. Model Monitoring:
o After applying Lasso, monitor the model’s performance to ensure it generalizes
well on unseen data.
o Regularly update the model if new patterns or features emerge in the data.
Answer 3
Objective: Partition the data into k clusters, where each customer belongs to the cluster
with the nearest centroid.
How it Works:
1. Initialize k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update the centroids based on the mean of assigned points.
4. Repeat steps 2–3 until the centroids stabilize or a stopping criterion is met.
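A minimal sketch of these steps with scikit-learn, assuming a scaled feature matrix X_scaled (as in the silhouette example below) and an illustrative k of 3:
from sklearn.cluster import KMeans

# fit_predict runs the assign/update loop until the centroids stabilize
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)   # cluster index for each customer
centroids = kmeans.cluster_centers_     # final centroid coordinates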
Finding the right number of clusters is crucial. Here are common methods:
A. Elbow Method
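The details of this method are not shown above; in brief, the idea is to plot the within-cluster sum of squares (inertia) against k and pick the k at the "elbow" where improvements level off. A minimal sketch, assuming the scaled matrix X_scaled:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()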
B. Silhouette Score
Measures how well-separated the clusters are. Higher scores indicate better-defined
clusters.
Calculate for different values of k:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    score = silhouette_score(X_scaled, kmeans.labels_)
    print(f'For k={k}, Silhouette Score = {score}')
C. Gap Statistic
Compares the within-cluster dispersion to what would be expected under a random reference
distribution; a larger gap indicates a better-defined clustering.
K-Means also has practical limitations to keep in mind:
1. Sensitivity to Initialization:
o Use the k-means++ initialization to improve performance.
o Example:
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
2. Requires Numeric Data:
o Encode categorical variables using techniques like one-hot encoding.
3. Assumes Spherical Clusters:
o If the clusters are not spherical, consider alternatives like Gaussian Mixture
Models or DBSCAN (see the sketch after this list).
4. Scaling:
o Always standardize or normalize features to prevent dominance by larger scale
variables.
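As mentioned in point 3, a minimal sketch of the two alternatives (the hyperparameters here are illustrative and would need tuning for real data):
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

# Gaussian Mixture Model: soft assignments, ellipsoidal clusters
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X_scaled)

# DBSCAN: density-based, no need to fix the number of clusters in advance
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)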
6. Example Output
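The original output is not reproduced here; a minimal sketch of how such a cluster profile could be produced, assuming the customer features live in a pandas DataFrame df aligned with the fitted kmeans labels:
import pandas as pd

# Average feature values per cluster give an interpretable segment profile
df['cluster'] = kmeans.labels_
print(df.groupby('cluster').mean())
print(df['cluster'].value_counts())  # segment sizes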
These insights can guide targeted marketing, personalized offers, and inventory management.
Conclusion
K-Means, together with proper feature scaling and a data-driven choice of k, gives an interpretable
way to segment customers, as long as the limitations above are kept in mind.
Answer 4
Using Ensemble Methods for Image Classification: Random Forest and Gradient
Boosting
Ensemble methods like Random Forest and Gradient Boosting are powerful techniques to
improve model performance by combining multiple weak learners into a strong learner. In the
context of classifying images of plants, ensemble methods can be used to enhance accuracy,
robustness, and generalization.
Convert raw images into numerical features using one of the following:
o Manual Feature Extraction: Use texture, color histograms, or shape descriptors.
o Deep Features: Use a pre-trained convolutional neural network (e.g., ResNet,
VGG) to extract features.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
import numpy as np

# Load a pre-trained model for feature extraction (global average pooling gives one vector per image)
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Preprocess the images; image_array is assumed to be a batch of shape (n_images, 224, 224, 3)
img = preprocess_input(image_array)
features = model.predict(img)  # Extract deep features, shape (n_images, 2048)
Normalize Features: Scale the extracted features for better performance in ensemble
models:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
Random Forest builds multiple decision trees and combines their outputs (majority voting for
classification).
Advantages: robust to overfitting and noise, handles high-dimensional feature vectors well, and
needs relatively little hyperparameter tuning (see the comparison further below).
Implementation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the Random Forest on the extracted image features
rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
# Evaluate model
y_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
Tuning Parameters: n_estimators (number of trees), max_depth, max_features, and min_samples_split.
Gradient Boosting improves model performance by sequentially adding weak learners (trees) and
correcting errors made by previous models.
Advantages: often reaches higher accuracy and captures complex patterns, at the cost of more
careful hyperparameter tuning (see the comparison further below).
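A minimal sketch of training the gb_model evaluated below, reusing the X_train, y_train split from the Random Forest section (hyperparameters are illustrative):
from sklearn.ensemble import GradientBoostingClassifier

# Trees are fitted sequentially, each one correcting the errors of the previous ones
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)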
# Evaluate model
y_pred = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred))
Tuning Parameters: learning_rate, n_estimators, max_depth, and subsample.
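The answer also evaluates XGBoost, an optimized gradient boosting library; a minimal sketch of training the xgb_model it refers to (hyperparameters are illustrative):
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)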
y_pred = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))
Use metrics like accuracy, precision, recall, and F1-score to evaluate performance.
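For a per-class breakdown of these metrics, scikit-learn's classification_report can be used (with y_test and y_pred from the blocks above):
from sklearn.metrics import classification_report

# Precision, recall and F1 for each plant class
print(classification_report(y_test, y_pred))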
You can combine multiple ensemble methods to leverage their strengths (stacking or blending):
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from xgboost import XGBClassifier

# Base learners
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]
# Meta-learner combines the base models' predictions
stack_model = StackingClassifier(estimators=estimators, final_estimator=XGBClassifier())
stack_model.fit(X_train, y_train)
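The stacked model can then be evaluated the same way as the individual models, using the X_test, y_test split from above:
from sklearn.metrics import accuracy_score

stack_pred = stack_model.predict(X_test)
print("Stacking Accuracy:", accuracy_score(y_test, stack_pred))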
Random Forest:
o Handles high-dimensional data well.
o Robust to overfitting and noise.
o Less sensitive to hyperparameter tuning.
Gradient Boosting:
o Delivers higher accuracy in many cases.
o Requires careful tuning of hyperparameters.
o Better suited for complex patterns in data.
5. Conclusion
By leveraging ensemble methods, you can build a robust and accurate classifier for plant image
recognition.
Answer 5
When dealing with a multilingual text dataset for sentiment analysis, preprocessing is essential to
ensure the text is clean, consistent, and suitable for the model. Below is a detailed step-by-step
guide:
Remove common words that do not contribute to sentiment (e.g., "and", "the").
Use language-specific stopword lists:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # adjust per language
# 'tokens' is assumed to come from an earlier tokenization step
data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
Step 5: Lemmatization
Reduce each word to its dictionary base form (e.g. "running" → "run") so that inflected variants
are treated as a single token.
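The original example for this step is not included above; a minimal sketch using NLTK's WordNetLemmatizer (English only; other languages would need language-specific lemmatizers such as spaCy's models):
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
data['tokens'] = data['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])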
To handle the multiple languages there are two main options. If the model will support only one
language, translate all reviews into it using an API like Google Translate or a translation model
from Hugging Face Transformers:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-es-en'  # Spanish to English
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

data['translated_review'] = data['review'].apply(translate)
Instead of translating, use models that support multiple languages natively (e.g.,
mBERT, XLM-R).
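A minimal sketch of this option with the Hugging Face pipeline API; the checkpoint name below is an assumption, and any multilingual sentiment model could be substituted:
from transformers import pipeline

# Multilingual sentiment model (assumed checkpoint; replace as needed)
classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
print(classifier("Este producto es excelente"))  # handles multiple languages without translation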
To feed the text into a machine learning model, convert it into numerical format.
A. Bag-of-Words (BoW)
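A minimal sketch of the Bag-of-Words representation with scikit-learn, assuming the cleaned (or translated) review text is in data['review']:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)   # keep the 5,000 most frequent terms
X_bow = vectorizer.fit_transform(data['review'])  # sparse document-term matrix
print(X_bow.shape)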