
Machine Learning

MBA (Artificial Intelligence) E27-24

Ülvi

Ticket 2
1. Evaluation Metrics: You are building a machine learning model to classify defective parts
in a manufacturing process. The cost of missing a defective part is high. Which
evaluation metric would you prioritize, and why?

2. Regularization: You are working on a model for predicting customer churn, but you find
that some features are causing overfitting. How would you use Lasso regularization to
simplify the model while maintaining performance?

3. Unsupervised Learning: A retail store wants to identify customer segments based on
shopping patterns. Explain how you would use the k-means clustering algorithm for this
task, and how you would determine the optimal number of clusters.

4. Ensemble Methods: You are working on a model for classifying images of plants.
Describe how you would use ensemble methods such as Random Forest or Gradient
Boosting to improve model performance.

5. Data Preprocessing: You have a text dataset containing product reviews in multiple
languages. How would you preprocess this data to train a sentiment analysis model?
Answer 1
When building a machine learning model to classify defective parts in a manufacturing process,
you need to prioritize metrics that address the high cost of missing defective parts (False
Negatives). The key metric to focus on is Recall, along with other supporting metrics.

Key Metrics to Use

1. Recall (Sensitivity or True Positive Rate)
Recall measures the proportion of defective parts correctly identified:

Recall = True Positives / (True Positives + False Negatives)

Why prioritize Recall?

o High Recall ensures most defective parts are detected.
o Reduces False Negatives, which is critical when missing defective parts has significant consequences.

2. Precision
Precision measures the proportion of predicted defective parts that are actually defective:

Precision = True Positives / (True Positives + False Positives)

Why not focus only on Precision?

o While Precision is important, the cost of False Negatives is higher in this scenario. A balance with Recall is necessary.

3. F1-Score
F1-Score is the harmonic mean of Precision and Recall, offering a balance between the two:

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Why use F1-Score?

o It helps balance Precision and Recall, ensuring neither metric is neglected.

4. Confusion Matrix
The confusion matrix provides a detailed breakdown of predictions:

o True Positives (TP): Correctly identified defective parts.
o True Negatives (TN): Correctly identified non-defective parts.
o False Positives (FP): Non-defective parts marked as defective.
o False Negatives (FN): Defective parts missed by the model.

Practical Implementation
Here's how you can calculate these metrics in Python:

from sklearn.metrics import recall_score, precision_score, f1_score, confusion_matrix

# Ground truth and predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # Actual labels
y_pred = [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]  # Model predictions

# Calculate metrics
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)

print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1-Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

Strategies for Prioritizing Recall

1. Adjust Classification Threshold
Lower the threshold to classify more parts as defective (a precision-recall curve sketch for choosing the threshold follows this list):

y_pred_prob = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # Adjust threshold
y_pred = (y_pred_prob >= threshold).astype(int)

2. Class Weights
Assign higher weights to the defective class to penalize False Negatives more (the dictionary keys must match the actual class labels):

model = RandomForestClassifier(class_weight={'defective': 10, 'non-defective': 1})

3. Resampling Techniques
Handle class imbalance by oversampling the minority class:

from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
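To pick the threshold in a more principled way than the fixed 0.3 above, the precision-recall curve can be scanned for the highest threshold that still meets a target recall. A minimal sketch, assuming a fitted classifier model with predict_proba and a held-out test set X_test, y_test (placeholder names carried over from the earlier snippets):

from sklearn.metrics import precision_recall_curve
import numpy as np

# Probabilities for the defective (positive) class
y_scores = model.predict_proba(X_test)[:, 1]

# Precision and recall for every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# Highest threshold that still achieves the target recall (e.g., 95%)
target_recall = 0.95
eligible = np.where(recall[:-1] >= target_recall)[0]
chosen_threshold = thresholds[eligible[-1]] if len(eligible) > 0 else 0.5
print("Chosen threshold:", chosen_threshold)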

Why Prioritize Recall?

 False Negatives (Missed Defective Parts):


Missing defective parts can lead to:
o Product failures.
o Customer dissatisfaction.
o Safety risks.
 False Positives (Non-Defective Parts Flagged as Defective):
While inconvenient, False Positives have a lower cost since parts can be rechecked
manually.
Conclusion

For defective part classification:

1. Prioritize Recall to minimize False Negatives.


2. Use F1-Score and the Confusion Matrix for a balanced evaluation.
3. Adjust thresholds and use class weights to align the model with business objectives.

Answer 2
Problem: Overfitting occurs when the model performs well on the training data but poorly on
the test data, often due to irrelevant or redundant features.

To address this, Lasso Regularization (L1 Regularization) can be used. Lasso not only reduces
overfitting by penalizing large coefficients but also performs feature selection by driving some
coefficients to exactly zero, effectively removing less important features from the model.

Steps to Apply Lasso Regularization

1. Understand Lasso Regularization

Lasso adds a penalty term to the loss function of the model:

Loss = (Prediction Error) + λ * Σ|βj|

Where:

 βj are the coefficients of the features.


 λ is the regularization parameter:
o Larger λ values increase the penalty, leading to more coefficients shrinking to
zero.
o Smaller λ values reduce the penalty, retaining more features.

2. Preprocess the Data

Lasso is sensitive to feature scaling, so it's essential to normalize the features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

3. Train the Model with Lasso Regularization

Lasso can be applied using a linear regression model (Lasso) or logistic regression for
classification (LogisticRegression with L1 penalty).

For regression:

from sklearn.linear_model import Lasso


from sklearn.model_selection import GridSearchCV

# Define the Lasso model


lasso = Lasso()

# Use Grid Search to find the optimal λ (alpha in sklearn)


param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]} # Regularization strength
grid_search = GridSearchCV(lasso, param_grid, cv=5,
scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)

# Best model
best_lasso = grid_search.best_estimator_
print("Best alpha:", grid_search.best_params_['alpha'])

# Train the model


best_lasso.fit(X_train_scaled, y_train)

For classification:

from sklearn.linear_model import LogisticRegression

lasso_logistic = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)  # C is 1/λ
lasso_logistic.fit(X_train_scaled, y_train)

4. Evaluate the Model

Evaluate the model on the test set to ensure performance is maintained. Accuracy, precision, and recall apply to the L1-penalized logistic regression; for the Lasso regressor, use regression metrics such as mean squared error instead:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predict on the test set with the L1 logistic regression model
y_pred = lasso_logistic.predict(X_test_scaled)

# Metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
5. Inspect Feature Importance

Lasso shrinks less important feature coefficients to zero. You can inspect which features are
retained:

import pandas as pd

# Feature importance (for lasso_logistic, use lasso_logistic.coef_[0] instead)
coefficients = best_lasso.coef_
feature_importance = pd.DataFrame({
    'Feature': feature_names,  # feature_names: list of the training feature column names
    'Coefficient': coefficients
})

# Filter non-zero coefficients
important_features = feature_importance[feature_importance['Coefficient'] != 0]
print(important_features)

Tuning and Interpretation

1. Choosing λ (alpha):
o Use cross-validation to find the optimal value (see the LassoCV sketch after this list).
o Larger λ simplifies the model but risks underfitting.
o Smaller λ retains more features but risks overfitting.
2. Impact on Features:
o Features with coefficients reduced to zero are effectively removed.
o Helps simplify the model by retaining only the most relevant predictors for
customer churn.
3. Model Monitoring:
o After applying Lasso, monitor the model’s performance to ensure it generalizes
well on unseen data.
o Regularly update the model if new patterns or features emerge in the data.
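As a complement to the grid search shown earlier, scikit-learn's LassoCV performs the cross-validated search for alpha in a single step. A minimal sketch, assuming the scaled training data X_train_scaled and target y_train from the steps above:

from sklearn.linear_model import LassoCV

# 5-fold cross-validation over a small grid of regularization strengths
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10], cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)

print("Selected alpha:", lasso_cv.alpha_)
print("Non-zero coefficients:", (lasso_cv.coef_ != 0).sum())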

Why Use Lasso for Customer Churn?

 Feature Selection: Automatically eliminates less relevant features, simplifying the model.
 Overfitting Prevention: Penalizes complex models, ensuring better generalization to
unseen data.
 Interpretability: Retains only the most critical predictors of churn, making the model
easier to interpret.
Answer 3
The retail store can use K-Means Clustering, an unsupervised learning algorithm, to segment
customers based on shopping patterns. This technique groups customers with similar behaviors
into clusters, allowing the store to target each group more effectively.

1. Understanding K-Means Clustering

 Objective: Partition the data into k clusters, where each customer belongs to the cluster
with the nearest centroid.
 How it Works:
1. Initialize k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update the centroids based on the mean of assigned points.
4. Repeat steps 2–3 until the centroids stabilize or a stopping criterion is met.

2. Steps to Use K-Means for Customer Segmentation

Step 1: Collect and Preprocess Data

 Input Data: Examples of customer data could include:


o Average transaction amount.
o Frequency of visits.
o Product categories purchased.
o Time spent shopping.
 Data Preprocessing:
o Handle missing values by imputing or removing incomplete records (a short imputation sketch follows this step).
o Normalize the data to ensure all features are on the same scale:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
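For the missing-value handling mentioned above, a minimal sketch using scikit-learn's SimpleImputer, assuming data is a pandas DataFrame of numeric shopping-pattern features:

import pandas as pd
from sklearn.impute import SimpleImputer

# Replace missing numeric values with the column median
imputer = SimpleImputer(strategy='median')
X = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)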

Step 2: Apply K-Means Clustering

 Fit the K-Means algorithm to the data:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)  # Start with 3 clusters as an example
kmeans.fit(X_scaled)
labels = kmeans.labels_  # Cluster assignments for each customer

 Add cluster labels to the dataset for interpretation:

import pandas as pd

data['Cluster'] = labels

Step 3: Evaluate and Interpret Results

 Analyze the characteristics of each cluster:

cluster_summary = data.groupby('Cluster').mean()
print(cluster_summary)

 Visualize clusters (if data is 2D or 3D):

import matplotlib.pyplot as plt

plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

3. Determining the Optimal Number of Clusters (k)

Finding the right number of clusters is crucial. Here are common methods:

A. Elbow Method

 Plot the Sum of Squared Distances (SSD) for different values of k.
 Choose the k where the SSD curve bends (elbow point):

ssd = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    ssd.append(kmeans.inertia_)  # Sum of squared distances to the nearest centroid

plt.plot(range(1, 11), ssd, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('SSD')
plt.show()

B. Silhouette Score

 Measures how well-separated the clusters are. Higher scores indicate better-defined
clusters.
 Calculate for different values of k:

from sklearn.metrics import silhouette_score

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    score = silhouette_score(X_scaled, kmeans.labels_)
    print(f'For k={k}, Silhouette Score = {score}')

C. Gap Statistic

 Compares the total within-cluster variation to that of a null reference distribution.
 This method is more robust but requires additional computation (a simplified sketch follows).
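A simplified gap-statistic sketch, under the assumption that the reference datasets are drawn uniformly from the bounding box of X_scaled; production use typically adds more reference samples and the standard-error rule for picking k:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, random_state=42):
    rng = np.random.default_rng(random_state)
    # Within-cluster dispersion on the real data
    log_wk = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_)
    # Within-cluster dispersion on uniform reference data drawn from the bounding box of X
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_log_wk = []
    for _ in range(n_refs):
        X_ref = rng.uniform(mins, maxs, size=X.shape)
        ref_log_wk.append(np.log(KMeans(n_clusters=k, random_state=random_state).fit(X_ref).inertia_))
    return np.mean(ref_log_wk) - log_wk  # Larger gap suggests better clustering structure

for k in range(2, 7):
    print(f"k={k}, gap={gap_statistic(X_scaled, k):.3f}")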

4. Advantages of K-Means for Customer Segmentation

1. Simplicity: Easy to implement and computationally efficient.


2. Interpretability: Clusters provide actionable insights for marketing strategies.
3. Scalability: Suitable for large datasets.

5. Limitations and Considerations

1. Sensitivity to Initialization:
o Use the k-means++ initialization to improve performance. Example:

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)

2. Requires Numeric Data:
o Encode categorical variables using techniques like one-hot encoding.
3. Assumes Spherical Clusters:
o If the clusters are not spherical, consider alternatives like Gaussian Mixture Models or DBSCAN (a short sketch of both points follows this list).
4. Scaling:
o Always standardize or normalize features to prevent dominance by larger-scale variables.
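A minimal sketch of the alternatives mentioned in points 2 and 3, assuming data holds a categorical column (hypothetically named 'favorite_category' here) and X_scaled is the scaled feature matrix from the earlier steps:

import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

# Point 2: one-hot encode a categorical column before clustering
data_encoded = pd.get_dummies(data, columns=['favorite_category'])  # hypothetical column name

# Point 3a: Gaussian Mixture Models allow elliptical clusters
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)

# Point 3b: DBSCAN finds arbitrarily shaped clusters and labels outliers as -1
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)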

6. Example Output

After applying K-Means, you might identify clusters such as:

 Cluster 1: High-spending customers with frequent purchases.


 Cluster 2: Budget-conscious customers with infrequent purchases.
 Cluster 3: Loyal customers who shop moderately but regularly.

These insights can guide targeted marketing, personalized offers, and inventory management.
Conclusion

To segment customers based on shopping patterns:

1. Preprocess and scale the data.


2. Use K-Means to group customers into meaningful clusters.
3. Determine the optimal number of clusters using methods like the Elbow Method or
Silhouette Score.
4. Interpret and act on the results to enhance customer experience and business strategies.

Answer 4

Using Ensemble Methods for Image Classification: Random Forest and Gradient
Boosting

Ensemble methods like Random Forest and Gradient Boosting are powerful techniques to
improve model performance by combining multiple weak learners into a strong learner. In the
context of classifying images of plants, ensemble methods can be used to enhance accuracy,
robustness, and generalization.

1. Why Use Ensemble Methods for Image Classification?

 Diversity: Combine multiple weak models to reduce bias and variance.


 Robustness: Handle noise and overfitting better than individual models.
 Improved Accuracy: Leverage the strengths of multiple learners.

2. Steps to Use Ensemble Methods


Step 1: Preprocess Image Data

 Convert raw images into numerical features using one of the following:
o Manual Feature Extraction: Use texture, color histograms, or shape descriptors (a color-histogram sketch follows this step).
o Deep Features: Use a pre-trained convolutional neural network (e.g., ResNet, VGG) to extract features.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
import numpy as np

# Load a pre-trained model for feature extraction
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Preprocess the images (image_array: batch of images with shape (n_images, 224, 224, 3))
img = preprocess_input(image_array)
features = model.predict(img)  # Extract deep features, one vector per image

 Normalize Features: Scale the extracted features for better performance in ensemble models:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
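For the manual feature-extraction route mentioned above, a minimal color-histogram sketch, assuming image_array is a NumPy batch of RGB images with pixel values in the 0-255 range:

import numpy as np

def color_histogram_features(images, bins=16):
    # One normalized histogram per RGB channel, concatenated into a feature vector
    feats = []
    for img in images:
        channel_hists = [
            np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
            for c in range(3)
        ]
        feats.append(np.concatenate(channel_hists))
    return np.array(feats)

hist_features = color_histogram_features(image_array)  # shape: (n_images, 3 * bins)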

Step 2: Random Forest for Classification

Random Forest builds multiple decision trees and combines their outputs (majority voting for
classification).

Advantages:

 Handles high-dimensional features well.


 Robust to overfitting due to bagging (bootstrap aggregation).

Implementation:

from sklearn.ensemble import RandomForestClassifier


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.2, random_state=42)

# Train Random Forest


rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate model
y_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))

Tuning Parameters:

 n_estimators: Number of trees.
 max_depth: Maximum depth of each tree.
 max_features: Number of features considered for each split.
 Use Grid Search for hyperparameter tuning:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

Step 3: Gradient Boosting for Classification

Gradient Boosting improves model performance by sequentially adding weak learners (trees) and
correcting errors made by previous models.

Advantages:

 Focuses on minimizing prediction errors iteratively.


 Provides superior accuracy in many scenarios.

Implementation with Gradient Boosting:

from sklearn.ensemble import GradientBoostingClassifier

# Train Gradient Boosting


gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)

# Evaluate model
y_pred = gb_model.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred))

Tuning Parameters:

 n_estimators: Number of trees.


 learning_rate: Step size for each iteration.
 max_depth: Depth of each tree.

Using XGBoost for Faster Performance: XGBoost is an optimized implementation of Gradient Boosting:

from xgboost import XGBClassifier


xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
random_state=42)
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))

Step 4: Model Evaluation

 Use metrics like accuracy, precision, recall, and F1-score to evaluate performance.

from sklearn.metrics import classification_report, confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


print("Classification Report:\n", classification_report(y_test, y_pred))

Step 5: Combining Ensemble Methods

You can combine multiple ensemble methods to leverage their strengths (stacking or blending):

from sklearn.ensemble import StackingClassifier

estimators = [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]

stack_model = StackingClassifier(estimators=estimators,
final_estimator=XGBClassifier())
stack_model.fit(X_train, y_train)

# Evaluate stacked model


y_pred = stack_model.predict(X_test)
print("Stacking Accuracy:", accuracy_score(y_test, y_pred))

3. When to Use Random Forest or Gradient Boosting

 Random Forest:
o Handles high-dimensional data well.
o Robust to overfitting and noise.
o Less sensitive to hyperparameter tuning.

 Gradient Boosting:
o Delivers higher accuracy in many cases.
o Requires careful tuning of hyperparameters.
o Better suited for complex patterns in data.
4. Conclusion

To classify images of plants:

1. Extract and preprocess features from images.


2. Train models using Random Forest or Gradient Boosting.
3. Use hyperparameter tuning and evaluation metrics to optimize performance.
4. Combine models using stacking for further performance gains.

By leveraging ensemble methods, you can build a robust and accurate classifier for plant image
recognition.

Answer 5
When dealing with a multilingual text dataset for sentiment analysis, preprocessing is essential to
ensure the text is clean, consistent, and suitable for the model. Below is a detailed step-by-step
guide:

1. Understand the Dataset

 Inspect the data: Examine the structure, languages, and labels.


 Common challenges:
o Multiple languages.
o Variations in text quality (e.g., typos, special characters).
o Imbalanced classes in sentiment labels (e.g., positive, negative, neutral).

2. Steps for Preprocessing

Step 1: Handle Missing Data

 Check for missing or empty reviews and remove them:

import pandas as pd

# Remove rows with missing reviews
data = data.dropna(subset=['review'])

Step 2: Language Detection

 Use a library like langdetect or langid to identify the language of each review:

from langdetect import detect

data['language'] = data['review'].apply(lambda x: detect(x))

 If your model will only support specific languages, filter out unsupported ones:

supported_languages = ['en', 'es', 'fr']
data = data[data['language'].isin(supported_languages)]

Step 3: Text Normalization

 Lowercasing: Convert text to lowercase to reduce variability.

data['review'] = data['review'].str.lower()

 Remove special characters, punctuation, and numbers (note that this pattern also strips accented characters, so adjust it for non-English languages):

import re

data['review'] = data['review'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

 Tokenization: Split text into individual words.

from nltk.tokenize import word_tokenize

# Requires the NLTK 'punkt' tokenizer data: nltk.download('punkt')
data['tokens'] = data['review'].apply(lambda x: word_tokenize(x))

Step 4: Stopword Removal

 Remove common words that do not contribute to sentiment (e.g., "and", "the").
 Use language-specific stopword lists:

from nltk.corpus import stopwords

# Requires the NLTK stopword lists: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # Adjust for each language
data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

Step 5: Lemmatization

 Reduce words to their base form to normalize variations.
 Use language-specific lemmatizers (the WordNetLemmatizer below covers English only; use tools such as spaCy for other languages):

from nltk.stem import WordNetLemmatizer

# Requires the NLTK WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
data['tokens'] = data['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

Step 6: Convert Tokens Back to Text

 Reassemble processed tokens into cleaned text for model training:

data['processed_review'] = data['tokens'].apply(lambda x: ' '.join(x))

3. Handling Multilingual Data


A. Translate to a Single Language

 If supporting one language, translate all reviews using an API like Google Translate or Hugging Face Transformers:

from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-es-en'  # Spanish to English
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

data['translated_review'] = data['review'].apply(translate)

B. Use Multilingual Embeddings

 Instead of translating, use models that support multiple languages natively (e.g., mBERT, XLM-R); a short sketch follows.
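A minimal sketch of the multilingual-embedding route using the XLM-RoBERTa base model from Hugging Face Transformers; the mean-pooling and one-review-at-a-time processing are simplifications:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModel.from_pretrained('xlm-roberta-base')

def embed(text):
    # Mean-pooled token embeddings as a fixed-size sentence vector
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

data['embedding'] = data['review'].apply(embed)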

4. Vectorize the Text

To feed the text into a machine learning model, convert it into numerical format.

A. Bag-of-Words (BoW)

 Represent text as a vector of word frequencies.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['processed_review'])

B. Term Frequency-Inverse Document Frequency (TF-IDF)

 Weigh words by importance across documents.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(data['processed_review'])

C. Pretrained Word Embeddings

 Use pretrained embeddings like GloVe, FastText, or Word2Vec.
 Example with GloVe:

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")
data['embedding'] = data['tokens'].apply(lambda x: [glove[word] for word in x if word in glove])

D. Deep Learning Embeddings

 Use embeddings from transformer models like BERT or RoBERTa:

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

data['bert_embedding'] = data['processed_review'].apply(get_bert_embeddings)

5. Train and Evaluate the Model

 Split the data into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 Train a model (e.g., Logistic Regression, SVM, or a deep learning model):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

 Evaluate the model:

from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

6. Summary of Preprocessing Steps

1. Handle missing data.


2. Detect and filter languages if necessary.
3. Normalize text (lowercase, remove punctuation, tokenize).
4. Remove stopwords and lemmatize words.
5. Translate text or use multilingual models for consistency.
6. Convert text into numerical features using methods like TF-IDF or embeddings.
