Assign 3

This document discusses and compares several model evaluation techniques: leave-one-out cross-validation, K-fold cross-validation, the holdout method, time series split, and shuffle split. For each technique, it provides examples of advantages and disadvantages, along with sample Python code demonstrating implementation.

Question 1(b)

What are the advantages and disadvantages of using a specific splitting criterion for model evaluation? Provide examples for each case (with code) to illustrate your points.

Leave-One-Out Cross-Validation

Advantages:
- Makes use of nearly all the data in each fold, so each model is trained on the maximum possible amount of data.
- Useful when the dataset is small.

Disadvantages:
- Training the model N times for N data points can be computationally expensive, especially for large datasets.
- Can have high variance in model performance estimates, especially for noisy datasets.

Code:
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = [[1], [2], [3], [4]]
y = [0, 1, 1, 0]

# Leave-One-Out Cross-Validation: each iteration holds out a single sample for testing
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]

    # Train and evaluate the model on this split
    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy}")
K-Fold Cross-Validation

Advantages:
- More computationally efficient than LOOCV, as it involves fewer model training iterations.
- Provides a more stable, lower-variance estimate of model performance than LOOCV.
- Offers flexibility in controlling the bias-variance tradeoff of the estimate by adjusting the number of folds (K).

Disadvantages:
- Still computationally demanding for large datasets and complex models.
- Can be sensitive to the choice of K (illustrated in the sketch after the code below).

Code:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = [[1], [2], [3], [4]]
y = [0, 1, 1, 0]

# K-Fold Cross-Validation with K = 2 folds
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]

    # Train and evaluate the model on this fold
    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy}")

Holdout Method

Advantages:
- Simple to implement and computationally efficient.
- Useful for large datasets where computationally expensive methods are impractical.

Disadvantages:
- Can have high variance in performance estimates, as the result depends on the specific data split (see the sketch after the code below).
- Might not fully utilize all available data for training.

Code:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = [[1], [2], [3], [4]]
y = [0, 1, 1, 0]

# Holdout Method: a single train/test split
# stratify=y keeps both classes in the tiny training split so LogisticRegression can fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42, stratify=y)

# Train and evaluate the model
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Time Series Split

Advantages:
- Preserves temporal order in time-dependent data, ensuring training data precedes testing data.
- Avoids "data leakage" from future time points into model training.

Disadvantages:
- Not suitable for non-temporal data.
- Might not capture long-term patterns or trends if the validation set is too short.

The given dataset has no temporal ordering, so a time series split cannot be applied to it.
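
For illustration only, the following minimal sketch applies TimeSeriesSplit to a small hypothetical time-ordered dataset (the feature values and labels below are invented for demonstration, not taken from the assignment data):

from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical time-ordered data (invented for illustration): one feature per time step
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 1, 0, 1, 1, 0, 1, 1]

# Each split trains only on earlier time steps and tests on later ones
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]

    model = LogisticRegression()
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Train {list(train_index)} -> Test {list(test_index)}: accuracy = {accuracy}")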

Shuffle Split

Advantages:
- Introduces randomness, ensuring diverse training and testing sets across repeated splits.
- Beneficial when the dataset is small, since repeated random splits make the most of the available samples.

Disadvantages:
- Can disrupt patterns or relationships in the data if shuffling is not appropriate (for example, for time-ordered data).

Code:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = [[1], [2], [3], [4]]
y = [0, 1, 1, 0]

# Shuffle Split: repeated random train/test splits
shuffle_split = ShuffleSplit(n_splits=2, test_size=0.5, random_state=42)

for train_index, test_index in shuffle_split.split(X):
    X_train, X_test = [X[i] for i in train_index], [X[i] for i in test_index]
    y_train, y_test = [y[i] for i in train_index], [y[i] for i in test_index]

    # With only 4 samples a random split may put a single class in the training set;
    # skip such splits since LogisticRegression needs at least two classes to fit
    if len(set(y_train)) < 2:
        continue

    # Train and evaluate the model on this split
    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy}")
