Assignment - Data Science
Assignment: Introduction to Data Science Concepts
Part 1: Multiple Choice Questions
Question 1: Which of the following is a type of supervised learning algorithm?
A) k-Means Clustering
B) Logistic Regression
C) Principal Component Analysis (PCA)
D) DBSCAN
Answer: ??
Question 2: Which metric is commonly used to evaluate classification models?
A) Mean Squared Error (MSE)
B) Adjusted R-squared
C) Precision
D) Silhouette Score
Answer: ??
Question 3: What is the purpose of the train_test_split function in scikit-learn?
A) To split data into k folds for cross-validation
B) To standardize the features of a dataset
C) To split the dataset into training and testing sets
D) To perform hyperparameter tuning
Answer: ??
Question 4: Which of the following techniques can be used to handle missing values in a dataset?
A) Dropping rows with missing values
B) Imputing missing values with the mean or median
C) Using algorithms that can handle missing values
D) All of the above
Answer: ??
Question 5: In the context of machine learning, what is overfitting?
A) When a model performs well on the training data but poorly on new, unseen data
B) When a model performs equally well on both training and testing data
C) When a model has too few parameters
D) When a model uses cross-validation for evaluation
Answer: ??
Part 2: Coding Tasks
Task 1: Data Preprocessing and Exploration
Load the Iris dataset, perform basic data preprocessing, and conduct exploratory data
analysis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
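# Load the Iris dataset (assumes iris.csv is in the working directory and
# contains the 'Species' and measurement columns used in the tasks below)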
data = pd.read_csv('iris.csv')
# Display the first few rows of the dataset
print(data.head())
# Check for missing values
print(data.isnull().sum())
# Perform basic statistical analysis
print(data.describe())
# Visualize the pairwise relationships between features
sns.pairplot(data, hue='Species')
plt.show()
Question: How many missing values are there in the Iris dataset?
A) 0
B) 5
C) 10
D) 20
Answer: ??
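If iris.csv is not available locally, the same checks can be run against the copy of Iris bundled with scikit-learn. The following is a minimal sketch under that assumption: the column renaming is chosen to match the names used in this assignment, and the species labels produced here ('setosa', 'versicolor', 'virginica') may be spelled differently from those in your CSV file. It also reduces the per-column missing-value counts to a single total.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of Iris as a pandas DataFrame
iris = load_iris(as_frame=True)

# Rename the columns to the names this assignment uses
# (assumed to match the columns of iris.csv)
data = iris.frame.rename(columns={
    'sepal length (cm)': 'SepalLengthCm',
    'sepal width (cm)': 'SepalWidthCm',
    'petal length (cm)': 'PetalLengthCm',
    'petal width (cm)': 'PetalWidthCm',
})

# Replace the integer target column with species names
data['Species'] = iris.target_names[data['target'].to_numpy()]
data = data.drop(columns='target')

# Missing values per column, then one overall total
missing_per_column = data.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(f'Total missing values: {total_missing}')
```

Summing the per-column counts gives one number to compare against the answer options, rather than reading five separate values.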
Task 2: Training a Logistic Regression Model
Train a logistic regression model to classify the Iris species and evaluate its performance.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Prepare the features and target variable
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = data['Species']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
Question: What is the accuracy of the logistic regression model on the test set?
A) Around 0.70
B) Around 0.80
C) Around 0.90
D) Around 1.00
Answer: ??
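To see how the accuracy figure relates to the confusion matrix printed above, the sketch below recomputes accuracy as the sum of the diagonal (correct predictions) divided by the total number of predictions. The labels y_true_toy and y_pred_toy are invented purely for illustration and are not results from the Iris model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for illustration only (not Iris results)
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Correct predictions lie on the diagonal of the matrix
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(f'Accuracy from confusion matrix: {accuracy_from_cm}')
print(f'Accuracy from accuracy_score:   {accuracy_score(y_true_toy, y_pred_toy)}')
```

The two printed accuracy values should agree, which is a quick sanity check when reading the Task 2 output.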
Task 3: Hyperparameter Tuning with k-NN
Perform hyperparameter tuning on a k-NN model to find the optimal value of k using cross-validation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Define the range of k values to try
k_values = range(1, 21)
accuracy_scores = []
# Perform cross-validation for each k value
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    accuracy_scores.append(scores.mean())
# Plot accuracy for different values of k
plt.plot(k_values, accuracy_scores, marker='o')
plt.xlabel('k')
plt.ylabel('Cross-Validated Accuracy')
plt.title('k-NN Varying k')
plt.show()
Question: What is the optimal value of k based on cross-validation?
A) 1
B) 3
C) 5
D) 10
Answer: ??
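Reading the best k off the plot can be ambiguous when several values score almost equally. As a minimal sketch, the best k can also be selected programmatically from the mean cross-validated accuracies. To keep the example self-contained it assumes scikit-learn's bundled Iris data rather than iris.csv, so the exact scores may differ from yours; the names best_index and best_k are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled Iris data (assumption: stands in for the iris.csv features and labels)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate k
k_values = list(range(1, 21))
accuracy_scores = [
    cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train,
        cv=5, scoring='accuracy'
    ).mean()
    for k in k_values
]

# np.argmax returns the first maximum, so ties resolve to the smallest k
best_index = int(np.argmax(accuracy_scores))
best_k = k_values[best_index]
print(f'Best k: {best_k} (mean CV accuracy {accuracy_scores[best_index]:.3f})')
```

Only the training split is used for the search, so the test set stays untouched for the final evaluation in Task 4.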
Task 4: Evaluating k-NN Model Performance
Evaluate the performance of the k-NN model with the optimal value of k.
# Train the k-NN model with the optimal value of k (assume k=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict on the test set
y_pred_knn = knn.predict(X_test)
# Evaluate the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'Accuracy: {accuracy_knn}')
print('Classification Report:')
print(classification_report(y_test, y_pred_knn))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_knn))
Question: What is the accuracy of the k-NN model with the optimal value of k on the test set?
A) Around 0.70
B) Around 0.80
C) Around 0.90
D) Around 1.00
Answer: ??