Assignment - Data Science Concepts

The document outlines an assignment on data science concepts, including multiple choice questions related to supervised learning, evaluation metrics, and overfitting. It also includes coding tasks involving data preprocessing, training a logistic regression model, hyperparameter tuning for k-NN, and evaluating model performance. Each section contains questions that require answers based on the tasks performed.

Uploaded by

sarvadnya mense
Copyright
© All Rights Reserved

Assignment - Data Science

Assignment: Introduction to Data Science Concepts

Part 1: Multiple Choice Questions

Question 1: Which of the following is a type of supervised learning algorithm?

A) k-Means Clustering
B) Logistic Regression
C) Principal Component Analysis (PCA)
D) DBSCAN

Answer: ??

Question 2: Which metric is commonly used to evaluate classification models?

A) Mean Squared Error (MSE)
B) Adjusted R-squared
C) Precision
D) Silhouette Score

Answer: ??
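
For reference, precision (option C above) is defined as TP / (TP + FP) for the class treated as positive. The sketch below works that arithmetic out by hand. The labels and the helper name `precision_for_class` are invented for this illustration and are not part of the assignment data.

```python
def precision_for_class(y_true, y_pred, positive):
    """Precision = TP / (TP + FP) for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; a convention chosen for this sketch
    return tp / (tp + fp)


# Hypothetical labels, invented for illustration only
y_true = ["setosa", "setosa", "virginica", "virginica", "setosa"]
y_pred = ["setosa", "virginica", "virginica", "virginica", "setosa"]

precision_virginica = precision_for_class(y_true, y_pred, "virginica")
precision_setosa = precision_for_class(y_true, y_pred, "setosa")
print(precision_virginica, precision_setosa)
```

Here three samples are predicted as virginica and two of them are truly virginica, so the precision for that class is 2/3.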

Question 3: What is the purpose of the train_test_split function in scikit-learn?

A) To split data into k folds for cross-validation
B) To standardize the features of a dataset
C) To split the dataset into training and testing sets
D) To perform hyperparameter tuning

Answer: ??
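
A minimal sketch of calling train_test_split on toy data may help when reasoning about this question. The ten samples below are invented for illustration; only the function and its `test_size` and `random_state` parameters come from scikit-learn.

```python
from sklearn.model_selection import train_test_split

# Ten toy samples, invented for illustration
X = [[i] for i in range(10)]
y = [0, 1] * 5

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

Every input row ends up in exactly one of the two returned subsets.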

Question 4: Which of the following techniques can be used to handle missing values in a dataset?

A) Dropping rows with missing values
B) Imputing missing values with the mean or median
C) Using algorithms that can handle missing values
D) All of the above

Answer: ??
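
The first two techniques listed above can be sketched with pandas as follows. The small DataFrame is invented for illustration, and the column names `length` and `width` are placeholders rather than columns of the assignment's dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, invented for illustration
df = pd.DataFrame({
    "length": [5.1, np.nan, 6.3, 5.8],
    "width": [3.5, 3.0, np.nan, 2.7],
})

# Option A: drop every row that contains a missing value
dropped = df.dropna()

# Option B: fill each gap with that column's median
imputed = df.fillna(df.median())

print(len(dropped))
print(imputed.isnull().sum().sum())
```

Option C is not shown here. As one example, scikit-learn's HistGradientBoostingClassifier accepts NaN inputs directly.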

Question 5: In the context of machine learning, what is overfitting?

A) When a model performs well on the training data but poorly on new, unseen data
B) When a model performs equally well on both training and testing data
C) When a model has too few parameters
D) When a model uses cross-validation for evaluation

Answer: ??
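
One way to observe the gap between training and test performance is to fit a very flexible model to noisy data. The sketch below is an assumption-laden illustration: it uses synthetic data and an unconstrained decision tree, neither of which appears elsewhere in this assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends on the first feature,
# then roughly 25% of the labels are flipped to act as noise
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.25
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A tree with no depth limit can memorize the noisy training labels
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

Comparing the two printed scores shows how much of the training performance carries over to unseen data.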

Part 2: Coding Tasks

Task 1: Data Preprocessing and Exploration

Load the Iris dataset, perform basic data preprocessing, and conduct exploratory data
analysis.

Python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
data = pd.read_csv('iris.csv')

# Display the first few rows of the dataset
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Perform basic statistical analysis
print(data.describe())

# Visualize the pairwise relationships between features
sns.pairplot(data, hue='Species')
plt.show()

Question: How many missing values are there in the Iris dataset?

A) 0
B) 5
C) 10
D) 20

Answer: ??
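
The `isnull().sum()` call in Task 1 reports one count per column. A second `sum()` reduces that to a single number, which is the form this question asks for. The sketch below assumes the copy of Iris bundled with scikit-learn as a stand-in for `iris.csv`; its column names differ from those used in this assignment, and the count for your own file may differ.

```python
from sklearn.datasets import load_iris

# Stand-in for iris.csv: the copy of Iris bundled with scikit-learn
iris = load_iris(as_frame=True)
data = iris.frame  # four feature columns plus a numeric 'target' column

# isnull().sum() counts per column; a second sum() gives one number for the whole frame
missing_per_column = data.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```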

Task 2: Training a Logistic Regression Model

Train a logistic regression model to classify the Iris species and evaluate its performance.

Python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Prepare the features and target variable
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = data['Species']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Question: What is the accuracy of the logistic regression model on the test set?

A) Around 0.70
B) Around 0.80
C) Around 0.90
D) Around 1.00

Answer: ??
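
The accuracy and the confusion matrix printed in Task 2 are related: accuracy is the sum of the diagonal divided by the total count. The sketch below shows that arithmetic on a hypothetical matrix. The counts are invented for illustration and are not the output of the logistic regression model.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class).
# These counts are invented; they are not results from the assignment's model.
cm = np.array([
    [12, 1, 0],
    [2, 10, 2],
    [0, 1, 12],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal divided by row totals
recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall)
```

With 34 of the 40 hypothetical samples on the diagonal, the accuracy is 0.85.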

Task 3: Hyperparameter Tuning with k-NN

Perform hyperparameter tuning on a k-NN model to find the optimal value of k using cross-validation.

Python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Define the range of k values to try
k_values = range(1, 21)
accuracy_scores = []

# Perform cross-validation for each k value
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    accuracy_scores.append(scores.mean())

# Plot accuracy for different values of k
plt.plot(k_values, accuracy_scores, marker='o')
plt.xlabel('k')
plt.ylabel('Cross-Validated Accuracy')
plt.title('k-NN Varying k')
plt.show()

Question: What is the optimal value of k based on cross-validation?

A) 1
B) 3
C) 5
D) 10

Answer: ??
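
Reading the best k off a plot can be ambiguous when several values score almost the same, so it may help to select it programmatically. The sketch below repeats the Task 3 loop and picks the k with the highest mean cross-validated accuracy. It assumes scikit-learn's bundled Iris data as a stand-in for `iris.csv`, and `best_k` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled Iris instead of iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

k_values = list(range(1, 21))
accuracy_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy")
    accuracy_scores.append(scores.mean())

# np.argmax returns the first maximum, so ties go to the smallest k
best_k = k_values[int(np.argmax(accuracy_scores))]
print(f"Best k: {best_k} (mean CV accuracy {max(accuracy_scores):.3f})")
```

The selected value depends on the split and the fold assignment, so a different `random_state` or dataset file may give a different k.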

Task 4: Evaluating k-NN Model Performance

Evaluate the performance of the k-NN model with the optimal value of k.

Python

# Train the k-NN model with the optimal value of k (assume k=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred_knn = knn.predict(X_test)

# Evaluate the model
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'Accuracy: {accuracy_knn}')

print('Classification Report:')
print(classification_report(y_test, y_pred_knn))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_knn))

Question: What is the accuracy of the k-NN model with the optimal value of k on the test set?

A) Around 0.70
B) Around 0.80
C) Around 0.90
D) Around 1.00

Answer: ??
