Carreon WS06

This document contains a student's answers to multiple choice and coding questions about supervised machine learning techniques in Python. It includes examples of classification using k-nearest neighbors on House voting data and linear regression using life expectancy and fertility data from Gapminder. The student performs tasks like data preprocessing, training and testing models, evaluating accuracy at different values of k, and 5-fold cross-validation.

Uploaded by

Keneth Carreon

Name

CARREON, KENETH C. DS100-3 / B9 APPLIED DATA SCIENCE

WORKSHEET #6: SUPERVISED LEARNING WITH PYTHON

1 Date: January 28, 2020


Which of the following example applications of machine learning is a supervised classification problem?
A. Using labeled financial data to predict whether the value of a stock will go up or go down next week
B. Using labeled housing price data to predict the price of a new house based on various features
C. Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles
D. Using labeled financial data to predict what the value of a stock will be next week
Answer

A

2 Date: January 28, 2020


Import house-votes-84 (edited).csv. Write the code necessary to import and examine this dataset. Which of the
following statements is not true?
A. The DataFrame has a total of 232 rows and 17 columns.
B. Except for ‘party’, all of the columns are of type int64.
C. The first row of the DataFrame consists of votes by a Democrat and the second row consists of votes by a Republican.
D. There are 17 predictor variables, or features, in this DataFrame.
E. The target variable in this DataFrame is ‘party’.
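Code

import pandas as pd
df = pd.read_csv('house-votes-84 (edited).csv')
# Examine the data: first rows, dimensions, and column dtypes
print(df.head())
print(df.shape)
df.info()  # info() prints its own report and returns None

Answer
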
D
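
The statements in number 2 can be checked against the DataFrame itself. The sketch below uses a small made-up frame (the column names `bill_1` and `bill_2` and all values are hypothetical stand-ins, not the real CSV) to show how the row and column counts, the dtypes, and the number of predictors might be read off.

```python
import pandas as pd

# Hypothetical stand-in for the House votes data: a 'party' target
# column plus two vote columns coded 0/1 (not the real CSV).
df = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat'],
    'bill_1': [1, 0, 1],
    'bill_2': [0, 1, 1],
})

# Total rows and columns, including the target column
n_rows, n_cols = df.shape

# Predictors are every column except the target
n_features = df.drop('party', axis=1).shape[1]

# Columns whose dtype is not an integer type
non_int_cols = [c for c in df.columns
                if not pd.api.types.is_integer_dtype(df[c])]

print(n_rows, n_cols)
print(n_features)
print(non_int_cols)
```

Because the target is one of the columns, the number of predictors is always one less than the column count.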

3 Date: January 28, 2020


Perform visual exploratory data analysis on the house votes dataset. Use Seaborn’s countplot to visualize the votes to the
satellite testing bill, grouped by party. Include the following line before the show function:
plt.xticks([0,1], ['No', 'Yes'])
Do the same for the missile bill. Write the code here and answer the question:
Of the two bills, which one/s do Democrats vote resoundingly in favor of, compared to Republicans?
A. Missile Bill
B. Satellite Bill
C. Both Missile and Satellite Bills
D. Neither Missile nor Satellite Bill
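Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('house-votes-84 (edited).csv')
# Count of No/Yes votes on the satellite testing bill, split by party
plt.figure()
sns.countplot(x='sat_test', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# Repeat the plot with the missile bill's column in place of 'sat_test'

Answer

C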
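
The comparison that the countplots show can also be read as numbers. This is a minimal sketch, assuming votes are coded 0 = No and 1 = Yes. The toy frame and the column name `missile` are hypothetical placeholders, and the real column name in the edited CSV may differ.

```python
import pandas as pd

# Hypothetical toy data; in the worksheet this would be the House votes DataFrame.
df = pd.DataFrame({
    'party':    ['democrat', 'democrat', 'democrat',
                 'republican', 'republican', 'republican'],
    'sat_test': [1, 1, 0, 0, 0, 1],
    'missile':  [1, 1, 1, 0, 0, 0],   # assumed column name
})

def yes_share_by_party(frame, bill):
    """Return the fraction of 'Yes' (1) votes on `bill` for each party."""
    return frame.groupby('party')[bill].mean()

# Print the Yes share per party for each bill
for bill in ['sat_test', 'missile']:
    print(bill)
    print(yes_share_by_party(df, bill))
```

A large gap between the two parties' shares on a bill corresponds to the lopsided bars in the countplot.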


4 Date: January 28, 2020


Predict the party affiliation of the House member whose votes have been recorded in the file named x_new.csv. Write the code
here to achieve the following output:
Party Prediction: [‘democrat’/’republican’]
Code

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Load the labeled votes and the new member's votes
df = pd.read_csv('house-votes-84 (edited).csv')
x_new = pd.read_csv('x_new.csv')

# Target is the party label; features are the remaining vote columns
y = df['party'].values
X = df.drop('party', axis=1).values

# X is already 2-D (one row per member, one column per vote), so it is fit as is
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# Predict the party from the new member's vote values
new_prediction = knn.predict(x_new.values)
print("Party Prediction: {}".format(new_prediction))

Output

Party Prediction: [‘democrat’/’republican’]

5 Date: January 28, 2020


Use train_test_split from sklearn on your House votes data. Use 70% of the data for training and the rest for testing.
Add the following arguments to train_test_split: random_state = 21, stratify = y. Print out the predictions
for the test set and the model score. Write the code here and submit a copy of the output through Cardinal Edge Worksheet
Submission.
Code Output

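import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the House votes data: 'party' is the target, the votes are the features
df = pd.read_csv('house-votes-84 (edited).csv')
y = df['party'].values
X = df.drop('party', axis=1).values

# Keep 70% for training and hold out 30% for testing, stratified by party
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

# Print the test-set predictions and the model score (accuracy)
y_pred = knn.predict(X_test)
print("Test set predictions:\n{}".format(y_pred))
print(knn.score(X_test, y_test))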
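
The effect of the `stratify = y` argument can be seen on a small example. This is a sketch on made-up labels (a 60/40 class mix and a placeholder feature column, both hypothetical) showing that a 70/30 stratified split keeps the same class ratio in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of one class, 40 of the other.
y = np.array(['democrat'] * 60 + ['republican'] * 40)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# With stratify=y, each part keeps the original class proportions.
train_share = np.mean(y_train == 'democrat')
test_share = np.mean(y_test == 'democrat')

print(len(y_train), len(y_test))
print(train_share, test_share)
```

Without stratification the split is purely random, so the class mix in a small test set can drift away from the mix in the full data.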


6 Date: January 28, 2020


Use a for loop to determine the training accuracy and testing accuracy for the House votes data at k-values from 1 to 9. Plot the
results. Write the code here and submit a copy of the output through Cardinal Edge Worksheet Submission.
Code Output

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Uses X_train, X_test, y_train, y_test from the split in number 5
neighbors = np.arange(1, 10)  # k = 1, 2, ..., 9
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    # Fit a k-NN classifier for this k and record both accuracies
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

7 Date: January 25, 2020


Which of the following example applications of machine learning is best framed as a regression problem?
A. An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item
B. A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant
C. A restaurant using review data to ascribe positive or negative sentiment to a given review
D. A bike share company using time and weather data to predict the number of bikes being rented at any given hour
Answer

D

8 Date: January 28, 2020


Import the gapminder file. Pre-process the data by examining its features and converting the DataFrame into arrays and
reshaping them for regression. We want to see how life expectancy varies with fertility. Write the necessary code here.
Code
import numpy as np
import pandas as pd
df = pd.read_csv('gapminder_P06.csv')
y = df['life_exp'].values
X = df['fertility'].values
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))
y = y.reshape(-1,1)
X = X.reshape(-1,1)
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))
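
scikit-learn estimators expect the feature array to be 2-D, with shape (n_samples, n_features), which is why a single column is reshaped before fitting. A small sketch with made-up fertility values (hypothetical, not taken from the gapminder file):

```python
import numpy as np

# Hypothetical fertility values; the real ones come from the gapminder CSV.
X = np.array([2.1, 4.5, 1.8, 6.0])
print(X.shape)            # (4,)  one axis of length 4

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column.
X_2d = X.reshape(-1, 1)
print(X_2d.shape)         # (4, 1)  4 samples, 1 feature
```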

9 Date: January 28, 2020


Perform regression on the data (life expectancy as a function of fertility). Prepare a plot showing the data points (in blue) and the
linear model (in red). Print out the regression score. Write the code here and submit a copy of the output through Cardinal Edge
Worksheet Submission.
Code Output

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv('gapminder_P06.csv')
y = df['life_exp'].values
X = df['fertility'].values
# Reshape to column vectors: scikit-learn expects 2-D feature arrays
y_life = y.reshape(-1,1)
X_fertility = X.reshape(-1,1)
reg = LinearRegression()
# Evenly spaced fertility values over the observed range, for drawing the fitted line
prediction_space = np.linspace(X_fertility.min(), X_fertility.max()).reshape(-1,1)
reg.fit(X_fertility, y_life)
y_pred = reg.predict(prediction_space)
# Regression score (R^2) on the full data
print(reg.score(X_fertility, y_life))
plt.scatter(X_fertility, y_life, color='blue')
plt.plot(prediction_space, y_pred, color='red', linewidth=3)
plt.show()
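
The regression score printed by `reg.score` is the coefficient of determination, R^2. As a sketch on made-up data (the five points below are hypothetical, not gapminder values), it can be recomputed by hand from the residual and total sums of squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a roughly linear, decreasing trend.
X = np.array([1.5, 2.0, 3.0, 4.5, 6.0]).reshape(-1, 1)
y = np.array([80.0, 78.0, 72.0, 65.0, 58.0])

reg = LinearRegression().fit(X, y)
y_hat = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, reg.score(X, y))
```

A score near 1 means the line explains most of the variation in the target; a score near 0 means it does little better than predicting the mean.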

10 Date: January 28, 2020


Perform a 5-fold cross-validation on the data from the previous numbers. Write lines of code to print out the individual validation
scores and the average validation score.
Code

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
# X_fertility and y_life are the reshaped arrays from number 9
cv_scores = cross_val_score(reg, X_fertility, y_life, cv=5)
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

Output
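
For reference, `cross_val_score` with `cv=5` on a regressor amounts to fitting on four folds and scoring on the fifth, five times over. The sketch below runs on synthetic data (the data and coefficients are made up for illustration) and reproduces the scores with an explicit `KFold` loop:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical noisy linear data standing in for fertility / life expectancy.
rng = np.random.default_rng(0)
X = rng.uniform(1, 7, size=(50, 1))
y = 85 - 4 * X[:, 0] + rng.normal(0, 2, size=50)

# Manual 5-fold loop: train on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))  # R^2 on held-out fold

# The same procedure through the helper function
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(np.array(manual_scores))
print(cv_scores)
print(np.mean(cv_scores))
```

For a regressor, an integer `cv` uses unshuffled `KFold` splits, which is why the manual loop and the helper agree fold by fold.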
