Name: CARREON, KENETH C.
DS100-3 / B9 APPLIED DATA SCIENCE
WORKSHEET #6: SUPERVISED LEARNING WITH PYTHON
1 Date: January 28, 2020
Which of the following example applications of machine learning is a supervised classification problem?
A. Using labeled financial data to predict whether the value of a stock will go up or go down next week
B. Using labeled housing price data to predict the price of a new house based on various features
C. Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles
D. Using labeled financial data to predict what the value of a stock will be next week
Answer: A
2 Date: January 28, 2020
Import house-votes-84 (edited).csv. Write the code necessary to import and examine this dataset. Which of the following statements is not true?
A. The DataFrame has a total of 232 rows and 17 columns.
B. Except for ‘party’, all of the columns are of type int64.
C. The first row of the DataFrame consists of votes by a Democrat and the second row consists of votes by a Republican.
D. There are 17 predictor variables, or features, in this DataFrame.
E. The target variable in this DataFrame is ‘party’.
Code
import pandas as pd
df = pd.read_csv('house-votes-84 (edited).csv')
df.info()         # prints the row/column counts and dtypes itself; it returns None
print(df.head())  # first rows, to check the party recorded in rows 1 and 2
Answer: D (the 17 columns include the target ‘party’, which leaves 16 predictor variables)
3 Date: January 28, 2020
Perform visual exploratory data analysis on the house votes dataset. Use Seaborn’s countplot to visualize the votes to the satellite testing bill, grouped by party. Include the following line before the show function:
plt.xticks([0,1], ['No', 'Yes'])
Do the same for the missile bill. Write the code here and answer the question:
Of the two bills, which one/s do Democrats vote resoundingly in favor of, compared to Republicans?
A. Missile Bill
B. Satellite Bill
C. Both Missile and Satellite Bills
D. Neither Missile nor Satellite Bill
Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('house-votes-84 (edited).csv')
# Satellite testing bill: count of No/Yes votes, split by party
plt.figure()
sns.countplot(x='sat_test', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
# For the missile bill, repeat the plotting calls above with x set to that bill's column name in the CSV
Answer: C
4 Date: January 28, 2020
Predict the party affiliation of the House member whose votes have been recorded in the file named x_new.csv. Write the code
here to achieve the following output:
Party Prediction: ['democrat'/'republican']
Code
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv('house-votes-84 (edited).csv')
x_new = pd.read_csv('x_new.csv')
# Target is the party label; features are the remaining vote columns
y = df['party'].values
X = df.drop('party', axis=1).values
# No reshaping is needed: X is already 2-D (samples x features) and y is 1-D
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)
# Pass the values read from x_new.csv, not the string 'x_new'
new_prediction = knn.predict(x_new.values)
print("Party Prediction: {}".format(new_prediction))
Output
Party Prediction: ['democrat'/'republican']
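The input shape that `knn.predict` expects can be checked on a small example. The sketch below is only an illustration: the three bills, the six members, and every name ending in `_toy` are made up and do not come from the House votes file. It shows that a single member's votes must be passed as a 2-D array with one row.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy votes (1 = yes, 0 = no) on three made-up bills
X_toy = np.array([
    [1, 1, 1], [1, 1, 0], [1, 0, 1],   # labeled democrat below
    [0, 0, 0], [0, 0, 1], [0, 1, 0],   # labeled republican below
])
y_toy = np.array(["democrat"] * 3 + ["republican"] * 3)

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# One new member: reshape the 1-D vote vector to shape (1, n_features)
one_member = np.array([1, 1, 1])
toy_prediction = knn_toy.predict(one_member.reshape(1, -1))
print("Party Prediction: {}".format(toy_prediction))
```

The three nearest neighbors of the new member are the three rows labeled democrat, so the classifier returns a one-element array holding that label.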
5 Date: January 28, 2020
Use train_test_split from sklearn on your House votes data. Use 70% of the data for training and the rest for testing.
Add the following arguments to train_test_split: random_state = 21, stratify = y. Print out the predictions
for the test set and the model score. Write the code here and submit a copy of the output through Cardinal Edge Worksheet
Submission.
Code Output
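import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Use the House votes data, as the question asks (not the digits dataset)
df = pd.read_csv('house-votes-84 (edited).csv')
y = df['party'].values
X = df.drop('party', axis=1).values
# 70% for training, 30% for testing, stratified on the party label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print("Test set predictions: {}".format(knn.predict(X_test)))
print(knn.score(X_test, y_test))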
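What the `stratify = y` argument does can be seen on a made-up label array. The sketch below assumes a hypothetical 60/40 split between the two parties (not the real proportions in the House votes file) and checks that both the training and the testing portions keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 democrats and 40 republicans (illustrative only)
y_demo = np.array(["democrat"] * 60 + ["republican"] * 40)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=21, stratify=y_demo
)

# Share of democrats in each portion; stratification keeps both at the overall share
train_share = np.mean(y_tr == "democrat")
test_share = np.mean(y_te == "democrat")
print(len(y_tr), len(y_te))
print(train_share, test_share)
```

Without `stratify`, the class shares in the two portions would depend on the random draw, which can distort the score on a small test set.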
6 Date: January 28, 2020
Use a for loop to determine the training accuracy and testing accuracy for the House votes data at k-values from 1 to 9. Plot the
results. Write the code here and submit a copy of the output through Cardinal Edge Worksheet Submission.
Code Output
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
# Reuses X_train, X_test, y_train, y_test from the split in item 5
neighbors = np.arange(1, 10)  # k = 1, 2, ..., 9 (the stop value is exclusive)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
7 Date: January 25, 2020
Which of the following example applications of machine learning is best framed as a regression problem?
A. An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item
B. A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant
C. A restaurant using review data to ascribe positive or negative sentiment to a given review
D. A bike share company using time and weather data to predict the number of bikes being rented at any given hour
Answer: D (the target is a count, a numeric quantity; options A to C all predict a category)
8 Date: January 28, 2020
Import the gapminder file. Pre-process the data by examining its features and converting the DataFrame into arrays and
reshaping them for regression. We want to see how life expectancy varies with fertility. Write the necessary codes here.
Code
import numpy as np
import pandas as pd
df = pd.read_csv('gapminder_P06.csv')
y = df['life_exp'].values
X = df['fertility'].values
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))
y = y.reshape(-1,1)
X = X.reshape(-1,1)
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))
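The reshaping step can be checked in isolation. The sketch below uses a few made-up fertility values (not taken from the gapminder file) to show how `reshape(-1,1)` turns a 1-D array into the single-column 2-D array that scikit-learn estimators expect for their features.

```python
import numpy as np

# Made-up fertility values, for illustration only
fertility_demo = np.array([1.5, 2.0, 3.2, 4.8])
print(fertility_demo.shape)

# -1 lets NumPy infer the number of rows; 1 fixes a single column
fertility_column = fertility_demo.reshape(-1, 1)
print(fertility_column.shape)
```

The values themselves are unchanged; only the shape goes from `(4,)` to `(4, 1)`.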
9 Date: January 28, 2020
Perform regression on the data (life expectancy as a function of fertility). Prepare a plot showing the data points (in blue) and the
linear model (in red). Print out the regression score. Write the code here and submit a copy of the output through Cardinal Edge
Worksheet Submission.
Code Output
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv('gapminder_P06.csv')
y = df['life_exp'].values
X = df['fertility'].values
# scikit-learn expects 2-D arrays: one row per sample, one column per feature
y_life = y.reshape(-1,1)
X_fertility = X.reshape(-1,1)
reg = LinearRegression()
reg.fit(X_fertility, y_life)
# Evenly spaced fertility values across the observed range, for drawing the fitted line
prediction_space = np.linspace(X_fertility.min(), X_fertility.max()).reshape(-1,1)
y_pred = reg.predict(prediction_space)
print(reg.score(X_fertility, y_life))
plt.scatter(X_fertility, y_life, color='blue')
plt.plot(prediction_space, y_pred, color='red', linewidth=3)
plt.show()
10 Date: January 28, 2020
Perform a 5-fold cross validation on the data from the previous items. Write lines of code to print out the individual validation
scores and the average validation score.
Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Uses the reshaped 2-D arrays X_fertility and y_life from item 9
reg = LinearRegression()
cv_scores = cross_val_score(reg, X_fertility, y_life, cv=5)
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
Output
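What `cross_val_score` with `cv=5` does for a regressor can be sketched by hand. The example below runs on synthetic data (the arrays `X_cv` and `y_cv` are generated here and are not the gapminder values). It splits the rows into five consecutive folds with `KFold`, fits on four folds, scores R^2 on the held-out fold, and compares the result with the built-in helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: a noisy decreasing line (illustrative only)
rng = np.random.default_rng(0)
X_cv = rng.uniform(1, 7, size=(50, 1))
y_cv = 80 - 4 * X_cv[:, 0] + rng.normal(0, 3, size=50)

# Manual 5-fold loop: train on four folds, score on the remaining one
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_cv):
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

# Built-in helper; for a regressor, cv=5 uses unshuffled KFold as well
builtin_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)

print(manual_scores)
print(builtin_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(builtin_scores)))
```

Each score is the R^2 on a fold the model did not see during fitting, so the average is a less optimistic estimate than the score computed on the training data itself.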