83 Sklearn Pipeline
May 5, 2024
data = pd.read_csv('../assets/boston.csv')
data.head()
(truncated output: the first rows of the Boston housing data, columns CRIM through TAX)
X = data.drop('MEDV', axis=1)
y = data.MEDV
Now we can train the KNeighborsRegressor model. This model makes predictions by averaging the target values of the 5 nearest neighbors of the point you want to predict.
[ ]: # let's train the KNeighborsRegressor (assumed: fit and score on the full data)
knn = KNeighborsRegressor().fit(X, y)
score = knn.score(X, y)
print('score:', score)
score: 0.716098217736928
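To make the "averaging the 5 nearest neighbors" idea concrete, here is a small verification sketch (not part of the original notebook; the names point and idx are introduced here) that compares one prediction with the mean target of that point's 5 nearest neighbors:
[ ]: # sketch: the prediction is the mean target of the 5 nearest training points
from sklearn.neighbors import NearestNeighbors
point = X.iloc[[0]]  # any single row will do
_, idx = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(point)
print(knn.predict(point)[0])      # model prediction for this point
print(y.iloc[idx[0]].mean())      # mean of its 5 nearest targets -- same value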
Now let's visualize the performance of the model. The scatter plot shows the true labels against the predicted labels. Do you think the model is doing well?
[ ]: # looking at the performance
predicted_y = knn.predict(X)
plt.scatter(y, predicted_y)
plt.title('Model Performance')
plt.xlabel('True y')
plt.ylabel('Predicted y')
plt.show()
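One way to judge such a scatter visually is to overlay a y = x reference line: the closer the points hug that diagonal, the better the predictions. A small sketch (the reference line is an addition, not from the original notebook):
[ ]: # overlay a y = x reference line to judge the fit visually
plt.scatter(y, predicted_y)
lims = [y.min(), y.max()]
plt.plot(lims, lims, 'r--', label='perfect prediction')
plt.xlabel('True y')
plt.ylabel('Predicted y')
plt.legend()
plt.show()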
0.2 Some feature selection.
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
In week 7 we learned that having irrelevant features in your data can decrease the accuracy of many models. In the code below, we try to find the features that contribute most to the outcome variable.
[ ]: from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression  # score function for ANOVA with continuous outcome
# using SelectKBest
test_reg = SelectKBest(score_func=f_regression, k=6)
fit_boston = test_reg.fit(X, y)
indexes = fit_boston.get_support(indices=True)
print(fit_boston.scores_)
print(indexes)
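The printed indices identify the six highest-scoring features. The scatter plot discussed below presumably comes from retraining the regressor on just those columns; a hedged sketch of that step (the names selected_X, knn_sel and selected_predicted_y are assumptions introduced here, not from the notebook):
[ ]: # hedged sketch: retrain using only the SelectKBest-selected columns
from sklearn.neighbors import KNeighborsRegressor
print(X.columns[indexes])                      # names of the selected features
selected_X = X.iloc[:, indexes]
knn_sel = KNeighborsRegressor().fit(selected_X, y)
selected_predicted_y = knn_sel.predict(selected_X)
plt.scatter(y, selected_predicted_y)
plt.title('Model Performance (selected features)')
plt.xlabel('True y')
plt.ylabel('Predicted y')
plt.show()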
I do not know about you, but I notice a meaningful improvement in the predictions made by the model, judging from this scatter plot.
[ ]: # standardize the selected numeric features (StandardScaler assumed, matching the pipeline below)
num_cols = ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']
standardized_data_num = StandardScaler().fit_transform(data[num_cols])
standardized_data_num_df = pd.DataFrame(
standardized_data_num,
columns=num_cols
) # converting the standardized array to a dataframe
[ ]: one_hot_encoder = OneHotEncoder()
encoded_data_cat = one_hot_encoder.fit_transform(data[['CHAS', 'RAD']])
encoded_data_cat_array = encoded_data_cat.toarray()
# Get feature names
feature_names = one_hot_encoder.get_feature_names_out(['CHAS', 'RAD'])
encoded_data_cat_df = pd.DataFrame(
data=encoded_data_cat_array,
columns=feature_names
)
Let us combine that with the standardized numeric features to form a new standardized X set.
[ ]: standardized_new_X = pd.concat(
[standardized_data_num_df, encoded_data_cat_df],
axis=1
)
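The manually preprocessed features can now be used to refit the regressor, which gives us a score to compare against the pipeline later. A hedged sketch (manual_knn is an assumed name, and the score is again computed on the full data):
[ ]: # hedged sketch: refit the regressor on the manually preprocessed features
from sklearn.neighbors import KNeighborsRegressor
manual_knn = KNeighborsRegressor().fit(standardized_new_X, y)
print('score:', manual_knn.score(standardized_new_X, y))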
The sklearn Pipeline allows you to sequentially apply a list of transformers to preprocess the data
and, if desired, conclude the sequence with a final predictor for predictive modeling.
Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and
transform methods. The final estimator only needs to implement fit.
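As a tiny illustration of that contract (a sketch, not from the notebook): StandardScaler implements fit and transform, so it can sit in the middle of a pipeline, while KNeighborsRegressor only needs fit and therefore goes last.
[ ]: # illustration: transformers in the middle, an estimator at the end
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
demo_pipe = Pipeline([
('scale', StandardScaler()),       # implements fit() and transform()
('knn', KNeighborsRegressor())     # final estimator: fit()/predict() is enough
])
demo_pipe.fit(X, y)                    # each step is fit and applied in order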
Let us build a model that puts the transformation and modelling steps together into one pipeline object.
[ ]: # let's import the Pipeline from sklearn
from sklearn.pipeline import Pipeline
[ ]: # Preprocessing steps
from sklearn.compose import ColumnTransformer  # assumed: applies each transformer to its own columns
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder()
preprocessor = ColumnTransformer([
('num', numeric_transformer, ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']),
('cat', categorical_transformer, ['CHAS', 'RAD'])
])
# Pipeline
pipe = Pipeline([
('preprocessor', preprocessor),
('model', KNeighborsRegressor())
])
[ ]: # fit the pipeline and look at its predictions (assumed: fit on the full data, as before)
pipe.fit(X, y)
pipe_predicted_y = pipe.predict(X)
plt.scatter(y, pipe_predicted_y)
plt.title('Pipe Performance')
plt.xlabel('True y')
plt.ylabel('Pipe Predicted y')
plt.show()
We can observe that the model still gets the same good score, but now all the transformation steps, on both the numeric and categorical variables, live in a single pipeline object together with the model.
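A practical payoff of bundling everything into one object is that the whole pipeline can be cross-validated as a unit, so the scaler and encoder are refit on each training fold rather than leaking information from the test fold. A short sketch (cross_val_score and the 5-fold choice are additions here, and this assumes every RAD/CHAS category appears in each training fold; otherwise OneHotEncoder(handle_unknown='ignore') would be needed):
[ ]: # sketch: cross-validate the whole pipeline as a single object
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(pipe, X, y, cv=5)   # preprocessing is refit inside each fold
print('mean CV score:', cv_scores.mean())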