End To End Machine Learning Project-2
A Step-by-Step Guide
Julia Wieczorek
August 1, 2024
1 Overview
In this tutorial, we will walk through a complete machine learning project using the
California housing dataset. The objective is to predict housing prices based on various
features. This project involves the following key steps:
2.1 Instructions
1. Define the Objective:
• Explain the business problem: predict housing prices to support better decision-making.
• Identify the target variable: median house value.
2. Performance Measure:
• Use Root Mean Square Error (RMSE) as the performance metric.
3. Assumptions:
• Highlight any assumptions about the data or project scope.
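For reference, RMSE is the square root of the mean squared difference between predictions and actual values, so large errors weigh more heavily than small ones. The following is a minimal sketch with NumPy; the function name `rmse` and the sample values are illustrative assumptions and do not come from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical median house values (in dollars) and model predictions.
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 330_000]
error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

Because the result is in the same unit as the target, it can be read directly as a typical prediction error in dollars.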
3.1 Instructions
1. Import Libraries:
import os
import tarfile
import urllib.request
import pandas as pd
Listing 1: Import necessary libraries
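These imports suggest that the data is downloaded as a compressed archive, extracted, and read with pandas. One way this could look is sketched below. The function names, the archive name `housing.tgz`, and the CSV name `housing.csv` are assumptions for illustration; the download URL and target directory are left as parameters.

```python
import os
import tarfile
import urllib.request

import pandas as pd

def fetch_housing_data(housing_url, housing_path):
    """Download a .tgz archive from housing_url and extract it into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")  # assumed archive name
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path):
    """Read the extracted CSV (assumed to be named housing.csv) into a DataFrame."""
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))
```

Keeping the download and the load in separate functions makes it possible to refresh the data without touching the rest of the code.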
4.1 Instructions
1. Examine the Data:
housing.head()
Listing 4: View the first few rows of the data
(a) Histograms:
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()
Listing 7: Plot histograms for numerical attributes
5.1 Instructions
1. Create a Test Set:
import numpy as np
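Later steps refer to a `strat_train_set`, which points to a stratified split. The sketch below shows one way to build such a split with scikit-learn's `StratifiedShuffleSplit`. The synthetic DataFrame, the `income_cat` helper column, the bin edges, and the 20% test size are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing DataFrame (values are made up).
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "median_house_value": rng.uniform(50_000, 500_000, size=1000),
})

# Bucket income into five categories to stratify on (bin edges are an assumption).
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))

# Drop the helper column so both sets keep the original schema.
strat_train_set = housing.iloc[train_index].drop(columns="income_cat")
strat_test_set = housing.iloc[test_index].drop(columns="income_cat")
```

Stratifying on a binned attribute keeps the proportion of each category roughly the same in the training and test sets, which matters most when the dataset is small.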
5. Data Cleaning:
from sklearn.impute import SimpleImputer
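As a small illustration of what `SimpleImputer` does with the median strategy, the sketch below fills missing values in a toy array. The array contents are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value per column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# The learned per-column medians are stored in statistics_.
print(imputer.statistics_)
print(X_filled)
```

The imputer learns the medians during `fit`, so the same values can later be applied to the test set with `transform`.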
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)
Listing 14: Encode categorical attributes
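To see what the encoder produces, the sketch below runs it on a tiny made-up column. The column name `ocean_proximity` and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; the real data may use other categories.
housing_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]}
)

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # SciPy sparse matrix by default

# Convert to a dense array only for inspection; sparse storage saves memory.
dense = housing_cat_1hot.toarray()
print(encoder.categories_)
print(dense)
```

Each row contains a single 1 in the column that corresponds to its category, and `categories_` records the column order.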
7. Feature Scaling:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
housing_tr_scaled = scaler.fit_transform(housing_tr)
Listing 15: Scale the numerical features
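The sketch below illustrates standardization on a toy array and shows that new data should be transformed with the statistics learned from the training data. The array values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data with two features on very different scales.
housing_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
housing_tr_scaled = scaler.fit_transform(housing_tr)

# New data reuses the training mean and standard deviation (transform, not fit).
new_scaled = scaler.transform(np.array([[2.0, 200.0]]))

print(housing_tr_scaled.mean(axis=0))  # approximately 0 per column
print(housing_tr_scaled.std(axis=0))   # approximately 1 per column
```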
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
Listing 16: Create custom transformers for additional features
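For context, a complete transformer of this kind might look like the sketch below. The column indices, the default value of `add_bedrooms_per_room`, and the toy input are assumptions; the indices must match the actual column order of the data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed column positions in the input array; adjust to the real column order.
rooms_ix, bedrooms_ix, population_ix, households_ix = 0, 1, 2, 3

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # default is an assumption
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# Toy rows: total rooms, total bedrooms, population, households.
X = np.array([[100.0, 20.0, 50.0, 10.0], [200.0, 50.0, 120.0, 40.0]])
extra = CombinedAttributesAdder(add_bedrooms_per_room=False).transform(X)
extra_all = CombinedAttributesAdder(add_bedrooms_per_room=True).transform(X)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params`, `set_params`, and `fit_transform`, so the class can be used inside a `Pipeline` and tuned like any other hyperparameter.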
9. Transformation Pipelines:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
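A numerical pipeline is typically combined with categorical encoding through `ColumnTransformer`, which applies each transformer to its own set of columns. The sketch below shows this on a toy DataFrame. The column names and values are assumptions, and the custom attribute step is left out to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing numerical value and one categorical column.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 3.8],
    "housing_median_age": [41.0, 21.0, 52.0, 30.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})
num_attribs = ["median_income", "housing_median_age"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Numerical columns go through num_pipeline, categorical ones are one-hot encoded.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
```

The output has one column per numerical attribute plus one column per category, and the fitted `full_pipeline` can be reused on the test set with `transform`.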
6.1 Instructions
1. Train a Linear Regression Model:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, strat_train_set["median_house_value"])
Listing 18: Train a Linear Regression model
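Once the model is fitted, a natural next step is to measure its RMSE on the training data. The sketch below does this on synthetic data; the feature matrix, labels, and coefficients are made up so that the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the prepared features and the labels.
rng = np.random.default_rng(0)
housing_prepared = rng.normal(size=(200, 3))
true_coef = np.array([50_000.0, -20_000.0, 10_000.0])
housing_labels = (200_000 + housing_prepared @ true_coef
                  + rng.normal(scale=5_000, size=200))

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# RMSE on the training set: a first, optimistic estimate of model error.
housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
print(f"Training RMSE: {lin_rmse:.0f}")
```

A training RMSE that is large relative to the range of house values suggests underfitting, while a value near zero on a flexible model usually signals overfitting.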
display_scores(tree_rmse_scores)
Listing 22: Use cross-validation for model evaluation
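A self-contained sketch of K-fold cross-validation is shown below. It assumes that `tree_rmse_scores` comes from a `DecisionTreeRegressor` and that `display_scores` prints the scores with their mean and standard deviation; the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def display_scores(scores):
    """Print the per-fold scores with their mean and standard deviation."""
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

# Synthetic regression data standing in for the prepared housing set.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=300)

tree_reg = DecisionTreeRegressor(random_state=42)
# scikit-learn maximizes scores, so the MSE is returned negated.
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)
```

The standard deviation across folds indicates how much the estimate depends on the particular split, which a single validation set cannot show.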
7 Fine-tune the Model
Objective: Optimize model performance through hyperparameter tuning.
7.1 Instructions
1. Grid Search:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
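The parameter names in this grid match those of a random forest, so the sketch below runs the search with `RandomForestRegressor`. The choice of estimator, the five folds, the scoring metric, and the synthetic data are assumptions; the data has eight features so that `max_features=8` is valid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with 8 features, standing in for the prepared housing set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=200)

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X, y)  # 12 + 6 = 18 combinations, 5 folds each

best_params = grid_search.best_params_
best_rmse = np.sqrt(-grid_search.best_score_)
print(best_params, best_rmse)
```

If the best value sits at the edge of a range, such as the largest `n_estimators`, it is usually worth extending the grid in that direction and searching again.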
8.1 Instructions
1. Evaluate the Model on Test Set:
final_model = grid_search.best_estimator_
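The remaining evaluation steps can be sketched as follows: transform the test features with the preprocessing fitted on the training data, predict, and compute the RMSE. The data, the scaler standing in for the full preprocessing pipeline, and the model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic data split into training and held-out test portions.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(300, 4))
y = X[:, 0] * 10.0 + X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)
X_train, X_test, y_train, y_test = X[:240], X[240:], y[:240], y[240:]

# Preprocessing is fitted on the training data only.
scaler = StandardScaler().fit(X_train)
final_model = RandomForestRegressor(n_estimators=30, random_state=42)
final_model.fit(scaler.transform(X_train), y_train)

# On the test set, call transform (never fit) to avoid leaking test statistics.
X_test_prepared = scaler.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(f"Test RMSE: {final_rmse:.2f}")
```

The test set should be used once, after tuning is finished; adjusting hyperparameters to improve this number would make it an optimistic estimate.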
• Prepare a report with key findings, model performance, and next steps.
9.1 Instructions
1. Deployment:
• Explain how to deploy the model using Flask, FastAPI, or a cloud platform.
• Discuss the importance of monitoring performance and retraining as needed.
2. Monitoring:
3. Maintenance:
• Schedule regular maintenance checks to ensure data integrity and model accuracy.
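As one possible form of monitoring, the sketch below compares the RMSE on recently labeled data with a baseline and flags the model for retraining when the error grows too much. The function name `needs_retraining`, the 10% tolerance, and all values are assumptions chosen for illustration.

```python
import numpy as np

def needs_retraining(y_true, y_pred, baseline_rmse, tolerance=0.10):
    """Return (flag, live_rmse).

    flag is True when the RMSE on recent data exceeds the baseline RMSE
    by more than the given relative tolerance.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    live_rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
    return live_rmse > baseline_rmse * (1.0 + tolerance), live_rmse

# Hypothetical recent observations and the predictions served for them.
flag, live_rmse = needs_retraining(
    y_true=[200_000, 310_000, 150_000],
    y_pred=[260_000, 250_000, 210_000],
    baseline_rmse=48_000,
)
print(flag, live_rmse)
```

A check like this only works when true labels arrive after prediction time; when they do not, monitoring the distribution of input features is a common alternative.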
10 Conclusion
This guide has walked you through the entire process of building a machine learning
model from scratch, including data preparation, model selection, training, evaluation,
and deployment. By following these steps, you can develop a robust machine learning
solution for predicting California housing prices.
References
[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, O'Reilly Media, 2019.