
End-to-End Machine Learning Project:

A Step-by-Step Guide
Julia Wieczorek
August 1, 2024

1 Overview
In this tutorial, we will walk through a complete machine learning project using the
California housing dataset. The objective is to predict housing prices based on various
features. This project involves the following key steps:

1. Frame the Problem

2. Get the Data

3. Explore the Data

4. Prepare the Data for Machine Learning Algorithms

5. Select and Train a Model

6. Fine-tune the Model

7. Present the Solution

8. Launch, Monitor, and Maintain the System

Let’s dive into each step with detailed instructions.

2 Frame the Problem


Objective: Predict median housing prices in California districts using various features.

2.1 Instructions
1. Define the Objective:

• Explain the business problem: predict housing prices to support better decision-making.
• Identify the target variable: median house value.

2. Performance Measure:

• Use Root Mean Square Error (RMSE) as the performance metric; its definition is given after this list.

3. Assumptions:

• Highlight any assumptions about the data or project scope.
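For reference, RMSE measures the typical size of the prediction error in the target's own units (here, dollars). For a dataset X of m districts with predictions h(x^{(i)}) and true labels y^{(i)}:

RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^{2} }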

3 Get the Data


Objective: Access and load the California housing dataset.

3.1 Instructions
1. Import Libraries:
import os
import tarfile
import urllib.request
import pandas as pd

Listing 1: Import necessary libraries

2. Fetch the Data:

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract housing.csv into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()  # download and extract once before loading

Listing 2: Fetch the housing data

3. Load the Data:

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

Listing 3: Load the housing data

4 Explore the Data


Objective: Understand the structure and characteristics of the data.

4.1 Instructions
1. Examine the Data:

housing.head()

Listing 4: View the first few rows of the data

• Discuss the meaning of each attribute (longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity).

2. Check for Missing Values:


housing.info()

Listing 5: Check for missing values

3. View Numerical Attribute Summary:


housing.describe()

Listing 6: Summary statistics for numerical attributes

4. Visualize the Data:

(a) Histograms:
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))
plt.show()

Listing 7: Plot histograms for numerical attributes

(b) Geographical Scatter Plot:


housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population", figsize=(10, 7),
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()

Listing 8: Visualize geographical data with a scatter plot
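It also helps to quantify how each numerical attribute relates to the target before modeling. A short sketch (assuming a pandas version that supports the numeric_only flag, so the text column ocean_proximity is skipped):

corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))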

5 Prepare the Data for Machine Learning Algorithms


Objective: Clean and transform the data to make it suitable for machine learning
algorithms.

5.1 Instructions
1. Create a Test Set:

import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)  # fixed seed so the split is reproducible across runs
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)

Listing 9: Split the data into train and test sets

2. Stratified Sampling Based on Income Category:

# Bucket median_income into five categories so sampling can be stratified on it
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Listing 10: Perform stratified sampling based on income

3. Visualize Stratified Data:


strat_test_set["income_cat"].value_counts() / len(strat_test_set)

Listing 11: Check the distribution of stratified data

4. Drop the Income Category:


for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Listing 12: Remove the income category column

5. Data Cleaning:

• Handle missing values:

from sklearn.impute import SimpleImputer

# Separate the predictors from the labels so the target cannot leak
# into the prepared feature matrix used for training.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # numerical columns only
imputer.fit(housing_num)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

Listing 13: Handle missing data using SimpleImputer

6. Handle Text and Categorical Attributes:


housing_cat = housing[["ocean_proximity"]]
# Optional: factorize() maps each category to an integer code
housing_cat_encoded, housing_categories = housing_cat["ocean_proximity"].factorize()

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # SciPy sparse matrix

Listing 14: Encode categorical attributes

7. Feature Scaling:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
housing_tr_scaled = scaler.fit_transform(housing_tr)

Listing 15: Scale the numerical features

8. Custom Transformers (Optional):


from sklearn.base import BaseEstimator, TransformerMixin

# Column indices in the numerical feature array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Listing 16: Create custom transformers for additional features

9. Transformation Pipelines:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

# Fit on the training predictors only; the labels stay in housing_labels
housing_prepared = full_pipeline.fit_transform(housing)

Listing 17: Create a data transformation pipeline

6 Select and Train a Model


Objective: Choose an appropriate machine learning model and train it.

6.1 Instructions
1. Train a Linear Regression Model:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Listing 18: Train a Linear Regression model

2. Evaluate the Model:


from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print("Linear Regression RMSE:", lin_rmse)

Listing 19: Evaluate the Linear Regression model

3. Train a Decision Tree Model:


from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

Listing 20: Train a Decision Tree model

4. Evaluate the Decision Tree Model:


housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print("Decision Tree RMSE:", tree_rmse)

Listing 21: Evaluate the Decision Tree model

A training RMSE at or near zero here does not mean the tree is perfect: it means the model has badly overfit the training data, which is why the next step evaluates with cross-validation instead.

5. Cross-Validation for Better Evaluation:


from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # scores are negated MSEs, so flip the sign

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Listing 22: Use cross-validation for model evaluation
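For a fair comparison, run the same cross-validation on the linear model. A short sketch reusing the names defined above:

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

If the tree's cross-validated RMSE turns out worse than the linear model's, that confirms the tree was overfitting the training set.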

7 Fine-tune the Model
Objective: Optimize model performance through hyperparameter tuning.

7.1 Instructions
1. Grid Search:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

Listing 23: Use Grid Search for hyperparameter tuning

2. Analyze the Best Parameters and Scores:


print(grid_search.best_params_)
print(grid_search.best_estimator_)

Listing 24: Analyze the best parameters and scores from Grid Search

3. Evaluate Feature Importance:


feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

Listing 25: Evaluate the importance of each feature

8 Present the Solution


Objective: Prepare the model for presentation and deployment.

8.1 Instructions
1. Evaluate the Model on Test Set:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# Use transform(), never fit_transform(), so the pipeline is not re-fit on test data
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final RMSE on Test Set:", final_rmse)

Listing 26: Evaluate the final model on the test set
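To convey how precise this test-set estimate is, you can attach a 95% confidence interval to the generalization error. A short sketch, assuming scipy is available:

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print("95% confidence interval for the RMSE:", interval)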

2. Document the Results:

• Prepare a report with key findings, model performance, and next steps.

3. Create Visualizations (if applicable):


import matplotlib.pyplot as plt

plt.scatter(y_test, final_predictions)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predicted vs Actual Values")
plt.plot([0, 500000], [0, 500000], color="red", linewidth=2)  # ideal y = x line
plt.show()

Listing 27: Create visualizations to present model results

9 Launch, Monitor, and Maintain the System


Objective: Deploy the model and ensure its performance in a production environment.

9.1 Instructions
1. Deployment:

• Explain how to deploy the model using Flask, FastAPI, or a cloud platform; a minimal serving sketch follows this list.
• Discuss the importance of monitoring performance and retraining as needed.

2. Monitoring:

• Implement logging and monitoring to track model performance.

3. Maintenance:

• Schedule regular maintenance checks to ensure data integrity and model accuracy.
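As one concrete (hypothetical) example, the sketch below serves predictions over HTTP with Flask and logs each prediction as a simple monitoring hook. It assumes the fitted full_pipeline and final_model were saved with joblib under the placeholder filenames full_pipeline.pkl and final_model.pkl, and that custom transformers such as CombinedAttributesAdder are importable wherever the model is loaded; adapt freely for FastAPI or a cloud platform.

import joblib
import pandas as pd
from flask import Flask, request, jsonify

# Persist once after training, e.g.:
#   joblib.dump(full_pipeline, "full_pipeline.pkl")
#   joblib.dump(final_model, "final_model.pkl")
# (the filenames are placeholders for this sketch)
pipeline = joblib.load("full_pipeline.pkl")
model = joblib.load("final_model.pkl")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # One district's raw attributes arrive as JSON; wrap in a one-row DataFrame
    record = pd.DataFrame([request.get_json()])
    prepared = pipeline.transform(record)
    prediction = float(model.predict(prepared)[0])
    app.logger.info("prediction=%.1f", prediction)  # basic monitoring hook
    return jsonify({"median_house_value": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)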

10 Conclusion
This guide has walked you through the entire process of building a machine learning
model from scratch, including data preparation, model selection, training, evaluation,
and deployment. By following these steps, you can develop a robust machine learning
solution for predicting California housing prices.

