Closed-Form Linear Regression on the California Housing Dataset

This document shows code for loading and preprocessing the California housing dataset using scikit-learn. It splits the data into training and test sets, adds an intercept term, fits a linear regression model using the closed-form solution, and calculates the mean squared error on the test set. Key steps:

1. Load the California housing dataset and separate features (X) and target (y)
2. Standardize the features
3. Split into training and test sets
4. Add an intercept term to the training and test features
5. Compute the weights using the closed-form linear regression solution (shown below)
6. Calculate the mean squared error on the test set
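The closed-form solution referenced in step 5 is the ordinary least squares normal equation, w = (X^T X)^(-1) X^T y, which the closed_form function later in the notebook implements directly as inv(X.T @ X) @ X.T @ y.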


In [6]: import pandas as pd
        import numpy as np
        from sklearn import datasets

        housing = datasets.fetch_california_housing()
        housing

Out[6]: {'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
                   37.88      , -122.23      ],
                 [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
                   37.86      , -122.22      ],
                 [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
                   37.85      , -122.24      ],
                 ...,
                 [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
                   39.43      , -121.22      ],
                 [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
                   39.43      , -121.32      ],
                 [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
                   39.37      , -121.24      ]]),
         'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
         'frame': None,
         'target_names': ['MedHouseVal'],
         'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                           'Population', 'AveOccup', 'Latitude', 'Longitude'],
         'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n
         --------------------------\n\n**Data Set Characteristics:**\n\n
         :Number of Instances: 20640\n\n
         :Number of Attributes: 8 numeric, predictive attributes and the target\n\n
         :Attribute Information:\n
             - MedInc        median income in block group\n
             - HouseAge      median house age in block group\n
             - AveRooms      average number of rooms per household\n
             - AveBedrms     average number of bedrooms per household\n
             - Population    block group population\n
             - AveOccup      average number of household members\n
             - Latitude      block group latitude\n
             - Longitude     block group longitude\n\n
         :Missing Attribute Values: None\n\n
         This dataset was obtained from the StatLib repository.\n
         https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\n
         The target variable is the median house value for California districts,\n
         expressed in hundreds of thousands of dollars ($100,000).\n\n
         This dataset was derived from the 1990 U.S. census, using one row per census\n
         block group. A block group is the smallest geographical unit for which the U.S.\n
         Census Bureau publishes sample data (a block group typically has a population\n
         of 600 to 3,000 people).\n\n
         An household is a group of people residing within a home. Since the average\n
         number of rooms and bedrooms in this dataset are provided per household, these\n
         columns may take surpinsingly large values for block groups with few households\n
         and many empty houses, such as vacation resorts.\n\n
         It can be downloaded/loaded using the\n
         :func:`sklearn.datasets.fetch_california_housing` function.\n\n
         .. topic:: References\n\n
             - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n
               Statistics and Probability Letters, 33 (1997) 291-297\n'}
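The object returned by fetch_california_housing is a scikit-learn Bunch, so its fields can be read either as dictionary keys (housing['data']) or as attributes (housing.data), which is why the next cell uses attribute access.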

In [11]: X = housing.data
         y = housing.target
         X.shape, y.shape

Out[11]: ((20640, 8), (20640,))
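pandas is imported in the first cell but never used. If a DataFrame is preferred, fetch_california_housing accepts an as_frame=True flag in recent scikit-learn releases; a minimal sketch:

In [ ]: # optional: the same data as a pandas DataFrame (scikit-learn >= 0.23)
        housing_df = datasets.fetch_california_housing(as_frame=True)
        housing_df.frame.head()   # 8 feature columns plus the MedHouseVal target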

In [12]: from sklearn.preprocessing import StandardScaler

         scaler = StandardScaler()
         X = scaler.fit_transform(X)   # standardize: zero mean, unit variance per feature
         X

Out[12]: array([[ 2.34476576,  0.98214266,  0.62855945, ..., -0.04959654,
                  1.05254828, -1.32783522],
                [ 2.33223796, -0.60701891,  0.32704136, ..., -0.09251223,
                  1.04318455, -1.32284391],
                [ 1.7826994 ,  1.85618152,  1.15562047, ..., -0.02584253,
                  1.03850269, -1.33282653],
                ...,
                [-1.14259331, -0.92485123, -0.09031802, ..., -0.0717345 ,
                  1.77823747, -0.8237132 ],
                [-1.05458292, -0.84539315, -0.04021111, ..., -0.09122515,
                  1.77823747, -0.87362627],
                [-0.78012947, -1.00430931, -0.07044252, ..., -0.04368215,
                  1.75014627, -0.83369581]])

In [13]: from sklearn.model_selection import train_test_split

         # note: no random_state is set, so the split (and the final MSE) varies between runs
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

         print(len(X_train))
         print(len(X_test))

15480
5160
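One caveat: the scaler above was fit on all 20,640 rows before the split, so test-set statistics leak into the preprocessing. A leakage-free variant (a sketch of the usual practice, not what this notebook ran) fits the scaler on the training split only:

In [ ]: # leakage-free ordering: split first, then fit the scaler on train only
        X_raw, y_raw = housing.data, housing.target
        X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.25)
        scaler = StandardScaler().fit(X_tr)          # statistics from train only
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)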

In [14]: # Add intercept
         #
         # The closed-form formula expects X with shape (m, n+1) -- one row per
         # sample, with a leading column of ones so that w0 acts as the intercept.
         # The weight vector w then has shape (n+1,), e.g.:
         #
         #   X @ w = [1  2  3]   [w0]
         #           [1  4  6] @ [w1]
         #           [1  9  1]   [w2]
         #           [1 10  2]

         # column of ones with shape (m, 1) for the training set
         intercept = np.ones((X_train.shape[0], 1))
         # concatenate along axis=1 (as a new first column)
         X_train = np.concatenate((intercept, X_train), axis=1)

         # same for the test set (see the helper sketched after this cell)
         intercept = np.ones((X_test.shape[0], 1))
         X_test = np.concatenate((intercept, X_test), axis=1)
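The two concatenations differ only in the array they act on; a small helper (hypothetical, not part of the original notebook) removes the duplication:

In [ ]: # hypothetical helper: prepend a bias column of ones to a design matrix
        def add_intercept(X):
            return np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

        # equivalent to the two blocks above:
        # X_train = add_intercept(X_train)
        # X_test  = add_intercept(X_test)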

In [17]: X_train

Out[17]: array([[ 1.        , -0.51588775, -1.00430931, ..., -0.1006962 ,
                 -1.30243016,  1.33752281],
                [ 1.        ,  0.53960528,  1.61780729, ..., -0.05983722,
                 -0.74060628,  0.59381804],
                [ 1.        , -0.14247524,  1.14105882, ..., -0.02441623,
                  0.95891097, -1.27792215],
                ...,
                [ 1.        ,  1.44860733, -0.68647699, ...,  0.02829722,
                  0.83718246, -1.14315686],
                [ 1.        , -1.12969705, -0.60701891, ..., -0.03598869,
                  1.55350791, -0.18981719],
                [ 1.        ,  0.40464198, -0.52756083, ..., -0.00465543,
                  1.41773381, -0.74884359]])

In [18]: from numpy.linalg import inv

         # Matrix multiplication is associative, so the grouping of the products
         # below does not matter -- but operand order does (keep y last; don't
         # move it before X.T, for example).
         def closed_form(X, y):
             return inv(X.T @ X) @ X.T @ y
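inv(X.T @ X) can be numerically fragile when X^T X is ill-conditioned; more stable solvers for the same least-squares problem (a sketch, equivalent in exact arithmetic) are np.linalg.lstsq and the pseudoinverse:

In [ ]: # numerically stabler ways to solve the same least-squares problem
        theta_lstsq, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        theta_pinv = np.linalg.pinv(X_train) @ y_train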

In [19]: # use closed_form to find theta
         theta = closed_form(X_train, y_train)
         theta   # <-- this is our model

Out[19]: array([ 2.06922803,  0.83307095,  0.11525569, -0.28134176,  0.30252723,
                -0.00705287, -0.04216411, -0.88801746, -0.85760284])

In [20]: # compute predictions on the test set
         yhat = X_test @ theta   # X (m, n+1) @ w (n+1,) ==> yhat (m,)

         # yhat and y_test must have the same shape before they can be compared
         assert y_test.shape == yhat.shape

In [21]: # get the mse
         mse = ((y_test - yhat) ** 2).sum() / X_test.shape[0]
         print("Mean squared error:", mse)

Mean squared error: 0.5289323658169676
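The manual sum-and-divide above matches np.mean and scikit-learn's built-in metric; a quick sanity check (a sketch):

In [ ]: # equivalent computations of the same MSE
        from sklearn.metrics import mean_squared_error
        assert np.isclose(mse, np.mean((y_test - yhat) ** 2))
        assert np.isclose(mse, mean_squared_error(y_test, yhat))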
