
California Housing Project

Get the Data

This project uses the following Python libraries:

- Pandas [1]: a library for data analysis and manipulation. It provides data structures and operations for working with labeled data, such as DataFrames.
- NumPy [2]: a library for numerical computing. It provides powerful array and matrix operations.
- Matplotlib [3]: a library for creating static, animated, and interactive visualizations.
- Seaborn [4]: a library built on top of Matplotlib that provides a high-level interface for making statistical graphics.
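
As a quick orientation, here is one conventional way to import these libraries (the aliases pd, np, plt, and sns are common community conventions, not requirements of this document):

import pandas as pd              # data analysis and manipulation (DataFrames)
import numpy as np               # numerical computing (arrays, matrices)
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical graphics on top of Matplotlib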

Download the Data


Here is the function to fetch the data:
import os
import tarfile                   # used to work with tar archives (compressed files)
from six.moves import urllib     # enables downloading files from the internet

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"  # base URL for the data archive
HOUSING_PATH = os.path.join("datasets", "housing")  # local directory where the downloaded archive will be stored
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)                         # creates the directory and any necessary subdirectories
    tgz_path = os.path.join(housing_path, "housing.tgz")  # full path for the archive file within housing_path
    urllib.request.urlretrieve(housing_url, tgz_path)     # handles the download
    housing_tgz = tarfile.open(tgz_path)                  # opens the downloaded archive
    housing_tgz.extractall(path=housing_path)             # extracts all contents into housing_path (decompression)
    housing_tgz.close()
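
The discussion below assumes the archive has been fetched and the extracted housing.csv has been loaded into a DataFrame named housing. A minimal loading sketch (the helper name load_housing_data and the file name housing.csv are assumptions here, following the same pattern as the fetch function):

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")  # assumption: housing.csv is the extracted data file
    return pd.read_csv(csv_path)                          # loads the CSV into a pandas DataFrame

fetch_housing_data()             # download and extract the archive
housing = load_housing_data()    # load the CSV into a DataFrame
housing.info()                   # shows each attribute's dtype and non-null count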
All attributes are numerical except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know it must be a text attribute. When you looked at the top five rows, you probably noticed that the values in the ocean_proximity column were repetitive, which suggests it is a categorical attribute.
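
To confirm, you can count how many districts fall into each category (assuming the DataFrame is named housing, as above):

housing["ocean_proximity"].value_counts()   # number of districts per category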
Another quick way to get a feel for the type of data you are dealing with is to plot a histogram for each numerical attribute.
%matplotlib inline   # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()
There are a few things to notice in these histograms:
1. First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $30,000). Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.

2. The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system's output) to see whether this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:
a. Collect proper labels for the districts whose labels were capped.
b. Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000); a filtering sketch follows this list.
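
As a hedged sketch of option (b), assuming the observed maximum of median_house_value is the cap value (in this dataset it is roughly $500,001):

cap = housing["median_house_value"].max()        # assumption: the observed maximum is the cap
capped = housing["median_house_value"] >= cap    # boolean mask of capped districts
housing = housing[~capped]                       # drop them before splitting into training and test sets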

Create a Test Set
