Week 1: Get Familiar With Jupyter Notebook
Learning outcomes:
2. Try some small exercises on the preliminary steps of data analysis, such as data loading,
exploratory data analysis, and visualization.
Download housing.tgz from LMS and save it in the datasets/housing directory in your workspace.
[ ]: import pandas as pd
import os
import tarfile
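The tarfile import above suggests the archive is unpacked inside the notebook, and the next cell relies on a HOUSING_PATH constant. A minimal sketch of that setup, assuming housing.tgz was saved in datasets/housing as instructed (the helper name extract_housing_data is an assumption, not from the original):
[ ]: HOUSING_PATH = os.path.join("datasets", "housing")

def extract_housing_data(housing_path=HOUSING_PATH):
    # Unpack housing.csv from the downloaded archive into the same directory
    tgz_path = os.path.join(housing_path, "housing.tgz")
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

extract_housing_data()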
[ ]: def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
Pay attention to the attributes of a new dataset. How many attributes are there? What are they?
[ ]: housing = load_housing_data()
housing.head()
Alternatively, you can use the info() method to get a quick description of the data.
[ ]: housing.info()
Find out what categories exist in ocean_proximity, and how many districts belong to each category,
using the value_counts() method.
[ ]: housing["ocean_proximity"].value_counts()
[ ]: housing.describe()
Note that total_bedrooms has a count of 20,433, not 20,640. This is because describe() ignores null values.
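To confirm this, one quick check (a sketch; this cell is not in the original notebook) counts the missing values directly:
[ ]: housing["total_bedrooms"].isnull().sum()  # 207 missing values (20,640 - 20,433)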
Let's plot a histogram for each numerical attribute to get a feel of the data we are dealing with. A
histogram shows the number of instances (on the vertical axis) that have a given value range (on
the horizontal axis).
Before we can plot anything, we need to specify which backend Matplotlib should use.
We will use Jupyter's magic command %matplotlib inline. This tells Jupyter to set up Matplotlib
so that it uses Jupyter's own backend; plots are then rendered within the notebook itself.
[ ]: %matplotlib inline
import matplotlib.pyplot as plt
[ ]: housing.hist(bins=50, figsize=(20,15))
plt.show()
The idea of creating a test set is simple: pick some instances randomly, typically 20% of the dataset
(the ratio may vary), and set them aside.
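A minimal sketch of such a random split, using scikit-learn's train_test_split (assuming scikit-learn is available; the fixed random_state is an assumption added for reproducibility):
[ ]: from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)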
[ ]: import numpy as np
The following code uses the pd.cut() function to create an income category attribute with five
categories:
category 1: 0-1.5
category 2: 1.5-3
and so on
[ ]: housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1,2,3,4,5])
[ ]: housing["income_cat"].hist()
[ ]: strat_test_set["income_cat"].value_counts()/len(strat_test_set)
Now we need to delete the income_cat attribute, so the data is back to its original state.
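A minimal sketch of that cleanup, dropping the column from both splits (the in-place drop is an assumption about how the notebook does it):
[ ]: for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)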
Next we will explore only the training set and put the test set aside. Let's create a copy so
that the following procedures will not harm the training set:
[ ]: housing = strat_train_set.copy()
We can observe an overplotting issue, making it difficult to see individual data points in a data
visualization.
We can adjust the alpha option to make the visualization better reflect the high density of data
points.
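The plot itself is not shown in this extract. A minimal sketch, assuming the usual longitude/latitude scatter plot of the districts, with a low alpha so dense areas stand out:
[ ]: housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()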
We can easily compute the standard correlation coefficient between every pair of attributes.
For example, let's check how much each attribute correlates with the median house value:
[ ]: numeric_features = housing.select_dtypes(include=['number'])
corr_matrix = numeric_features.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
How to interpret the results?
Alternatively, we can check for correlations between attributes using the pandas scatter_matrix()
function. Which attributes seem to be most predictive of the median house value?
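A minimal sketch of that call (the choice of attributes here is an assumption; pick the ones most correlated with median_house_value from the output above):
[ ]: from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()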