Week 1 Get Familiar With Jupyter Notebook

The document outlines a Week 1 plan for familiarizing oneself with Jupyter Notebook and conducting basic data analysis using a housing dataset. It includes steps for loading the dataset, exploring its attributes, creating test and training sets, and visualizing data through scatterplots and histograms. Additionally, it discusses correlation analysis between attributes to identify predictors of median house value.

Uploaded by

unimurdoch


Week 1 Get familiar with Jupyter Notebook

Learning outcome:

1. Get familiar with Jupyter Notebook

2. Try some small exercises on the pre-work of data analysis, such as data loading, exploratory
data analysis, visualization, etc.

1 Load the housing dataset using pandas

Download housing.tgz from LMS and save it in the datasets/housing directory in your workspace

[ ]: import pandas as pd
import os
import tarfile

[ ]: HOUSING_PATH = os.path.join("datasets", "housing")

[ ]: tgz_path = os.path.join(HOUSING_PATH, "housing.tgz")

housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=HOUSING_PATH)
housing_tgz.close()
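As a variation on the cell above, the archive can be opened in a `with` block so that it is closed even if extraction fails. This is only a sketch: the helper name `extract_archive` and the throwaway `demo.tgz` archive are assumptions made for illustration and are not part of the lab material.

```python
import os
import tarfile
import tempfile


def extract_archive(tgz_path, target_dir):
    """Extract a .tgz archive into target_dir and return the extracted file names.

    Hypothetical helper, not part of the original notebook.
    """
    os.makedirs(target_dir, exist_ok=True)       # create the target directory if missing
    with tarfile.open(tgz_path) as archive:      # the archive is closed on exit
        archive.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))


# Demo on a throwaway archive so the sketch runs without housing.tgz.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    with open(csv_path, "w") as f:
        f.write("a,b\n1,2\n")
    tgz_path = os.path.join(tmp, "demo.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="demo.csv")
    extracted = extract_archive(tgz_path, os.path.join(tmp, "out"))

print(extracted)
```

For the lab itself, the same helper could be called with tgz_path and HOUSING_PATH as defined in the cells above.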

Write a small function to load the data:

[ ]: def load_housing_data(housing_path=HOUSING_PATH):
         csv_path = os.path.join(housing_path, "housing.csv")
         return pd.read_csv(csv_path)

Check the top five rows using the head() method.

Pay attention to the attributes of a new dataset. How many attributes? What are they?

[ ]: housing = load_housing_data()
housing.head()
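One way to answer the two questions above is to read the number of attributes from the shape attribute and their names from the columns attribute. The sketch below uses a tiny made-up frame (three columns with invented values), not the housing data, so the numbers it prints are not the answer to the exercise.

```python
import pandas as pd

# Toy frame with made-up values; the real housing data has more columns.
demo = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})

n_rows, n_attributes = demo.shape        # (number of rows, number of columns)
attribute_names = list(demo.columns)     # column labels in order

print(n_attributes, attribute_names)
```

Running the same two lines on housing gives the attribute count and names for the real dataset.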

Alternatively, you can use the info() method to get a quick description of the data.

What is each attribute's data type?

[ ]: housing.info()

Find out what categories exist in 'ocean_proximity', and how many districts belong to each category,
using the value_counts() method.

[ ]: housing["ocean_proximity"].value_counts()

Next, the describe() method shows a summary of the numerical attributes.

[ ]: housing.describe()

The count for total_bedrooms is 20,433, not 20,640. This is because the null values are ignored.
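The effect of null values on the count can be reproduced on a small scale. The sketch below uses a made-up column with two missing entries; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers).
demo = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, np.nan, 280.0]})

n_rows = len(demo)                                        # all rows, including nulls
n_non_null = int(demo["total_bedrooms"].count())          # count() skips nulls
n_missing = int(demo["total_bedrooms"].isna().sum())      # number of nulls
described_count = demo.describe().loc["count", "total_bedrooms"]

print(n_rows, n_non_null, n_missing, described_count)
```

The difference between the number of rows and the reported count is the number of missing values in that column.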

Let's plot a histogram for each numerical attribute to get a feel of the data we are dealing with. A
histogram shows the number of instances (on the vertical axis) that have a given value range (on
the horizontal axis).

Before we can plot anything, we need to specify which backend Matplotlib should use.

We will use Jupyter's magic command %matplotlib inline. This tells Jupyter to set up Matplotlib
so that it uses Jupyter's own backend. Plots are then rendered within the notebook itself.

[ ]: %matplotlib inline
import matplotlib.pyplot as plt

[ ]: housing.hist(bins=50, figsize=(20,15))
plt.show()
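To make concrete what a histogram counts, the sketch below bins a handful of made-up values with NumPy and prints the number of instances per value range. The values and bin edges are assumptions chosen for illustration.

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 8]     # toy values, not from the housing data
edges = [0, 2, 4, 6, 8, 10]        # bin edges: 0-2, 2-4, 4-6, 6-8, 8-10

# Each bin includes its left edge; only the last bin also includes its right edge.
counts, _ = np.histogram(values, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```

Each printed count corresponds to the height of one bar; housing.hist(bins=50) does the same with 50 value ranges per attribute.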

Do you observe any problems with these histograms?

2 Create a Test Set

The idea of creating a test set is simple: pick some instances randomly, typically 20% of the dataset
(the ratio may vary), and set them aside.

[ ]: import numpy as np

[ ]: from sklearn.model_selection import train_test_split

2.1 Random sampling

[ ]: train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
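To illustrate the idea behind random sampling, here is a hand-rolled sketch that shuffles row positions and sets the first 20% aside. It is not how train_test_split is implemented; the function name `split_train_test` and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio, seed):
    """Shuffle row positions and set aside the first test_ratio share as the test set."""
    rng = np.random.default_rng(seed)        # seeded generator, so the split is repeatable
    shuffled = rng.permutation(len(data))    # random ordering of row positions
    test_size = int(len(data) * test_ratio)
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]


demo = pd.DataFrame({"value": range(10)})    # toy data, not the housing dataset
train_demo, test_demo = split_train_test(demo, test_ratio=0.2, seed=42)
train_again, test_again = split_train_test(demo, test_ratio=0.2, seed=42)

print(len(train_demo), len(test_demo))
```

Fixing the seed plays the same role as random_state=42 above: running the split again selects the same test rows.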

2.2 Stratified sampling

We're told that the median income is essential in predicting median housing prices. So we want to
ensure that the test set is representative of the various categories of incomes in the whole dataset.

The following code uses the pd.cut() function to create an income category attribute with five
categories:

category 1: 0-1.5

category 2: 1.5-3

and so on.

[ ]: housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1,2,3,4,5])
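To see how pd.cut() assigns categories, the sketch below applies the same bins and labels to a few made-up income values. By default each bin includes its right edge, so a value of exactly 1.5 falls in category 1.

```python
import numpy as np
import pandas as pd

# Made-up median incomes, chosen to hit bin edges and interiors.
incomes = pd.Series([0.5, 1.5, 2.0, 3.0, 5.0, 7.0])

income_cat = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

print(income_cat.tolist())
```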

[ ]: housing["income_cat"].hist()

[ ]: from sklearn.model_selection import StratifiedShuffleSplit

[ ]: split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

     for train_index, test_index in split.split(housing, housing["income_cat"]):
         strat_train_set = housing.loc[train_index]
         strat_test_set = housing.loc[test_index]

Check the income category proportions in the test set

[ ]: strat_test_set["income_cat"].value_counts()/len(strat_test_set)
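To judge whether a test set is representative, its category proportions can be placed next to those of the whole dataset. The sketch below does this on made-up data and, as a stand-in for StratifiedShuffleSplit, draws 20% from each category with pandas groupby sampling. The category sizes are invented for illustration.

```python
import pandas as pd

# Made-up category sizes (1000 rows in total), not the real housing proportions.
sizes = {1: 100, 2: 300, 3: 400, 4: 150, 5: 50}
demo = pd.DataFrame({"income_cat": [c for c, n in sizes.items() for _ in range(n)]})

# Stand-in for a stratified split: sample 20% within each category.
strat_test = demo.groupby("income_cat").sample(frac=0.2, random_state=42)

overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "stratified_test": in_test})

print(comparison)
```

The same side-by-side table can be built for housing and strat_test_set; the closer the two columns, the more representative the test set.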

Now we need to delete the 'income_cat' attribute, so the data is back to its original state.

[ ]: for set_ in (strat_train_set, strat_test_set):
         set_.drop("income_cat", axis=1, inplace=True)

3 Discover the dataset - visualization

Next we will explore only the training set and put the test set aside. Let's create a copy so
that the following procedures will not harm the training set:

[ ]: housing = strat_train_set.copy()
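The sketch below shows why working on a copy protects the original: changes made to the copy do not appear in the frame it was copied from. The frame and its values are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"median_income": [1.0, 2.0, 3.0]})   # toy stand-in for the training set

work = train.copy()                                        # independent copy of the data
work["median_income"] = work["median_income"] * 10         # modify the copy only

print(train["median_income"].tolist())
print(work["median_income"].tolist())
```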

Let's first create a scatterplot of all districts to visualize the data.

[ ]: housing.plot(kind="scatter", x="longitude", y="latitude")

We can observe an overplotting issue, which makes it difficult to see individual data points in the
visualization.

We can adjust the alpha option to make the visualization better reflect the high density of data
points.

[ ]: housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
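A rough way to see why a low alpha reveals density: if each marker is blended over what is already drawn, n markers stacked at the same spot reach an opacity of 1 - (1 - alpha)^n. The sketch below evaluates this idealized model; it assumes exact overlap and standard alpha blending, and the helper name `stacked_opacity` is made up for illustration.

```python
def stacked_opacity(alpha, n_points):
    """Opacity of n_points identical markers drawn on top of each other (idealized model)."""
    return 1 - (1 - alpha) ** n_points


# With alpha=0.1, a lone point stays faint while dense areas become dark.
for n in (1, 5, 10, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

Under this model, sparse areas remain pale and only crowded areas approach full opacity, which is the contrast the alpha=0.1 plot relies on.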

4 Correlations between attributes

We can easily compute the standard correlation coefficient between every pair of attributes.

For example, let's check how much each attribute correlates with the median house value:

[ ]: numeric_features = housing.select_dtypes(include=['number'])
     corr_matrix = numeric_features.corr()
     corr_matrix["median_house_value"].sort_values(ascending=False)

How to interpret the results?
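As a guide to reading the coefficients, the sketch below computes the Pearson correlation (the default used by corr()) for three made-up relationships: a rising line, a falling line, and a symmetric curve. The data are invented for illustration.

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)                  # toy values from -5 to 5

r_linear_up = np.corrcoef(x, 2 * x + 1)[0, 1]      # perfectly linear, increasing
r_linear_down = np.corrcoef(x, -3 * x)[0, 1]       # perfectly linear, decreasing
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]         # strong but non-linear relationship

print(r_linear_up, r_linear_down, r_quadratic)
```

Values close to 1 or -1 indicate a strong linear relationship, while a value near 0 only rules out a linear one: the quadratic case is fully determined by x yet has a coefficient of about 0.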

Alternatively, we can check for correlations between attributes using the pandas scatter_matrix()
function.

Let's focus on a few promising attributes that seem most correlated with the median housing value.

[ ]: from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))

Which attributes seem to be more predictive of the median house value?

** Disclaimer: The above code is modified from the textbook Hands-On Machine Learning with Scikit-Learn,
Keras & TensorFlow.
