Regression with scikit-learn
▪ introduction
▪ installation/distribution
▪ essential/auxiliary libraries
▪ usage
scikit-learn
introduction ---
scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language.
▪ free
▪ open-source
▪ constantly being developed and improved
▪ an active user community
▪ state-of-the-art machine learning algorithms
▪ provides good documentation
▪ widely used in industry and academia
▪ a wealth of tutorials and code snippets are available online
▪ works well with many scientific Python tools
installation ---
1. Anaconda (recommended): free
2. Enthought Canopy: not free
▪ scikit-learn can also be installed independently
anaconda ---
▪ a Python distribution for large-scale data processing, predictive analytics, and scientific computing
⇛ comes with:
▪ NumPy
▪ SciPy
▪ matplotlib
▪ pandas
▪ IPython
▪ Jupyter Notebook
▪ scikit-learn
⇛ available on:
▪ Mac OS
▪ Windows
libraries ---
essentially required, or increase the effectiveness of scikit-learn:
⇛ NumPy
⇛ SciPy
⇛ Jupyter Notebook
⇛ matplotlib
⇛ pandas

Jupyter Notebook
• provides an interactive environment
• runs code in the browser
• a great tool for exploratory data analysis
• widely used by data scientists
• supports many programming languages

NumPy
• one of the fundamental packages for scientific computing
• provides functionality for:
  • multidimensional arrays
  • high-level mathematical functions, e.g.,
    • linear algebra operations
    • Fourier transforms
    • pseudorandom number generators
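Each of these NumPy capabilities fits in a couple of lines; a minimal sketch (the concrete numbers are made up for illustration):

```python
import numpy as np

# multidimensional array
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.shape)                    # (2, 2)

# linear algebra: solve the system a @ x = b
b = np.array([5.0, 11.0])
x = np.linalg.solve(a, b)
print(x)                          # [1. 2.]

# pseudorandom number generation
rng = np.random.default_rng(42)
print(rng.normal(size=3))         # three draws from a standard normal
```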
NumPy, SciPy :: strengths :: (figure)
SciPy
• a collection of functions for scientific computing
• provides, among other functionality:
  • mathematical function optimization
  • signal processing
  • special mathematical functions
  • statistical distributions
• scikit-learn draws from SciPy's collection of functions for implementing its algorithms
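Two of the listed areas, optimization and statistical distributions, in a minimal sketch (the quadratic objective is made up for illustration):

```python
from scipy import optimize, stats

# mathematical function optimization: minimize (x - 2)^2
res = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print(res.x)                      # close to 2.0

# statistical distributions: the standard normal CDF at 0
print(stats.norm.cdf(0.0))        # 0.5
```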
matplotlib
• the primary scientific plotting library in Python
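A minimal plotting sketch; the Agg backend and the output filename `sine.png` are choices made here so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")             # non-interactive backend, runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()
fig.savefig("sine.png")           # write the figure to a file
```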
pandas
• a Python library for data wrangling and analysis
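A small wrangling sketch; the column names and values below are hypothetical, loosely echoing the housing table used later:

```python
import pandas as pd

# a small table of house records (values made up for illustration)
df = pd.DataFrame({
    "rooms": [6.5, 6.4, 7.1],
    "age":   [65.2, 78.9, 61.1],
    "price": [24.0, 21.6, 34.7],
})
print(df.describe())              # summary statistics per column
print(df[df["price"] > 22.0])     # boolean filtering of rows
```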
Fitting the Linear Regression Model
▪ training data: 𝜏 = {(𝑥⁽ⁱ⁾, 𝑦⁽ⁱ⁾)}, 𝑖 = 1, …, 𝑚, with 𝑥⁽ⁱ⁾ ∈ ℝⁿ and 𝑦⁽ⁱ⁾ ∈ ℝ
▪ model: ŷ = w₀ + w₁x₁ + w₂x₂ + ⋯ + wₙxₙ
▪ model parameters: w₀, w₁, w₂, …, wₙ
▪ intercept: w₀
▪ coefficients: w₁, w₂, …, wₙ
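After fitting, scikit-learn's LinearRegression exposes w₀ as `intercept_` and w₁, …, wₙ as `coef_`; a minimal sketch on noiseless synthetic data (the true parameters 3.0, 2.0, -1.0 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data from y = 3 + 2*x1 - 1*x2 with no noise,
# so the fit should recover these parameters exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_)           # w0, close to 3.0
print(model.coef_)                # [w1, w2], close to [2.0, -1.0]
```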
the Boston data
• The Boston house-price data of Harrison, D. and Rubinfeld, D. L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
▪ also used in 'Regression Diagnostics: Identifying Influential Data and Sources of Collinearity'; the question: what influences housing prices in Boston?
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33 36.2
0.02985 0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.6 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.9 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100 6.0821 5 311 15.2 386.63 29.93 16.5
the Boston housing example
▪ in the table above, the columns CRIM … LSTAT are the feature values 𝑥₁⁽ⁱ⁾, 𝑥₂⁽ⁱ⁾, …, 𝑥₁₃⁽ⁱ⁾, and MEDV is the target 𝑦⁽ⁱ⁾
Steps ---
▪ import the dataset loader
▪ create the loader object
▪ explore/understand the data
  ▪ shape of the data (#rows = training examples, #columns = features)
  ▪ description (DESCR)
  ▪ feature names/values
  ▪ target names/values
  ▪ file path
  ▪ etc.
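The loading and exploring steps above can be sketched as follows. Note that `load_boston` was removed in scikit-learn 1.2, so `load_diabetes` (another bundled regression dataset) stands in here; the steps are identical:

```python
from sklearn.datasets import load_diabetes   # import the dataset loader

data = load_diabetes()                       # create the loader object

# explore/understand the data
print(data.data.shape)        # (#rows/#training examples, #columns/#features)
print(data.feature_names)     # names of the features
print(data.target[:5])        # first few target values
print(data.DESCR[:200])       # start of the description of the data
```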
training ---
▪ split the data into training (75%) and test (25%) sets
▪ import the model
▪ fit the model to the data
▪ test the model
▪ predict
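The training steps above, as a minimal sketch; again `load_diabetes` stands in for the Boston loader removed from modern scikit-learn, and the workflow is the same for any regression dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression    # import the model
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# split into training (75%) and test (25%) sets (the default split)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)     # fit the model
print(model.score(X_test, y_test))                   # test: R^2 on held-out data
print(model.predict(X_test[:3]))                     # predict for new examples
```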
the iris data
▪ data about 150 iris flowers to be classified into 3 varieties
▪ features: sepal length, sepal width, petal length, petal width; target: species
▪ size: 150 × (4 + 1)
Steps ---
1. load the data
2. explore the data
3. split into training and validation subsets
4. import the optimizer
5. fit to the data (derive the model)
6. check the accuracy of the model on the data
7. predict with the model derived
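The seven iris steps can be sketched as follows; the slides do not name the estimator ("the optimizer"), so `LogisticRegression` is one reasonable choice used here:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression  # 4. one choice of estimator
from sklearn.model_selection import train_test_split

iris = load_iris()                                   # 1. load the data
print(iris.data.shape, iris.target_names)            # 2. explore: (150, 4), 3 species

X_train, X_val, y_train, y_val = train_test_split(   # 3. split into training and
    iris.data, iris.target, random_state=0)          #    validation subsets

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                            # 5. fit to the data
print(clf.score(X_val, y_val))                       # 6. accuracy on validation data
print(clf.predict(X_val[:3]))                        # 7. predict with the model
```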
end