Week 1 - Introduction to Machine Learning and Toolkit
Overview of Course
Topics include:
Introduction and exploratory analysis (Week 1)
Supervised machine learning (Weeks 2–10)
Unsupervised machine learning (Weeks 11–12)
Each week:
Lecture
Exercises with solutions
Time commitment: ~3 hours per week
Our Toolset: Intel® Distribution for Python
Accelerated performance from Intel's Math Kernel Library (MKL)
Also contains Data Analytics Acceleration Library (DAAL), Message
Passing Interface (MPI), and Threading Building Blocks (TBB)
INSTALLATION OPTIONS
Monolithic distribution: software.intel.com/intel-distribution-for-python
Anaconda package manager: software.intel.com/articles/using-intel-distribution-for-python-with-anaconda
Seaborn is also required: conda install seaborn
Our Toolset: Intel® Distribution for Python
Jupyter notebooks:
Interactive Coding and Visualization of Output
Matplotlib, Seaborn:
Data Visualization
Scikit-learn:
Machine Learning (Weeks 2–12)
Jupyter Notebook
Introduction to Jupyter Notebook
Polyglot analysis environment that blends multiple languages
The name Jupyter refers to its three core languages: Julia, Python, and R
Supports multiple content types: code, narrative text, images, movies, etc.
Source: http://jupyter.org/
Introduction to Jupyter Notebook
HTML & Markdown
LaTeX (equations)
Code
Source: http://jupyter.org/
Introduction to Jupyter Notebook
Code is divided into cells to control execution
Enables interactive development
Ideal for exploratory analysis and model building
Jupyter Cell Magics
%matplotlib inline: display plots inline in the Jupyter notebook
%%timeit: time how long a cell takes to execute
%run filename.ipynb: execute code from another notebook or Python file
%load filename.py: copy the contents of the file and paste them into the cell
Jupyter Keyboard Shortcuts
Making Jupyter Notebooks Reusable
To extract Python code from a Jupyter notebook:
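The command shown on the original slide is not preserved in this text. The usual route is the `jupyter nbconvert --to script` command-line tool. As a rough sketch of what the extraction involves, the snippet below uses only the standard library and relies on a notebook file being JSON with a top-level `cells` list. The helper name `extract_code` and the small in-memory notebook are hypothetical, for illustration only.

```python
import json


def extract_code(notebook_json: str) -> str:
    """Return the source of all code cells, separated by blank lines."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be one string or a list of line strings
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n\n".join(chunks)


# Hypothetical two-cell notebook (nbformat 4 layout), built in memory
demo_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["import pandas as pd\n", "print(pd.__version__)"]},
    ],
})

code = extract_code(demo_notebook)
print(code)
```

With a real notebook, the JSON text would come from reading the .ipynb file rather than from `json.dumps`.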
pandas
Introduction to Pandas
Library for computation with tabular data
Mixed types of data allowed in a single table
Columns and rows of data can be named
Advanced data aggregation and statistical functions
Source: http://pandas.pydata.org/
Introduction to Pandas
Basic data structures
Vector: Series (1 dimension)
Array: DataFrame (2 dimensions)
Pandas Series Creation and Indexing
Use data from step tracking application to create a Pandas Series
CODE OUTPUT
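import pandas as pd
# Daily step counts (the Walking values printed in the column-indexing output)
step_data = [3620, 7891, 9761, 3907, 4338, 5373]
step_counts = pd.Series(step_data,
                        name='steps')
print(step_counts)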
Pandas Series Creation and Indexing
Add a date range to the Series
CODE OUTPUT
step_counts.index = pd.date_range('20150329',
                                  periods=6)
print(step_counts)
Pandas Series Creation and Indexing
Select data by the index values
CODE OUTPUT
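The code for this slide is not preserved in this text. A minimal sketch of selecting from a date-indexed Series, assuming the step counts match the Walking values printed elsewhere in this deck:

```python
import pandas as pd

# Assumed step counts (same values as the Walking column in this deck)
step_data = [3620, 7891, 9761, 3907, 4338, 5373]
step_counts = pd.Series(step_data, name='steps')
step_counts.index = pd.date_range('20150329', periods=6)

# Select a single day by its date label
print(step_counts.loc['2015-04-01'])

# Select the same entry by integer position (fourth value)
print(step_counts.iloc[3])

# Select every day in a month with a partial date string
print(step_counts.loc['2015-04'])
```

Label-based selection uses `loc` and position-based selection uses `iloc`, which keeps the two kinds of lookup unambiguous.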
Pandas Data Types and Imputation
Data types can be viewed and converted
CODE OUTPUT
# View the data type
print(step_counts.dtype)
# Convert to a float (the np.float alias was removed in NumPy 1.24)
step_counts = step_counts.astype(float)
Pandas Data Types and Imputation
Invalid data points can be easily filled with values
CODE OUTPUT
print(step_counts[1:3])
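The step that creates and fills the invalid values is not preserved in this text. One plausible version, assuming two readings are marked as missing and then replaced with zero using `fillna`:

```python
import numpy as np
import pandas as pd

step_counts = pd.Series([3620, 7891, 9761, 3907, 4338, 5373],
                        index=pd.date_range('20150329', periods=6),
                        name='steps', dtype=float)

# Mark two readings as missing (hypothetical sensor dropouts)
step_counts.iloc[1:3] = np.nan

# Replace missing readings with zero; fillna returns a new Series
filled = step_counts.fillna(0.0)
print(filled.iloc[1:3])
```

Zero is only one possible fill value; a mean or an interpolated value may suit other data better.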
Pandas DataFrame Creation and Methods
DataFrames can be created from lists, dictionaries, and Pandas Series
CODE OUTPUT
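# Cycling distance
cycling_data = [10.7, 0, None, 2.4, 15.3,
                10.9, 0, None]
# Pair step counts with cycling distances
# (assumed definition; zip stops at the shorter list)
joined_data = list(zip(step_data, cycling_data))
# The dataframe
activity_df = pd.DataFrame(joined_data)
print(activity_df)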
Pandas DataFrame Creation and Methods
Labeled columns and an index can be added
CODE OUTPUT
print(activity_df)
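The code that adds the labels is not preserved in this text. A minimal sketch, assuming the column names are passed at construction time. 'Walking' appears in this deck's output, while 'Cycling' is an assumed label:

```python
import pandas as pd

step_data = [3620, 7891, 9761, 3907, 4338, 5373]
cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]

# zip stops at the shorter list, so only six rows are built
joined_data = list(zip(step_data, cycling_data))

# Supply a date index and column labels when creating the DataFrame
activity_df = pd.DataFrame(joined_data,
                           index=pd.date_range('20150329', periods=6),
                           columns=['Walking', 'Cycling'])
print(activity_df)
```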
Indexing DataFrame Rows
DataFrame rows can be selected by label with 'loc' or by integer position with 'iloc'
CODE OUTPUT
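The code for this slide is not preserved in this text. A small sketch of the two row accessors, using a stand-in `activity_df` with an assumed 'Cycling' column:

```python
import pandas as pd

activity_df = pd.DataFrame(
    {'Walking': [3620, 7891, 9761, 3907, 4338, 5373],
     'Cycling': [10.7, 0.0, None, 2.4, 15.3, 10.9]},
    index=pd.date_range('20150329', periods=6))

# loc selects by index label (here, a date)
by_label = activity_df.loc['2015-04-01']

# iloc selects by integer position; -1 is the last row
by_position = activity_df.iloc[-1]

print(by_label)
print(by_position)
```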
Indexing DataFrame Columns
DataFrame columns can be indexed by name
CODE OUTPUT
# Name of column
print(activity_df['Walking'])
>>> 2015-03-29    3620
    2015-03-30    7891
    2015-03-31    9761
    2015-04-01    3907
    2015-04-02    4338
    2015-04-03    5373
    Freq: D, Name: Walking, dtype: int64
Indexing DataFrame Columns
DataFrame columns can also be indexed as properties
CODE OUTPUT
# Object-oriented approach
print(activity_df.Walking)
Indexing DataFrame Columns
DataFrame columns can be indexed by integer
CODE OUTPUT
# First column
print(activity_df.iloc[:,0])
Reading Data with Pandas
CSV and other common filetypes can be read with a single command
CODE OUTPUT
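The command itself is not preserved in this text. A minimal sketch using `pd.read_csv`, with an in-memory stand-in for the file so the example runs on its own. The column names follow the iris output shown elsewhere in this deck, and the three rows are sample values:

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file, pass its path instead
csv_text = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

data = pd.read_csv(csv_text)
print(data.iloc[:2])
```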
Assigning New Data to a DataFrame
Data can be (re)assigned to a DataFrame column
CODE OUTPUT
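The code for this slide is not preserved in this text. A short sketch of creating and then overwriting a column, where the `sepal_area` column name and the sample rows are assumptions:

```python
import pandas as pd

data = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width': [3.5, 3.2, 3.3],
    'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Create a new column from two existing ones (hypothetical column name)
data['sepal_area'] = data.sepal_length * data.sepal_width

# Reassign the same column, here rounding to one decimal place
data['sepal_area'] = data['sepal_area'].round(1)
print(data)
```

Assignment needs the bracket form: `data.sepal_area = ...` would not create a new column.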
Applying a Function to a DataFrame Column
Functions can be applied to columns or rows of a DataFrame or Series
CODE OUTPUT
print(data.iloc[:5, -3:])
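The print call above only displays the result; the step that applies the function is not preserved in this text. One way it could look, assuming the goal is to shorten the species names into a hypothetical `abbrev` column:

```python
import pandas as pd

data = pd.DataFrame({
    'petal_width': [0.2, 1.4, 2.5],
    'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Apply a function to every value in the species column
data['abbrev'] = data.species.apply(lambda name: name.replace('Iris-', ''))
print(data)
```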
Concatenating Two DataFrames
Two DataFrames can be concatenated along either dimension
CODE OUTPUT
print(small_data.iloc[:,-3:])
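How `small_data` was built is not preserved in this text. A sketch of `pd.concat` along both dimensions, assuming `small_data` joins the first and last rows of the table (sample values only):

```python
import pandas as pd

data = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': (['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2
                + ['Iris-virginica'] * 2),
})

# Stack the first two and last two rows (axis=0 is the default)
small_data = pd.concat([data.iloc[:2], data.iloc[-2:]])
print(small_data)

# Place two column subsets side by side instead
side_by_side = pd.concat([data[['petal_length']], data[['species']]], axis=1)
print(side_by_side.shape)
```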
Aggregated Statistics with GroupBy
The groupby method calculates aggregated DataFrame statistics
CODE OUTPUT
print(group_sizes)
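The definition of `group_sizes` is not preserved in this text. A minimal sketch, assuming it counts the rows per species, with a per-group mean added for comparison (sample values only):

```python
import pandas as pd

data = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica'],
})

# Number of rows in each species group
group_sizes = data.groupby('species').size()
print(group_sizes)

# Mean of each numeric column within each group
group_means = data.groupby('species').mean()
print(group_means)
```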
Performing Statistical Calculations
Pandas contains a variety of statistical methods—mean, median, and mode
CODE OUTPUT
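The code for this slide is not preserved in this text. A small sketch of the three methods on a single column, using a few sample petal lengths:

```python
import pandas as pd

petal_length = pd.Series([1.4, 1.4, 4.7, 4.5, 6.0], name='petal_length')

print(petal_length.mean())    # arithmetic mean
print(petal_length.median())  # middle value of the sorted data
print(petal_length.mode())    # most frequent value(s), returned as a Series
```

`mode` returns a Series because more than one value can tie for most frequent.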
Performing Statistical Calculations
Standard deviation, variance, SEM, and quantiles can also be calculated
CODE OUTPUT
# As well as quantiles
# (numeric_only=True skips the text-valued species column)
print(data.quantile(0, numeric_only=True))
>>> sepal_length    4.3
    sepal_width     2.0
    petal_length    1.0
    petal_width     0.1
    Name: 0, dtype: float64
Performing Statistical Calculations
Multiple calculations can be presented in a DataFrame
CODE OUTPUT
print(data.describe())
Sampling from DataFrames
Rows can be randomly sampled from a DataFrame
CODE OUTPUT
print(sample.iloc[:,-3:])
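How `sample` was drawn is not preserved in this text. A sketch using `DataFrame.sample`, where the sample size and seed are assumptions chosen for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'species': (['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2
                + ['Iris-virginica'] * 2),
})

# Draw three distinct rows; random_state makes the draw repeatable
sample = data.sample(n=3, replace=False, random_state=42)
print(sample)
```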
Visualization Libraries
Visualizations can be created in multiple ways:
Matplotlib
Pandas (via Matplotlib)
Seaborn
– Statistically-focused plotting methods
– Global preferences incorporated by Matplotlib
Basic Scatter Plots with Matplotlib
Scatter plots can be created from Pandas Series
CODE OUTPUT
import matplotlib.pyplot as plt
plt.plot(data.sepal_length,
         data.sepal_width,
         ls='', marker='o')
Basic Scatter Plots with Matplotlib
Multiple layers of data can also be added
CODE OUTPUT
plt.plot(data.sepal_length,
         data.sepal_width,
         ls='', marker='o',
         label='sepal')
plt.plot(data.petal_length,
         data.petal_width,
         ls='', marker='o',
         label='petal')
# Draw the legend so the labels appear on the plot
plt.legend()
Histograms with Matplotlib
Histograms can be created from Pandas Series
CODE OUTPUT
plt.hist(data.sepal_length, bins=25)
Customizing Matplotlib Plots
Every feature of Matplotlib plots can be customized
CODE OUTPUT
import numpy as np
fig, ax = plt.subplots()
ax.barh(np.arange(10),
        data.sepal_width.iloc[:10])
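The customization steps that followed the bar chart are not preserved in this text. A possible continuation, assuming the ticks, axis labels, and title are adjusted through the Axes object. The label text and the ten sample widths are made up for illustration, and the Agg backend is selected only so the sketch runs outside a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for data.sepal_width.iloc[:10]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]

fig, ax = plt.subplots()
ax.barh(np.arange(10), sepal_width)

# Adjust tick positions, tick labels, axis labels, and the title
ax.set_yticks(np.arange(10))
ax.set_yticklabels(np.arange(1, 11))
ax.set(xlabel='Sepal width', ylabel='Row number',
       title='First ten rows')
```

Working through `fig` and `ax` rather than the `plt` functions keeps each change tied to a specific plot, which helps once a figure has several subplots.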
Incorporating Statistical Calculations
Statistical calculations can be included with Pandas methods
CODE OUTPUT
(data
 .groupby('species')
 .mean()
 .plot(color=['red', 'blue',
              'black', 'green'],
       fontsize=10.0, figsize=(4, 4)))
Statistical Plotting with Seaborn
Joint distribution and scatter plots can be created
CODE OUTPUT
# height replaces the older size argument (renamed in Seaborn 0.9)
sns.jointplot(x='sepal_length',
              y='sepal_width',
              data=data, height=4)
Statistical Plotting with Seaborn
Correlation plots of all variable pairs can also be made with Seaborn
CODE OUTPUT