Data Science Documentation
Release 0.1
Jake Teo
Contents

1 General Notes
    1.1 Virtual Environment
    1.2 Modeling
2 Learning
    2.1 Datasets
    2.2 Kaggle
3 Exploratory Analysis
    3.1 Univariate
    3.2 Multi-Variate
4 Feature Preprocessing
    4.1 Missing Values
    4.2 Outliers
    4.3 Encoding
    4.4 Coordinates
5 Feature Normalization
    5.1 Scaling
    5.2 Pipeline
    5.3 Persistence
6 Feature Engineering
    6.1 Manual
    6.2 Auto
7 Class Imbalance
    7.1 Over-Sampling
    7.2 Under-Sampling
    7.3 Under/Over-Sampling
    7.4 Cost Sensitive Classification
8 Data Leakage
    8.1 Examples
    8.2 Types of Leakages
    8.3 Detecting Leakages
    8.4 Minimising Leakages
9 Supervised Learning
    9.1 Classification
    9.2 Regression
10 Unsupervised Learning
    10.1 Transformations
    10.2 Clustering
    10.3 One-Class Classification
    10.4 Distance Metrics
13 Evaluation
    13.1 Classification
    13.2 Regression
    13.3 K-fold Cross-Validation
    13.4 Hyperparameters Tuning
14 Explainability
    14.1 Feature Importance
    14.2 Permutation Importance
    14.3 Partial Dependence Plots
    14.4 SHAP
15 Utilities
    15.1 Persistence
    15.2 Memory Reduction
    15.3 Parallel Pandas
    15.4 Jupyter Extension
16 Flask
    16.1 Basics
    16.2 Folder Structure
    16.3 App Configs
    16.4 Manipulating HTML
    16.5 Testing
    16.6 File Upload
    16.7 Logging
    16.8 Docker
    16.9 Storing Keys
    16.10 Changing Environment
    16.11 Parallel Processing
    16.12 Scaling Flask
    16.13 OpenAPI
    16.14 Rate Limiting
    16.15 Successors to Flask
17 FastAPI
    17.1 Uvicorn
    17.2 Request-Response Schema
    17.3 Render Template
    17.4 OpenAPI
    17.5 Asynchronous
18 Docker
    18.1 Creating Images
    18.2 Docker Compose
    18.3 Docker Swarm
    18.4 Networking
    18.5 Commands
    18.6 Small Efficient Images
Data Science Documentation, Release 0.1
This documentation summarises various machine learning techniques in Python. Much of the content is compiled from various resources, so please cite them appropriately if you use it.
CHAPTER 1
General Notes
1.1 Virtual Environment

Every project has a different set of requirements and a different set of Python packages to support it. The versions of each package can differ or break with each Python or dependency update, so it is important to isolate every project within an enclosed virtual environment. Anaconda provides a straightforward way to manage this.
# create environment; specify a python base or it will copy all existing packages
conda create -n yourenvname anaconda
conda create -n yourenvname python=3.7
conda create -n yourenvname anaconda python=3.7

# activate environment
conda activate yourenvname

# install package
conda install -n yourenvname [package]

# deactivate environment
conda deactivate

# delete environment
# --all = remove all packages in the environment
conda remove -n yourenvname --all
Alternatively, we can create a fixed environment file and execute it using conda env create -f environment.yml. This will create an environment with the name and packages specified within the file. Channels specify where the packages are installed from.
name: environment_name
channels:
- conda-forge
- defaults
dependencies:
- python=3.7
- bokeh=0.9.2
- numpy=1.9.*
- pip:
- baytune==0.3.1
1.1.3 Requirements.txt

If there is no yaml file specifying the packages to install, it is good practice to create a requirements.txt using the pipreqs package (pip install pipreqs). We can then create the file from the command line using pipreqs directory_path --force, where --force overwrites any existing requirements.txt file.
Below is what the contents of a requirements.txt file look like. After creating the file and activating the virtual environment, install all the packages in one go using pip install -r requirements.txt.
pika==1.1.0
scipy==1.4.1
scikit_image==0.16.2
numpy==1.18.1
# package from github, not present in pip
git+https://github.com/cftang0827/pedestrian_detection_ssdlite
# wheel file stored in a website
--find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/index.html
detectron2
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==1.5.0+cu101
torchvision==0.6.0+cu101
1.2 Modeling

A parsimonious model is one that accomplishes the desired level of prediction with as few predictor variables as possible.
1.2.1 Variables

The type of data is essential, as it determines what kind of tests can be applied to it.

Continuous: also known as quantitative; takes an unlimited number of values.
Categorical: also known as discrete or qualitative; takes a fixed number of values or categories.
The best predictive algorithm is one that has good Generalization Ability. With that, it will be able to give accurate predictions to new and previously unseen data.

High Bias results from Underfitting the model. This usually results from erroneous assumptions, and causes the model to be too general.

High Variance results from Overfitting the model; it will predict the training dataset very accurately, but not unseen new datasets, because it fits even the slightest noise in the dataset.

The best model, with the highest accuracy, is the middle ground between the two.
1. Remove features that have too many NaNs, or fill NaNs with another value
2. Remove features that will introduce data leakage
3. Encode categorical features into integers
4. Extract new useful features (between and within current features)
With the exception of Tree models and Naive Bayes, other machine learning techniques like Neural Networks, KNN,
SVM should have their features scaled.
Split the dataset into Train and Test datasets. By default, sklearn assigns 75% to train & 25% to test randomly. A random state (seed) can be selected to fix the randomisation.
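A minimal sketch of the split, using the iris dataset (variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# default split is 75% train / 25% test; random_state fixes the randomisation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))  # 112 38
```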
Create Model

clf = DecisionTreeClassifier()

Fit Model

clf.fit(X_train, y_train)

Test Model

Test the model by predicting the identity of unseen data using the testing dataset.

y_predict = clf.predict(X_test)
Score Model
import sklearn.metrics
print(sklearn.metrics.accuracy_score(y_test, y_predict)*100, '%')
>>> 97.3684210526 %
Cross Validation

When all code is working fine, remove the train-test portion and use Grid Search Cross Validation to compute the best parameters with cross-validation.
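A sketch of this step, with an illustrative parameter grid for a decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# search over an illustrative grid with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 5]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```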
Final Model

Finally, rebuild the model using the full dataset and the chosen parameters.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier

# models to test
svml = LinearSVC()
svm = SVC()
rf = RandomForestClassifier()
xg = XGBClassifier()
xr = ExtraTreesClassifier()

# iterations
classifiers = [svml, svm, rf, xr, xg]
names = ['Linear SVM', 'RBF SVM', 'Random Forest', 'Extremely Randomized Trees', 'XGBoost']
results = []

# train-test split
X = df[df.columns[:-1]]
# normalise data for SVM
X = StandardScaler().fit(X).transform(X)
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
CHAPTER 2
Learning
2.1 Datasets
There are in-built datasets provided in both statsmodels and sklearn packages.
2.1.1 Statsmodels
In statsmodels, many R datasets can be obtained from the function sm.datasets.get_rdataset(). To view each dataset's description, use print(duncan_prestige.__doc__).

https://www.statsmodels.org/devel/datasets/index.html

import statsmodels.api as sm
prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True).data
print(prestige.head())
2.1.2 Sklearn
There are a few common toy datasets in sklearn. For others, view http://scikit-learn.org/stable/datasets/index.html. To view each dataset's description, use print(boston['DESCR']).
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
Name: species, dtype: category
2.1.3 Vega-Datasets
Not in-built, but can be installed via pip install vega_datasets. More at https://github.com/jakevdp/vega_datasets.
>>> from vega_datasets import data
>>> data.list_datasets()
['7zip', 'airports', 'anscombe', 'barley', 'birdstrikes', 'budget', \
'budgets', 'burtin', 'cars', 'climate', 'co2-concentration', 'countries', \
'crimea', 'disasters', 'driving', 'earthquakes', 'ffox', 'flare', \
'flare-dependencies', 'flights-10k', 'flights-200k', 'flights-20k', \
'flights-2k', 'flights-3m', 'flights-5k', 'flights-airport', 'gapminder', \
'gapminder-health-income', 'gimp', 'github', 'graticule', 'income', 'iris', \
'jobs', 'londonBoroughs', 'londonCentroids', 'londonTubeLines', 'lookup_groups', \
'lookup_people', 'miserables', 'monarchs', 'movies', 'normal-2d', 'obesity', \
'points', 'population', 'population_engineers_hurricanes', 'seattle-temps', \
'seattle-weather', 'sf-temps', 'sp500', 'stocks', 'udistrict', 'unemployment', \
'unemployment-across-industries', 'us-10m', 'us-employment', 'us-state-capitals', \
'weather', 'weball26', 'wheat', 'world-110m', 'zipcodes']
2.2 Kaggle

Kaggle is the most recognised online data science competition platform, with attractive rewards and recognition for top competitors. With a point system that encourages sharing, one can learn from the top practitioners in the world.

There are 4 types of expertise medals for specific work, namely Competition, Dataset, Notebook, and Discussion medals. For each type of expertise, it is possible to obtain bronze, silver and gold medals.

Performance Tier is an overall recognition for each of the expertise types stated above, based on the number of medals accumulated. The various rankings are Novice, Contributor, Expert, Master, and Grandmaster.
More at https://www.kaggle.com/progression
Kaggle’s notebook has a dedicated GPU and decent RAM for deep-learning neural networks.
For installation of new packages, check "internet" under "Settings" in the right panel first, then in the notebook cell, run !pip install package.

To read a dataset, you can see the file path in the right panel under "Data". It goes something like /kaggle/input/competition_folder_name.

To download/export the prediction for submission, save it like df_submission.to_csv(r'/kaggle/working/submission.csv', index=False).

To do a direct submission, commit the notebook, with the output saved directly as submission.csv, e.g., df_submission.to_csv(r'submission.csv', index=False).
CHAPTER 3
Exploratory Analysis
Exploratory data analysis (EDA) is an essential step to understand the data better, in order to engineer and select features before modelling. This often requires skills in visualisation to better interpret the data.

3.1 Univariate

When plotting distributions, it is important to compare the distributions of both train and test sets. If the test set is very different for certain features, the model will underfit and have a low accuracy.
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
for i in X.columns:
    plt.figure(figsize=(15,5))
    sns.distplot(X[i])
    sns.distplot(pred[i])
For categorical features, you may want to see if they have enough sample size for each category.
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
df['Wildnerness'].value_counts()
cmap = sns.color_palette("Set2")
sns.countplot(x='Wildnerness',data=df, palette=cmap);
plt.xticks(rotation=45);
To check for possible relationships with the target, place the feature under hue.
plt.figure(figsize=(12,6))
sns.countplot(x='Cover_Type',data=wild, hue='Wilderness');
plt.xticks(rotation=45);
Multiple Plots
# note: only for 1 row or 1 col, else need to flatten nested list in axes
fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(15, 5))
col = ['Winner','Second','Third']
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=90)
Using the 50th percentile to compare among different classes, it is easy to find features that can have high prediction importance if they do not overlap. Boxplots can also be used for outlier detection. Features have to be continuous.

From different dataframes, displaying the same feature.
df.boxplot(figsize=(10,5));
plt.figure(figsize=(7, 5))
cmap = sns.color_palette("Set3")
sns.boxplot(x='Cover_Type', y='Elevation', data=df, palette=cmap);
plt.xticks(rotation=45);
Multiple Plots
cmap = sns.color_palette("Set2")
3.2 Multi-Variate
Using seaborn:

corr = df.corr()
plt.figure(figsize=(15, 8))
sns.heatmap(corr, cmap=sns.color_palette("RdBu_r", 20));
CHAPTER 4
Feature Preprocessing

4.1 Missing Values

Machine learning models cannot accept null/NaN values. We will need to either remove them or fill them with a logical value. To investigate how many nulls are in each column:
import pandas as pd

def null_analysis(df):
    '''
    desc: get nulls for each column in counts & percentages
    arg: dataframe
    return: dataframe
    '''
    null_cnt = df.isnull().sum()  # calculate null counts
    null_cnt = null_cnt[null_cnt != 0]  # remove non-null cols
    null_percent = null_cnt / len(df) * 100  # calculate null percentages
    null_table = pd.concat([pd.DataFrame(null_cnt), pd.DataFrame(null_percent)], axis=1)
    null_table.columns = ['counts', 'percentage']
    null_table.sort_values('counts', ascending=False, inplace=True)
    return null_table
4.1.1 Threshold

It makes no sense to fill in the null values if there are too many of them. We can set a threshold to delete the entire column if it has too many nulls.
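A sketch, assuming a 25% cut-off (the threshold itself is a judgment call):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, np.nan, np.nan, np.nan],
                   'c': [1, 2, np.nan, 4]})

# keep only columns whose null fraction is at or below the threshold
threshold = 0.25
df = df.loc[:, df.isnull().mean() <= threshold]
print(df.columns.tolist())  # ['a', 'c']
```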
4.1.2 Impute
We can change missing values for the entire dataframe into their individual column means or medians.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
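A sketch of column-mean imputation with SimpleImputer (the dataframe is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan]})

# replace each NaN with its column's mean
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
```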
4.1.3 Interpolation

We can also use interpolation via pandas' default function to fill in the missing values. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
import pandas as pd
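A sketch of the default (linear) interpolation on a series with gaps:

```python
import numpy as np
import pandas as pd

# linear interpolation fills each NaN from its neighbours
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```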
4.2 Outliers

Outliers are especially sensitive in linear models. They can be (1) removed manually by defining the lower and upper bound limits, or (2) handled by grouping the features into ranks.

Below is a simple method to detect & remove outliers, defined as points outside a boxplot's whiskers.
def boxplot_outlier_removal(X, exclude=['']):
    '''
    remove outliers detected by boxplot (Q1/Q3 -/+ IQR*1.5)

    Parameters
    ----------
    X : dataframe
        dataset to remove outliers from
    exclude : list of str
        column names to exclude from outlier removal

    Returns
    -------
    X : dataframe
        dataset with outliers removed
    '''
    before = len(X)

    # for each numeric column, keep only rows within the whiskers
    for col in X.select_dtypes(include='number').columns:
        if col not in exclude:
            Q1 = X[col].quantile(0.25)
            Q3 = X[col].quantile(0.75)
            IQR = Q3 - Q1
            X = X[(X[col] >= Q1 - 1.5 * IQR) & (X[col] <= Q3 + 1.5 * IQR)]

    after = len(X)
    diff = before - after
    percent = diff / before * 100
    print('{} ({:.2f}%) outliers removed'.format(diff, percent))
    return X
4.3 Encoding

Label Encoding: converts each category in a feature into an integer.

import pandas as pd
from pandas import DataFrame
from sklearn import preprocessing

# Test data
df = DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2
One-Hot Encoding: We could use an integer encoding directly, rescaled where needed. This may work for problems where there is a natural ordinal relationship between the categories, and in turn the integer values, such as labels for temperature 'cold', 'warm', and 'hot'. There may be problems when there is no ordinal relationship, and allowing the representation to lean on any such relationship might be damaging to learning. An example might be the labels 'dog' and 'cat'.
Each category becomes one binary field of 1s & 0s. Not good if there are too many categories in a feature; the result may need to be stored in a sparse matrix.

• Dummies: pd.get_dummies converts a string column into binary fields, splitting the columns according to the n categories
• sklearn: sklearn.preprocessing.OneHotEncoder stores the result in a sparse matrix (in older versions, strings had to be converted into numeric first)

Feature Interactions: interactions between categorical features; useful for Linear Models & KNN.
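A sketch of both approaches (note that recent sklearn versions accept string input directly):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})

# pandas: one binary column per category
dummies = pd.get_dummies(df['Col'])
print(dummies.columns.tolist())  # ['A', 'B', 'C']

# sklearn: returns a sparse matrix by default
enc = OneHotEncoder()
sparse = enc.fit_transform(df[['Col']])
print(sparse.shape)  # (4, 3)
```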
4.4 Coordinates

It is necessary to define a projection for a coordinate reference system if there is a classification in space, e.g., k-means clustering. This basically changes the coordinates from a spherical component to a flat surface.

Also take note of spatial auto-correlation.
CHAPTER 5
Feature Normalization

Normalisation is another important concept needed to change all features to the same scale. This allows for faster convergence on learning, and more uniform influence for all weights. More on the sklearn website:

• http://scikit-learn.org/stable/modules/preprocessing.html

Tree-based models are not dependent on scaling, but non-tree models very often are hugely dependent on it. Outliers can affect certain scalers, and it is important to either remove them or choose a scaler that is robust towards them.

• https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
• http://benalexkeen.com/feature-scaling-with-scikit-learn/
5.1 Scaling

5.1.1 StandardScaler

StandardScaler standardizes features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples and s is the standard deviation.
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
5.1.2 MinMaxScaler

Another way to normalise is to use the MinMaxScaler, which rescales all features to be between 0 and 1, as defined below:

X_scaled = (X - X.min) / (X.max - X.min)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
5.1.3 RobustScaler
Works similarly to standard scaler except that it uses median and quartiles, instead of mean and variance. Good as it
ignores data points that are outliers.
5.1.4 Normalizer
Scales each data point such that the feature vector has a Euclidean length of 1. Often used when the direction of the
data matters, not the length of the feature vector.
5.2 Pipeline

Scaling has a chance of leaking part of the test data in the train-test split into the training data. This is especially inevitable when using cross-validation.

We can scale the train and test datasets separately to avoid this. However, a more convenient way is to use the pipeline function in sklearn, which wraps the scaler and classifier together, and scales them separately during cross validation. Any other functions can also be input here, e.g., rolling window feature extraction, which also has the potential for data leakage.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# "scaler" & "svm" can be any name, but they must be placed in the correct order of processing
pipe = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC())])
pipe.fit(X_train, y_train)

pipe.score(X_test, y_test)
# 0.95104895104895104
5.3 Persistence

To save the fitted scaler to normalize new datasets, we can persist it using pickle or joblib for reuse in the future.
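A sketch with joblib (the file name is illustrative):

```python
import joblib
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit([[0.0], [10.0]])

# save the fitted scaler, then load it back to transform new data
joblib.dump(scaler, 'scaler.joblib')
scaler2 = joblib.load('scaler.joblib')
print(scaler2.transform([[5.0]])[0][0])  # 0.0
```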
CHAPTER 6
Feature Engineering

Feature Engineering is one of the most important parts of model building. Collecting and creating relevant features from existing ones is most often the determinant of a high prediction value.
They can be classified broadly as:
• Aggregations
– Rolling/sliding Window (overlapping)
– Tumbling Window (non-overlapping)
• Transformations
• Decompositions
• Interactions
6.1 Manual
6.1.1 Decomposition
Datetime Breakdown
Very often, various dates and times of the day have strong interactions with your predictors. Here's a script to pull those values out.

import pandas as pd

def extract_time(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['mth'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['dayofweek'] = df['timestamp'].dt.dayofweek
    return df
import holidays

train['holiday'] = train['timestamp'].apply(lambda x: 0 if holidays.US().get(x) is None else 1)
Time-Series

Decomposing a time-series into trend (long-term), seasonality (short-term), and residuals (noise). There are two methods to decompose:

• Additive: the component is added to the other components to create the overall forecast value.
• Multiplicative: the component is multiplied by the other components to create the overall forecast value.

Usually an additive time-series will be used if there are no seasonal variations over time.
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Fourier Transformation

The Fourier transform (FT) decomposes a function of time (a signal) into its constituent frequencies, i.e., it converts amplitudes into frequencies.
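A sketch using numpy's FFT to recover the dominant frequency of a pure sine wave:

```python
import numpy as np

fs = 100                            # sampling rate in Hz
t = np.arange(0, 1, 1 / fs)         # 1 second of samples
signal = np.sin(2 * np.pi * 5 * t)  # 5 Hz sine wave

# amplitude spectrum and the frequency bins it maps to
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
print(freqs[np.argmax(spectrum)])  # 5.0
```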
Wavelet Transform

Wavelet transforms are time-frequency transforms employing wavelets. They are similar to Fourier transforms, the difference being that Fourier transforms are localized only in frequency, whereas wavelet transforms are localized in both time and frequency. There are various considerations for wavelet transforms, including:
• Which wavelet transform will you use, CWT or DWT?
• Which wavelet family will you use?
• Up to which level of decomposition will you go?
• Number of coefficients (vanishing moments)
• What is the right range of scales to use?
• http://ataspinar.com/2018/12/21/a-guide-for-using-the-wavelet-transform-in-machine-learning/
• https://www.kaggle.com/jackvial/dwt-signal-denoising
• https://www.kaggle.com/tarunpaparaju/lanl-earthquake-prediction-signal-denoising
import pywt
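A sketch of a discrete wavelet transform with pywt (the wavelet family and decomposition level are illustrative choices):

```python
import numpy as np
import pywt

signal = np.sin(np.linspace(0, 8 * np.pi, 1024))

# 3-level DWT with a Daubechies-4 wavelet: [cA3, cD3, cD2, cD1]
coeffs = pywt.wavedec(signal, 'db4', level=3)
print(len(coeffs))  # 4

# simple denoising: soft-threshold the detail coefficients, then reconstruct
coeffs[1:] = [pywt.threshold(c, value=0.1, mode='soft') for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, 'db4')
print(denoised.shape)  # (1024,)
```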
6.2 Auto

Automatic generation of new features from existing ones is starting to gain popularity, as it can save a lot of time.
6.2.1 Tsfresh

tsfresh is a feature extraction package for time-series. It can extract more than 1200 different features, and filter the set down to features that are deemed relevant. In essence, it is a univariate feature extractor.

https://tsfresh.readthedocs.io/en/latest/
Extract all possible features
def list_union_df(fault_list):
'''
Description
------------
Convert list of faults with a single signal value into a dataframe with an id for
˓→each fault sample
df = pd.concat(dflist)
return df
df = list_union_df(fault_list)
# tsfresh
from tsfresh import extract_features, extract_relevant_features

extracted_features = extract_features(df, column_id='id')

# delete columns which only have one value for all rows
for i in extracted_features.columns:
    col = extracted_features[i]
    if len(col.unique()) == 1:
        del extracted_features[i]

features_filtered_direct = extract_relevant_features(timeseries, y,
                                                     column_id='id',
                                                     column_sort='time_steps',
                                                     fdr_level=0.05)
6.2.2 FeatureTools
FeatureTools is extremely useful if you have a base table, with other tables that have relationships to it.

We first create an EntitySet, which is like a database. Then we create entities, i.e., individual tables with a unique id for each, and define the relationships between them.
https://github.com/Featuretools/featuretools
import featuretools as ft

def make_entityset(data):
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='engines',
                        index='engine_no')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='cycles',
                        index='time_in_cycles')
    return es

es = make_entityset(data)
es
We then use something called Deep Feature Synthesis (dfs) to generate features automatically.

Primitives are the types of new features to be extracted from the datasets. They can be aggregation (data is combined) or transformation (data is changed via a function) extractors. The list can be found via ft.primitives.list_primitives(). External primitives like tsfresh, or custom calculations, can also be input into FeatureTools.
FeatureTools appears to be a very powerful auto-feature extractor. Some resources to read further are as follows:
• https://brendanhasz.github.io/2018/11/11/featuretools
• https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219
• https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183
CHAPTER 7
Class Imbalance

In domains like predictive maintenance, machine failures are usually rare occurrences in the lifetime of the assets compared to normal operation. This causes an imbalance in the label distribution, which usually leads to poor performance: algorithms tend to classify majority class examples better, at the expense of the minority class, because the total misclassification error improves most when the majority class is labeled correctly. Techniques are available to correct for this.

The imbalanced-learn package provides an excellent range of algorithms for adjusting for imbalanced data. Install with pip install -U imbalanced-learn or conda install -c conda-forge imbalanced-learn.

An important thing to note is that resampling must be done AFTER the train-test split, so as to prevent data leakage.
7.1 Over-Sampling
SMOTE (synthetic minority over-sampling technique) is a common and popular up-sampling technique.
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.linear_model import LogisticRegression

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)  # fit_sample in older versions
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)

ada = ADASYN()
X_resampled, y_resampled = ada.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)
7.2 Under-Sampling
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)
7.3 Under/Over-Sampling
SMOTEENN combines SMOTE with Edited Nearest Neighbours, which is used to pare down and centralise the
negative cases.
from imblearn.combine import SMOTEENN

smo = SMOTEENN()
X_resampled, y_resampled = smo.fit_resample(X_train, y_train)
clf = LogisticRegression()
clf.fit(X_resampled, y_resampled)
One can also make the classifier aware of the imbalanced data by incorporating class weights into the cost function. Intuitively, we want to give a higher weight to the minority class and a lower weight to the majority class.
http://albahnsen.github.io/CostSensitiveClassification/index.html
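A minimal sketch of this idea in sklearn, using the class_weight parameter of LogisticRegression (the dataset here is synthetic and only illustrative, not from the original):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical imbalanced dataset: roughly 95% of samples in the majority class
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'balanced' reweights each class inversely proportional to its frequency,
# so minority-class errors cost more in the loss function
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)
```

A dictionary such as class_weight={0: 1, 1: 10} can also be passed to set the weights manually.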
Data Leakage
Data leakage is a serious bane in machine learning, which usually results in overly optimistic model results.
8.1 Examples
Supervised Learning
9.1 Classification
Fig. 1: www.mathworks.com
Note:
1. Distance metric: Euclidean distance (default). In sklearn it is known as Minkowski with p = 2.
2. Number of nearest neighbours: k=1 gives a very specific model, k=5 a more general model. The nearest k data points are used to determine the classification.
3. Weighting function on neighbours: (optional)
4. How to aggregate the classes of neighbour points: simple majority (default)
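The settings above map onto sklearn's KNeighborsClassifier as in the following sketch (iris is used here as a stand-in dataset, not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# metric: p=2 is Minkowski with p=2, i.e. Euclidean distance (the default);
# weights='uniform' aggregates via a simple majority vote among the k neighbours
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', p=2)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```

Setting weights='distance' instead would weight closer neighbours more heavily in the vote.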
Uses the gini index (default) or entropy to split the data at a binary level.
Strengths: Can select a large number of features that best determine the targets.
Weakness: Tends to overfit the data as it will split till the end. Pruning can be done to remove the leaves to prevent overfitting, but that is not available in sklearn. Small changes in data can lead to different splits. Not very reproducible for future data (see random forest).
More on tuning parameters: https://medium.com/@mohtedibf/indepth-parameter-tuning-for-decision-tree-6753118a03c3
###### IMPORT MODULES #######
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

print(test_predictor.shape)
print(train_predictor.shape)

(38, 4)
(112, 4)
print(sklearn.metrics.confusion_matrix(test_target, predictions))
print(sklearn.metrics.accuracy_score(test_target, predictions) * 100, '%')

[[14 0 0]
[ 0 13 0]
[ 0 1 10]]
97.3684210526 %
Viewing the decision tree requires installing two packages: conda install graphviz & conda install pydotplus.
Parameters to tune decision trees include max_depth & min_samples_leaf.
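A short sketch of passing these two parameters (iris as a stand-in dataset; the parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# limiting the depth and the minimum leaf size restrains the tree from
# splitting all the way down, which reduces overfitting
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print(clf.get_depth())  # actual depth after fitting, capped at max_depth
```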
is, allowing for the possibility of picking the same row again at each selection. You repeat this random
selection process N times. The resulting bootstrap sample has N rows just like the original training set but
with possibly some rows from the original dataset missing and others occurring multiple times just due to
the nature of the random selection with replacement. This process is repeated to generate n samples, using the parameter n_estimators, which will eventually generate n decision trees.
• Splitting Features: When picking the best split for a node, instead of finding the best split across all
possible features (decision tree), a random subset of features is chosen and the best split is found within
that smaller subset of features. The number of features in the subset that are randomly considered at each
stage is controlled by the max_features parameter.
This randomness in selecting the bootstrap sample to train an individual tree in a forest ensemble, combined with the
fact that splitting a node in the tree is restricted to random subsets of the features of the split, virtually guarantees that
all of the decision trees and the random forest will be different.
print(train_feature.shape)
print(test_feature.shape)

(404, 13)
(102, 13)
accuracy
82.3529411765 %
confusion matrix
[[21 0 3]
[ 0 21 4]
[ 8 3 42]]
RM 0.225612
LSTAT 0.192478
CRIM 0.108510
DIS 0.088056
AGE 0.074202
NOX 0.067718
B 0.057706
PTRATIO 0.051702
TAX 0.047568
INDUS 0.037871
RAD 0.026538
ZN 0.012635
CHAS 0.009405
# see how many decision trees are minimally required to make the accuracy consistent
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

trees = range(100)
accuracy = np.zeros(100)

for i in range(len(trees)):
    clf = RandomForestClassifier(n_estimators=i + 1)
    model = clf.fit(train_feature, train_target)
    predictions = model.predict(test_feature)
    accuracy[i] = sklearn.metrics.accuracy_score(test_target, predictions)

plt.plot(trees, accuracy)

# well, seems like more than 10 trees will give a consistent accuracy of 0.82.
# Guess there's no need to have an ensemble of 100 trees!
Gradient Boosted Decision Trees (GBDT) build a series of small decision trees, with each tree attempting to correct the errors from the previous stage. Here's a good video on it; it describes AdaBoost, but gives a good overview of tree boosting models.
Typically, gradient boosted tree ensembles use lots of shallow trees, known in machine learning as weak learners, built in a nonrandom way to create a model that makes fewer and fewer mistakes as more trees are added. Once the model is built, making predictions with a gradient boosted tree model is fast and doesn't use a lot of memory.
The learning_rate parameter controls how hard each tree tries to correct the mistakes from the previous round. A higher learning rate leads to more complex trees.
Key parameters: n_estimators, learning_rate, max_depth.
# Default Parameters
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

# Results
Breast cancer dataset (learning_rate=0.1, max_depth=3)
Accuracy of GBDT classifier on training set: 1.00
Accuracy of GBDT classifier on test set: 0.96
9.1.6 XGBoost
XGBoost, or eXtreme Gradient Boosting, is a form of gradient boosted decision trees designed to be highly efficient, flexible and portable. It is one of the most dominant classifiers in competitive machine learning competitions.
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/#
9.1.7 LightGBM
LightGBM (Light Gradient Boosting) is a lightweight version of gradient boosting developed by Microsoft. It has similar performance to XGBoost but can run much faster.
https://lightgbm.readthedocs.io/en/latest/index.html
import lightgbm
parameters = {
'application': 'binary',
'objective': 'binary',
'metric': 'auc',
'is_unbalance': 'true',
'boosting': 'gbdt',
'num_leaves': 31,
'feature_fraction': 0.5,
'bagging_fraction': 0.5,
'bagging_freq': 20,
'learning_rate': 0.05,
'verbose': 0
}
model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=test_data,
                       num_boost_round=5000,
                       early_stopping_rounds=100)
9.1.8 CatBoost
Category Boosting has high performance compared to other popular models, and does not require conversion of
categorical values into numbers. It is said to be even faster than LighGBM, and allows model to be ran using GPU.
For easy use, run in Colab & switch runtime to GPU.
More:
• https://catboost.ai
• https://github.com/catboost/tutorials/blob/master/classification/classification_tutorial.ipynb
# Model Fitting
from catboost import CatBoostClassifier

# task_type='GPU' is set on the model; verbose gives output every n iterations
model = CatBoostClassifier(task_type='GPU')
model.fit(X_train, y_train,
          cat_features=cat_features,
          eval_set=(X_test, y_test),
          verbose=5)

# Get Parameters
model.get_all_params()

# Evaluation
model.get_feature_importance(prettified=True)
We can also use k-fold cross validation for better scoring evaluation. There is no need to specify
CatBoostRegressor or CatBoostClassifier, just input the correct eval_metric. One of the folds will
be used as a validation set.
More: https://catboost.ai/docs/concepts/python-reference_cv.html
import catboost

# CV scores
scores = catboost.cv(cv_dataset, params, fold_count=5)
scores
Naive Bayes is a probabilistic model. Features are assumed to be independent of each other in a given class. This makes the math very easy. E.g., words that are unrelated multiply together to form the final probability.
Prior Probability: Pr(y). The probability that a class (y) occurred in the entire training dataset.
Likelihood: Pr(xi|y). The probability of the features (xi) occurring given a class (y).
There are 3 types of Naive Bayes in sklearn: Gaussian, Multinomial and Bernoulli.
Sklearn allows partial fitting, i.e., fitting the model incrementally if the dataset is too large for memory.
Naive Bayes models only have one smoothing parameter called alpha (default 1.0 in sklearn). It adds a virtual data point that has positive values for all features. This is necessary because if there are no positive features, the entire probability will be 0 (since it is a multiplicative model). More alpha means more smoothing and a more generalised (less complex) model.
from sklearn.naive_bayes import GaussianNB
from adspy_shared_utilities import plot_class_regions_for_classifier
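A minimal GaussianNB fit, sketched on the breast cancer dataset as a stand-in (the partial-fitting option mentioned above is noted in a comment):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nbclf = GaussianNB()
nbclf.fit(X_train, y_train)
# partial_fit(X_batch, y_batch, classes=[0, 1]) could be used instead
# to fit incrementally when the dataset does not fit in memory
print(nbclf.score(X_test, y_test))
```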
Logistic regression is used for a binary output or y value. Functions are available in both the statsmodels and sklearn packages.
Lower CI Upper CI OR
depth 61.625434 80.209893 70.306255
layers_YESNO 0.112824 0.130223 0.121212
h = 6
w = 8
print('A fruit with height {} and width {} is predicted to be: {}'
.format(h,w, ['not an apple', 'an apple'][clf.predict([[h,w]])[0]]))
h = 10
w = 7
print('A fruit with height {} and width {} is predicted to be: {}'
.format(h,w, ['not an apple', 'an apple'][clf.predict([[h,w]])[0]]))
subaxes.set_xlabel('height')
subaxes.set_ylabel('width')
Support Vector Machines (SVM) involve locating the support vectors of two boundaries to find a maximum-tolerance hyperplane. Side note: linear kernels work best for text classification.
We can call a linear SVC directly by importing the LinearSVC function:
from sklearn.svm import LinearSVC
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0)
Multi-Class Classification, i.e., having more than 2 target values, is also possible. With the results, it is possible to
compare one class versus all other classes.
Visualising in a graph:
plt.figure(figsize=(6,6))
colors = ['r', 'g', 'b', 'y']
cmap_fruits = ListedColormap(['#FF0000', '#00FF00', '#0000FF','#FFFF00'])
plt.scatter(X_fruits_2d[['height']], X_fruits_2d[['width']],
c=y_fruits_2d, cmap=cmap_fruits, edgecolor = 'black', alpha=.7)
# and the decision boundary is defined as being all points with y = 0;
# to plot x_1 as a function of x_0 we just solve w_0 x_0 + w_1 x_1 + b = 0 for x_1:
plt.plot(x_0_range, -(x_0_range * w[0] + b) / w[1], c=color, alpha=.8)
plt.legend(target_names_fruits)
plt.xlabel('height')
plt.ylabel('width')
plt.xlim(-2, 12)
plt.ylim(-2, 15)
plt.show()
Full tuning in Support Vector Machines, using normalisation, kernel tuning, and regularisation.
Parameters include
• hidden_layer_sizes: the number of hidden layers, with the number of units in each layer (default 100).
• solver: the algorithm that does the numerical work of finding the optimal weights. The default adam is used for large datasets; lbfgs is used for smaller datasets.
• alpha: L2 regularisation; default is 0.0001.
• activation: the non-linear activation function, which includes relu (default), logistic, tanh.
One Hidden Layer
from sklearn.neural_network import MLPClassifier
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot
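A minimal one-hidden-layer sketch (breast cancer is used as a stand-in dataset; the alpha value is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# neural networks are sensitive to feature scale, so normalise first
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# one hidden layer of 100 units; lbfgs suits this smaller dataset
clf = MLPClassifier(hidden_layer_sizes=[100], solver='lbfgs',
                    alpha=5.0, random_state=0)
clf.fit(X_train_scaled, y_train)
```

Passing e.g. hidden_layer_sizes=[10, 10] instead would give two hidden layers of 10 units each.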
Fig. 10: Activation Function. University of Michigan: Coursera Data Science in Python
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# RESULTS
Breast cancer dataset
Accuracy of NN classifier on training set: 0.98
Accuracy of NN classifier on test set: 0.97
9.2 Regression
Ordinary Least Squares Regression, or OLS Regression, is the most basic and fundamental form of regression. A best-fit line ŷ = a + bx is drawn based on the ordinary least squares method, i.e., the least total area of squares (sum of squares), with lengths measured from each x,y point to the regression line.
OLS can be conducted using the statsmodels package.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
reg = linear_model.LinearRegression()
model = reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
model
# R2 scores
r2_trains = model.score(X_train, y_train)
r2_tests = model.score(X_test, y_test)
Regularisation is an important concept used in Ridge Regression as well as in the next model, LASSO regression. Ridge regression uses regularisation, which adds a penalty parameter to a variable when it has a large variation. Regularisation prevents overfitting by restricting the model, thus lowering its complexity.
• Uses L2 regularisation, which reduces the sum of squares of the parameters
• The influence of L2 is controlled by an alpha parameter. Default is 1.
• High alpha means more regularisation and a simpler model.
• More in https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/
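A sketch of fitting such a ridge model; the crime dataset used in the printout below is not reproduced here, so a synthetic regression dataset stands in:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the crime dataset
X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ridge is sensitive to feature scale, so scale the features first
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# alpha controls the amount of L2 regularisation (default 1)
linridge = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
```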
print('Crime dataset')
print('ridge regression linear model intercept: {}'
.format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
.format(linridge.coef_))
print('R-squared score (training): {:.3f}'
.format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
.format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
.format(np.sum(linridge.coef_ != 0)))
Note:
• Many variables with small/medium effects: Ridge
• Only a few variables with medium/large effects: LASSO
LASSO refers to Least Absolute Shrinkage and Selection Operator Regression. Like Ridge Regression, it also has a regularisation property.
• Uses L1 regularisation, which penalises the sum of the absolute values of the coefficients, shrinking the regression coefficients of unimportant features to 0
• This is known as a sparse solution, or a kind of feature selection, since some variables are removed in the process
• The influence of L1 is controlled by an alpha parameter. Default is 1.
• High alpha means more regularisation and a simpler model. When alpha = 0, it is a normal OLS regression.

a. Bias increases & variability decreases when alpha increases.
b. Useful when there are many features (explanatory variables).
c. All features have to be standardised so that they have mean 0 and standard deviation 1.
d. Several algorithms exist, e.g. LAR (Least Angle Regression), which starts with 0 predictors and adds the predictor that is most correlated at each step.
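A sketch of the sparse solution in practice, on a synthetic dataset where only a few features are informative (names and parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic data: only 5 of the 20 features actually drive the target
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# standardise to mean 0, standard deviation 1 before fitting
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# the L1 penalty drives unimportant coefficients exactly to 0
lasso = Lasso(alpha=2.0).fit(X_train_scaled, y_train)
print(np.sum(lasso.coef_ != 0), 'non-zero features')
```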
print(train_feature.shape)
print(test_feature.shape)

(404, 13)
(102, 13)
df2=pd.DataFrame(model.coef_, index=feature.columns)
df2.sort_values(by=0,ascending=False)
RM 3.050843
RAD 2.040252
ZN 1.004318
B 0.629933
CHAS 0.317948
INDUS 0.225688
AGE 0.000000
CRIM -0.770291
NOX -1.617137
TAX -1.731576
PTRATIO -1.923485
DIS -2.733660
LSTAT -3.878356
# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
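Since the original snippet is cut off at the page break, here is a self-contained sketch of a degree-2 polynomial regression on toy data (the data itself is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# toy data following y = 1 + 2x + 3x^2 (noise-free for clarity)
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# degree=2 expands the single column [x] into [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)

# an ordinary linear regression on the expanded features fits the curve
model = LinearRegression().fit(X_poly, y)
```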
Unsupervised Learning
10.1 Transformations
PCA summarises multiple fields of data into principal components, usually just 2, so that it is easier to visualise in a 2-dimensional plot. The 1st component captures the most variance of the entire dataset along a direction in the hyperplane, while the 2nd captures the next most variance at a right angle to the 1st. Because of the strong variance between data points, patterns tend to be teased out of the high-dimensional data even when there are just two dimensions. These 2 components can serve as new features for a supervised analysis.
In short, PCA finds the best possible characteristics that summarise the classes of a feature. Two excellent sites elaborate more: setosa, quora. The most challenging part of PCA is interpreting the components.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# Before applying PCA, each feature should be centered (zero mean) and with unit variance
scaled_data = StandardScaler().fit(df).transform(df)

# fit a PCA with 2 components
pca = PCA(n_components=2).fit(scaled_data)
x_pca = pca.transform(scaled_data)
print(df.shape, x_pca.shape)

# RESULTS
(569, 30) (569, 2)

percent = pca.explained_variance_ratio_
print(percent)
print(sum(percent))
Alternatively, we can write a function to determine how many components we should reduce to.
import numpy as np
from sklearn.decomposition import PCA

def pca_explained(X, threshold):
    '''
    prints optimal principal components based on threshold of PCA's explained variance

    Parameters
    ----------
    X : dataframe or array
        of features
    threshold : float < 1
        percentage of explained variance as cut off point
    '''
    # body reconstructed; the original was cut off at a page break
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    for n, variance in enumerate(cumulative, start=1):
        print('{} components at {:.2f}% explained variance'.format(n, variance * 100))
        if variance >= threshold:
            break
pca_explained(X, 0.85)
# 2 components at 61.64% explained variance
# 3 components at 77.41% explained variance
# 4 components at 86.63% explained variance
Plotting the PCA-transformed version of the breast cancer dataset, we can see that malignant and benign cells cluster into two groups, and we can apply a linear classifier to this two-dimensional representation of the dataset.
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0], x_pca[:,1], c=cancer['target'], cmap='plasma', alpha=0.4, edgecolors='black', s=65);
Plotting the magnitude of each feature value for the first two principal components. This gives the best explanation for
the components for each field.
fig = plt.figure(figsize=(8, 4))
plt.imshow(pca.components_, interpolation = 'none', cmap = 'plasma')
feature_names = list(cancer.feature_names)
plt.gca().set_xticks(np.arange(-.5, len(feature_names)));
plt.gca().set_yticks(np.arange(0.5, 2));
plt.gca().set_xticklabels(feature_names, rotation=90, ha='left', fontsize=12);
plt.gca().set_yticklabels(['First PC', 'Second PC'], va='bottom', fontsize=12);
plt.colorbar(orientation='horizontal', ticks=[pca.components_.min(), 0,
pca.components_.max()], pad=0.65);
We can also plot the feature magnitudes on the scatterplot in two separate axes (as in R), also known as a biplot. This shows the relationship of each feature's magnitude more clearly in a 2D space.
# plot size
plt.figure(figsize=(10,8))
# main scatterplot
plt.scatter(x_pca[:,0], x_pca[:,1], c=cancer['target'], cmap='plasma', alpha=0.4, edgecolors='black', s=40);
# reference lines
ax2.hlines(0,-0.5,0.5, linestyles='dotted', colors='grey')
ax2.vlines(0,-0.5,0.5, linestyles='dotted', colors='grey')
Lastly, we can specify the percentage explained variance, and let PCA decide on the number components.
from sklearn.decomposition import PCA
pca = PCA(0.99)
df_pca = pca.fit_transform(df)
Multi-Dimensional Scaling
Multi-Dimensional Scaling (MDS) is a type of manifold learning algorithm used to visualize a high dimensional dataset by projecting it onto a lower dimensional space, in most cases a two-dimensional page. PCA is weak in this aspect.
sklearn gives a good overview of various manifold techniques. https://scikit-learn.org/stable/modules/manifold.html
from adspy_shared_utilities import plot_labelled_scatter
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS
# each feature should be centered (zero mean) and with unit variance
X_fruits_normalized = StandardScaler().fit(X_fruits).transform(X_fruits)
mds = MDS(n_components = 2)
X_fruits_mds = mds.fit_transform(X_fruits_normalized)
t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful manifold learning algorithm for visualizing clusters. It finds a two-dimensional representation of your data, such that the distances between points in the 2D scatterplot match as closely as possible the distances between the same points in the original high dimensional dataset. In particular, t-SNE gives much more weight to preserving information about distances between points that are neighbors.
More information here.
from sklearn.manifold import TSNE

tsne = TSNE(random_state = 0)
X_tsne = tsne.fit_transform(X_fruits_normalized)
plot_labelled_scatter(X_tsne, y_fruits,
['apple', 'mandarin', 'orange', 'lemon'])
plt.xlabel('First t-SNE feature')
plt.ylabel('Second t-SNE feature')
plt.title('Fruits dataset t-SNE');
Fig. 2: You can see how some dimensionality reduction methods may be less successful on some datasets. Here, it
doesn’t work as well at finding structure in the small fruits dataset, compared to other methods like MDS.
LDA
Linear Discriminant Analysis is another dimension reduction method, but unlike PCA, it is a supervised method. It attempts to find a feature subspace or decision boundary that maximizes class separability. It then projects the data points onto new dimensions such that the clusters are as separate from each other as possible and the individual elements within a cluster are as close to the centroid of the cluster as possible.
Differences of PCA & LDA, from:
• https://sebastianraschka.com/Articles/2014_python_lda.html
• https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/
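A minimal sketch with sklearn's LinearDiscriminantAnalysis, using iris as a stand-in dataset (not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# unlike PCA, LDA uses the class labels y when fitting the projection
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # reduced to 2 dimensions, one row per sample
```

Note that n_components is capped at n_classes - 1, so with 3 iris classes at most 2 components are possible.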
Self-Organizing Maps
SOM is a special type of neural network that is trained using unsupervised learning to produce a two-dimensional map. Each row of data is assigned to its Best Matching Unit (BMU) neuron, with a neighbourhood effect used to create a topographic map.
They differ from other artificial neural networks as:
1. they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent)
2. they use a neighborhood function to preserve the topological properties of the input space
3. they consist of only one visible output layer
SOM requires scaling or normalization of all features first.
https://github.com/JustGlowing/minisom
We first need to calculate the number of neurons and how many of them make up each side of the map. The ratio of the side lengths of the map is approximately the ratio of the two largest eigenvalues of the training data's covariance matrix.
# calculate eigen_values
normal_cov = np.cov(data_normal)
eigen_values = np.linalg.eigvals(normal_cov)
# 2 largest eigenvalues
result = sorted([i.real for i in eigen_values])[-2:]
ratio_2_largest_eigen = result[1]/result[0]
side = total_neurons/ratio_2_largest_eigen
# two sides
print(total_neurons)
print('1st side', side)
print('2nd side', ratio_2_largest_eigen)
plt.figure(figsize=(6, 5))
plt.pcolor(som.distance_map().T, cmap='bone_r')
Quantization error is the distance between each vector and the BMU.
som.quantization_error(array)
10.2 Clustering
Find groups in data & assign every point in the dataset to one of the groups.
The below set of code allows assignment of each cluster to its original cluster attributes, or further comparison of the accuracy of prediction. The more of a cluster's points are assigned to a verified label, the higher the chance that the cluster represents that label.
# view percentages
res2 = df.groupby('actual')['cluster'].value_counts(normalize=True)*100
print(res2)
10.2.1 K-Means
We need to specify K, the number of clusters. It is also important to scale the features before applying K-means, unless the fields are not meant to be scaled, like distances. Categorical data is not appropriate, as clusters are calculated using Euclidean distance (means). For long distances over lat/long coordinates, the data needs to be projected onto a flat surface first.
One aspect of k means is that different random starting points for the cluster centers often result in very different
clustering solutions. So typically, the k-means algorithm is run in scikit-learn with ten different random initializations
and the solution occurring the most number of times is chosen.
Downsides
• Very sensitive to outliers. They have to be removed before running the model
• Might need to reduce dimensions if very high no. of features or the distance separation might not be
obvious
• Two variants, K-medians & K-Medoids are less sensitive to outliers (see https://github.com/annoviko/
pyclustering)
Methodology
Example 1
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from adspy_shared_utilities import plot_labelled_scatter
from sklearn.preprocessing import MinMaxScaler

fruits = pd.read_table('fruit_data_with_colors.txt')
X_fruits = fruits[['mass','width','height', 'color_score']].values  # as_matrix() is deprecated
y_fruits = fruits[['fruit_label']] - 1

X_fruits_normalized = MinMaxScaler().fit(X_fruits).transform(X_fruits)

kmeans = KMeans(n_clusters=4, random_state=0).fit(X_fruits_normalized)  # fit reconstructed; cut off in the original

plot_labelled_scatter(X_fruits_normalized, kmeans.labels_,
                      ['Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4'])
Example 2
#### IMPORT MODULES ####
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.cluster import KMeans
df.describe()
print(train_feature.shape)
print(test_feature.shape)

(120, 4)
(30, 4)
/ train_feature.shape[0])
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
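As the snippet above is truncated, here is a self-contained sketch of the elbow computation using KMeans's inertia_ attribute, which is the within-cluster sum of squared distances (synthetic blobs stand in for the data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

clusters = range(1, 10)
meandist = []
for k in clusters:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances;
    # divide by n to get an average distance measure per point
    meandist.append(km.inertia_ / X.shape[0])

# the 'elbow' is where adding clusters stops reducing meandist much
print(np.round(meandist, 2))
```

The values would then be plotted against the number of clusters, as in the plotting code above.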
We can visualise the clusters by reducing the dimensions to 2 using PCA. They are separated by Thiessen polygons, though in a multi-dimensional space.
plt.figure(figsize=(8,8))
plt.scatter(pd.DataFrame(pca)[0], pd.DataFrame(pca)[1], c=labels, cmap='plasma', alpha=0.5);
Sometimes we need to find the cluster centres so that we can get an absolute distance measure of centroids to new
data. Each feature will have a defined centre for each cluster.
If we have labels or y, and want to determine which y belongs to which cluster for an evaluation score, we can use a
groupby to find the most number of labels that fall in a cluster and manually label them as such.
df = concat.groupby(['label','cluster'])['cluster'].count()
If we want to know the distance of each datapoint's assigned cluster to its centroid, we can do a fit_transform to get the distances from all cluster centroids and process from there.
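A sketch of this, assuming synthetic blob data (fit_transform returns each point's distance to every cluster centre):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in data with 3 clusters in 2 dimensions
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
# fit_transform returns an (n_samples, n_clusters) matrix of distances
distances = km.fit_transform(X)

# the cluster centres themselves, one row of feature values per cluster
centres = km.cluster_centers_

# distance of each point to its nearest (i.e. assigned) centroid
nearest = distances.min(axis=1)
```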
GMM is, in essence, a density estimation model, but it can function like clustering. It has a probabilistic model under the hood, so it returns a matrix of probabilities of belonging to each cluster for each data point. More: https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html
We can set the covariance_type argument to choose between full (the default, an ellipse without a specific orientation), diag (an ellipse constrained to the axes), or spherical (like k-means).
BIC or AIC are used to determine the optimal number of clusters using the elbow diagram; the former usually recommends a simpler model. Note that the number of clusters or components measures how well the GMM works as a density estimator, not as a clustering algorithm.
from sklearn.mixture import GaussianMixture

input_gmm = normal.values
bic_list = []
aic_list = []
ranges = range(1,30)
for i in ranges:
gmm = GaussianMixture(n_components=i).fit(input_gmm)
# BIC
bic = gmm.bic(input_gmm)
bic_list.append(bic)
# AIC
aic = gmm.aic(input_gmm)
aic_list.append(aic)
plt.figure(figsize=(10, 5))
plt.plot(ranges, bic_list, label='BIC');
plt.plot(ranges, aic_list, label='AIC');
plt.legend(loc='best');
Agglomerative Clustering is a type of hierarchical clustering technique used to build clusters from bottom up. Divisive
Clustering is the opposite method of building clusters from top down, which is not available in sklearn.
Methods of linking clusters together.
The AgglomerativeClustering method in sklearn allows the clustering to be chosen by the number of clusters or by a distance threshold.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state = 10)

cls = AgglomerativeClustering(n_clusters = 3)  # instantiation reconstructed; cut off in the original
cls_assignment = cls.fit_predict(X)
One of the benefits of this clustering is that a hierarchy can be built via a dendrogram. We have to recompute the clustering using the ward function.

# BUILD DENDROGRAM
from scipy.cluster.hierarchy import ward, dendrogram

Z = ward(X)
plt.figure(figsize=(10,5));
dendrogram(Z, orientation='left', leaf_font_size=8)
plt.show()
More: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
In essence, we can also use the 3-step method above to compute agglomerative clustering.
# 1. clustering
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method='ward', metric='euclidean')
# 2. draw dendrogram
plt.figure(figsize=(10,5));
dendrogram(Z, orientation='left', leaf_font_size=8)
plt.show()
# 3. flatten cluster
distance_threshold = 10
y = fcluster(Z, distance_threshold, criterion='distance')
sklearn's agglomerative clustering is very slow; the alternative fastcluster library performs much faster as it is a C++ library with a python interface.
More: https://pypi.org/project/fastcluster/
import fastcluster
from scipy.cluster.hierarchy import dendrogram, fcluster
# 1. clustering
Z = fastcluster.linkage_vector(X, method='ward', metric='euclidean')
Z_df = pd.DataFrame(data=Z, columns=['clusterOne','clusterTwo','distance','newClusterSize'])
# 2. draw dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z, orientation='left', leaf_font_size=8)
plt.show();
# 3. flatten cluster
Then we select the distance threshold to cut the dendrogram to obtain the selected clustering level. The output is the
cluster labelled for each row of data. As expected from the dendrogram, a cut at 2000 gives us 5 clusters.
This link gives an excellent tutorial on prettifying the dendrogram. http://datanongrata.com/2019/04/27/67/
10.2.4 DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) requires the data to be scaled/normalised. DBSCAN works by identifying crowded regions, referred to as dense regions.
Key parameters are eps and min_samples. If there are at least min_samples many data points within a distance of
eps to a given data point, that point will be classified as a core sample. Core samples that are closer to each other than
the distance eps are put into the same cluster by DBSCAN.
There is recently a new method called HDBSCAN (H = Hierarchical). https://hdbscan.readthedocs.io/en/latest/index.html
Methodology
1. Pick an arbitrary point to start
2. Find all points with distance eps or less from that point
3. If there are more than min_samples points within distance eps, the point is labelled as a core sample and assigned a new cluster label
4. Then all neighbours within eps of the point are visited
5. If they are core samples their neighbours are visited in turn and so on
6. The cluster thus grows till there are no more core samples within distance eps of the cluster
7. Then, another point that has not been visited is picked, and step 1-6 is repeated
8. 3 kinds of points are generated in the end: core points, boundary points, and noise
9. Boundary points are within distance eps of a core point, but are not core samples themselves
from sklearn.cluster import DBSCAN

# import & instantiation added; parameter values are illustrative
dbscan = DBSCAN(eps=2, min_samples=2)
cls = dbscan.fit_predict(X)
print("Cluster membership values:\n{}".format(cls))

Cluster membership values:
[ 0 1 0 2 0 0 0 2 2 -1 1 2 0 0 -1 0 0 1 -1 1 1 2 2 2 1]
# -1 indicates noise or outliers

# plot_labelled_scatter is a custom plotting helper
plot_labelled_scatter(X, cls + 1,
                      ['Noise', 'Cluster 0', 'Cluster 1', 'Cluster 2'])
These methods require training on data representing the normal state(s), allowing outliers to be detected when they lie outside the trained state.
One-class SVM is an unsupervised algorithm that learns a decision function for outlier detection: classifying new data
as similar or different to the training set.
Besides the kernel, two other parameters are important: the nu parameter should be the proportion of outliers you expect
to observe (in our case around 2%), while the gamma parameter determines the smoothing of the contour lines.
from sklearn.svm import OneClassSVM

# instantiation added; nu set to the expected outlier proportion (~2%)
clf = OneClassSVM(nu=0.02, gamma='auto')
clf.fit(X_train)

y_pred_test = clf.predict(X_test)
y_pred_test
# -1 are outliers
# array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1])
We can also get the average anomaly scores. The lower, the more abnormal. Negative scores represent outliers,
positive scores represent inliers.
clf.decision_function(X_test)
array([ 0.14528263, 0.14528263, -0.08450298, 0.14528263, 0.14528263,
0.14528263, 0.14528263, 0.14528263, 0.14528263, -0.14279962,
0.14528263, 0.14528263, -0.05483886, -0.10086102, 0.14528263,
0.14528263])
Euclidean distance is the straight line distance between points, while cosine distance is the cosine of the angle between
these two points.
from scipy.spatial.distance import euclidean, cosine

euclidean([1,2],[1,3])
# 1
cosine([1,2],[1,3])
# 0.010050506338833642
Mahalonobis distance is the distance between a point and a distribution, not between two distinct points. Therefore, it
is effectively a multivariate equivalent of the Euclidean distance.
https://www.machinelearningplus.com/statistics/mahalanobis-distance/
The (squared) distance is computed as D² = (x − m)ᵀ · C⁻¹ · (x − m), where:
• x: is the vector of the observation (row in a dataset),
• m: is the vector of mean values of independent variables (mean of each column),
• C^(-1): is the inverse covariance matrix of independent variables.
Multiplying by the inverse covariance matrix essentially means dividing the input by the covariance. If features in
your dataset are strongly correlated, the covariance will be high, and dividing by a large covariance will effectively
reduce the distance along those directions.
While powerful, its use of correlation can be detrimental when there is multicollinearity (strong correlations among
features).
import pandas as pd
import numpy as np
from scipy.spatial.distance import mahalanobis

# function body reconstructed: distance of an observation x from a DataFrame `data`
def mahalanobis_distance(x, data):
    m = data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data.values.T))
    distanceMD = mahalanobis(x, m, inv_cov)
    return distanceMD
If two time series are identical, but one is shifted slightly along the time axis, then Euclidean distance may consider
them to be very different from each other. DTW was introduced to overcome this limitation and give intuitive distance
measurements between time series by ignoring both global and local shifts in the time dimension.
DTW is a technique that finds the optimal alignment between two time series, even if one time series is “warped”
non-linearly by stretching or shrinking it along its time axis. Dynamic time warping is often used in speech recognition
to determine if two waveforms represent the same spoken phrase. In a speech waveform, the duration of each spoken
sound and the interval between sounds are permitted to vary, but the overall speech waveforms must be similar.
From the creators of FastDTW: it produces a minimum-distance warp path between two time series that is
nearly optimal (standard DTW is optimal, but has quadratic time and space complexity).
Output: Identical = 0, Difference > 0
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# example arrays reconstructed from the fastdtw README
x = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([[2, 2], [3, 3], [4, 4]])
distance, path = fastdtw(x, y, dist=euclidean)
print(distance)
# 2.8284271247461903
SAX, developed in 2007, compares the similarity of two time-series patterns by slicing them into horizontal & vertical
regions, and comparing each of them. This can be easily explained by the 4 charts provided at https://jmotif.github.io/sax-vsm_site/morea/algorithm/SAX.html.
There are obvious benefits to such an algorithm; for one, it is very fast, as pattern matching is aggregated.
However, the biggest downside is that both time-series signals have to be of the same time length.
Each signal is first z-normalised and divided along the time axis into equal segments, whose mean values form the
Piecewise Aggregate Approximation (PAA). Each signal value, i.e., the y-axis, is then sliced horizontally into regions,
and each region is assigned an alphabet.
Lastly, we use a distance scoring metric, through a fixed lookup table, to easily calculate the total score between each
pair of PAA.
E.g., if the PAA fall in a region or its immediate adjacent one, we assume they are the same, i.e., distance = 0. Else, a
distance value is assigned. The total distance is then computed to derive a distance metric.
For this instance:
• SAX transform of ts1 into string through 9-points PAA: “abddccbaa”
• SAX transform of ts2 into string through 9-points PAA: “abbccddba”
• SAX distance: 0 + 0 + 0.67 + 0 + 0 + 0 + 0.67 + 0 + 0 = 1.34
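The example distance above can be reproduced with a small sketch. The helper name is hypothetical; the adjacency rule and the 0.67 lookup value are taken directly from the example:

```python
# MINDIST-style scoring between two SAX strings: symbols in the same or an
# adjacent alphabet region score 0, otherwise a fixed lookup value applies.
# The 0.67 value and both strings come from the example above.
def sax_distance(a, b, far=0.67):
    return sum(0.0 if abs(ord(x) - ord(y)) <= 1 else far
               for x, y in zip(a, b))

print(round(sax_distance("abddccbaa", "abbccddba"), 2))
# 1.34
```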
This is the code from the package saxpy. Unfortunately, it does not have the option of calculating the SAX distance.

import numpy as np
from saxpy.znorm import znorm
from saxpy.paa import paa
from saxpy.sax import ts_to_string
from saxpy.alphabet import cuts_for_asize

# helper reconstructed from the imports above; segment & alphabet sizes illustrative
def saxpy_sax(signal, paa_segments=9, alphabet_size=4):
    sig_znorm = znorm(signal)
    sig_paa = paa(sig_znorm, paa_segments)
    return ts_to_string(sig_paa, cuts_for_asize(alphabet_size))

sig1a = saxpy_sax(sig1)
sig2a = saxpy_sax(sig2)
Another more mature package is tslearn. It enables the calculation of the SAX distance, but the SAX alphabets are set as
integers instead.

from tslearn.piecewise import SymbolicAggregateApproximation

# instantiation added; segment & alphabet sizes illustrative
sax = SymbolicAggregateApproximation(n_segments=4, alphabet_size_avg=4)

# sig1_n, sig2_n are the normalised input signals
sax_data = sax.fit_transform([sig1_n,sig2_n])

# distance measure
distance = sax.distance_sax(sax_data[0],sax_data[1])
# [[[0]
# [3]
# [3]
# [1]]
# [[0]
# [1]
# [2]
# [3]]]
# 1.8471662549420924
Deep Learning
Deep Learning falls under the broad class of Artificial Intelligence > Machine Learning. It is a Machine Learning
technique that uses multiple internal layers (hidden layers) of non-linear processing units (neurons) to conduct supervised
or unsupervised learning from data.
11.1 Introduction
11.1.1 GPU
Tensorflow is able to run faster and more efficiently using Nvidia's GPU: pip install tensorflow-gpu.
CUDA as well as cuDNN are also required. It is best to run your models in Ubuntu, as the compilation of some pretrained
models is easier.
11.1.2 Preprocessing
Keras accepts numpy input, so we have to convert. Also, for multi-class classification, we need to
convert the labels into binary values, i.e., using one-hot encoding. Alternatively, we can use
sparse_categorical_crossentropy as the loss function, which can process the multi-class labels without
converting them to one-hot encoding.
It is important to scale or normalise the dataset before putting in the neural network.
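A minimal sketch of both steps on toy data (the array values are illustrative), using numpy to mimic what keras.utils.to_categorical and a min-max scaler do:

```python
import numpy as np

# toy multi-class labels
y = np.array([0, 2, 1, 2])

# one-hot encode: equivalent to keras.utils.to_categorical(y, num_classes=3)
y_onehot = np.eye(3)[y]

# min-max scale each feature column to [0, 1] before feeding the network
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```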
11.1.3 Evaluation
The fitted model has a history attribute (model.history.history) that gives the accuracy and loss for both
train & test sets at each epoch. We can plot it out for better visualization. Alternatively, we can also use
TensorBoard, which is installed together with the TensorFlow package. It will also draw the model architecture.
import matplotlib.pyplot as plt

# function signature & history lookup reconstructed from the calls below
def plot_validate(model, loss_acc):
    history = model.history.history
    if loss_acc == 'loss':
        axis_title = 'loss'
        title = 'Loss'
    elif loss_acc == 'acc':
        axis_title = 'acc'
        title = 'Accuracy'
    plt.figure(figsize=(15,4))
    plt.plot(history[axis_title])
    plt.plot(history['val_' + axis_title])
    plt.title('Model ' + title)
    plt.ylabel(title)
    plt.xlabel('Epoch')
    plt.grid(b=True, which='major')
    plt.minorticks_on()
    plt.grid(b=True, which='minor', alpha=0.2)
    plt.legend(['Train', 'Test'])
    plt.show()

plot_validate(model, 'acc')
plot_validate(model, 'loss')
11.1.4 Auto-Tuning
Unlike grid search, we can use Bayesian optimization for faster hyperparameter tuning.
https://www.dlology.com/blog/how-to-do-hyperparameter-search-with-baysian-optimization-for-keras-model/
https://medium.com/@crawftv/parameter-hyperparameter-tuning-with-bayesian-optimization-7acf42d348e1
ReLU (Rectified Linear Units) is very popular compared to the now mostly obsolete sigmoid & tanh functions, because
it avoids the vanishing gradient problem and has faster convergence. However, ReLU can only be used in hidden layers.
Also, some gradients can be fragile during training and can die: a weight update can make a neuron never
activate on any data point again. Simply put, ReLU can result in dead neurons.
To fix this problem of dying neurons, a modification called Leaky ReLU was introduced. It introduces a small slope
to keep the updates alive. There is another variant, made from both ReLU and Leaky ReLU, called the Maxout function.
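A quick numpy sketch of the two functions (the alpha slope value is illustrative):

```python
import numpy as np

def relu(x):
    # zero for negative inputs: gradients there are zero, so neurons can "die"
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # a small slope alpha keeps a non-zero gradient for negative inputs
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 1.5])
print(relu(x))        # negative values clipped to 0
print(leaky_relu(x))  # negative values scaled by alpha
```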
Fig. 3: https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural
networks using gradient descent.
• Optimizer is the learning algorithm, classically gradient descent: “gradient” refers to the calculation of an error
gradient or slope of error, and “descent” refers to moving down along that slope towards some minimum level of error.
• Batch Size is a hyperparameter of gradient descent that controls the number of training samples to work through
before the model’s internal parameters are updated.
• Epoch is a hyperparameter of gradient descent that controls the number of complete passes through the training
dataset.
Optimizers are used to find the minimum value of the cost function during backward propagation. There are more
advanced adaptive optimizers, like AdaGrad/RMSprop/Adam, that allow the learning rate to adapt to the size of the
gradient. These hyperparameters are essential to get the model to perform well.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.” Specifically,
the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive
value, often in the range between 0.0 and 1.0. A learning rate that is too large can cause the model to converge
too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
(https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/)
Assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs. This
means that the dataset will be divided into 40 batches, each with 5 samples. The model weights will be updated after
each batch of 5 samples. This also means that one epoch will involve 40 batches or 40 updates to the model.
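The arithmetic above, as a sketch:

```python
samples, batch_size, epochs = 200, 5, 1000

batches_per_epoch = samples // batch_size    # 40 batches of 5 samples each
total_updates = batches_per_epoch * epochs   # one weight update per batch

print(batches_per_epoch, total_updates)
# 40 40000
```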
More here:
• https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/.
• https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
• https://blog.usejournal.com/stock-market-prediction-by-recurrent-neural-network-on-lstm-model-56de700bff68
Fig. 4: From Udemy, Zero to Hero Deep Learning with Python & Keras
11.3 ANN
11.3.1 Theory
An artificial neural network is the most basic form of neural network. It consists of an input layer, hidden layers, and
an output layer. This writeup by Berkeley gives an excellent introduction to the theory. Most of the diagrams are taken
from that site.
Zooming in on a single perceptron, the input layer consists of the individual features, each with an assigned weight,
feeding into the hidden layer. An activation function determines the perceptron's output.
Activation functions include ReLU, Tanh, Linear, Sigmoid, Softmax and many others. Sigmoid is used for binary
classification, while softmax is used for multi-class classification.
The backward propagation algorithm works such that the slopes for gradient descent are calculated by working
backwards from the output layer to the input layer. The weights are readjusted to reduce the loss and improve the
accuracy of the model.
A summary is as follows:
1. Randomly initialize the weights for all the nodes.
2. For every training example, perform a forward pass using the current weights, and calculate the output of each
node going from left to right. The final output is the value of the last node.
3. Compare the final output with the actual target in the training data, and measure the error using a loss function.
4. Perform a backwards pass from right to left and propagate the error to every individual node using backprop-
agation. Calculate each weight’s contribution to the error, and adjust the weights accordingly using gradient
descent. Propagate the error gradients back starting from the last layer.
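The four steps can be sketched for a single sigmoid neuron on a toy AND-style dataset (the data, learning rate and iteration count are all illustrative):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])  # AND-style target

rng = np.random.default_rng(0)
w, b, lr = rng.normal(size=2), 0.0, 0.5   # step 1: random initial weights

for _ in range(2000):
    # step 2: forward pass through the sigmoid activation
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # step 3: error measured by binary cross-entropy; its gradient w.r.t. the
    # pre-activation is simply (p - y)
    grad = (p - y) / len(y)
    # step 4: backward pass, adjusting weights with gradient descent
    w -= lr * (X.T @ grad)
    b -= lr * grad.sum()

print((p > 0.5).astype(int))  # converges towards the AND targets
```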
Before training, the model needs to be compiled with the learning hyperparameters of optimizer, loss, and metric
functions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model():
    model = Sequential()
    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# convert the 0-9 labels into "one-hot" format, as we did for TensorFlow.
train_labels = keras.utils.to_categorical(mnist_train_labels, 10)
test_labels = keras.utils.to_categorical(mnist_test_labels, 10)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer=RMSprop(),
metrics=['accuracy'])
# wrapper reconstructed; the hidden layers are illustrative for the iris data below
def ann(X_train, y_train, X_test, y_test, classes, epoch, batch, verbose):
    model = Sequential()
    model.add(Dense(6, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(classes, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, y_train,
              batch_size=batch,
              epochs=epoch,
              verbose=verbose,
              validation_data=(X_test, y_test))
    return model
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris['data'], columns=iris['feature_names'])
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
11.4 CNN
Convolutional Neural Network (CNN) is suitable for unstructured data like image classification, machine translation,
sentence classification, and sentiment analysis.
11.4.1 Theory
This article from Medium gives a good introduction to CNNs. The steps go something like this:
1. Provide the input image to the convolution layer
2. Choose parameters, apply filters with strides, and padding if required. Perform convolution on the image and apply
ReLU activation to the matrix.
3. Perform pooling to reduce dimensionality size. Max-pooling is most commonly used
4. Add as many convolutional layers until satisfied
5. Flatten the output and feed into a fully connected layer (FC Layer)
6. Output the class using an activation function (Logistic Regression with cost functions) and classifies images.
There are many topologies, or CNN architectures, to build on, as the possible hyperparameters, layers etc. are endless. Some
specialized architectures include LeNet-5 (handwriting recognition), AlexNet (deeper than LeNet, image classification),
GoogLeNet (deeper than AlexNet, includes inception modules, or groups of convolution), and ResNet (even deeper,
maintains performance using skip connections). This article gives a good summary of each architecture.
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import RMSprop
# input sizes and the layers after the first Conv2D are reconstructed from the
# model summary that follows (28x28 single-channel MNIST images)
sample_rows, sample_columns, num_channels = 28, 28, 1

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(sample_rows, sample_columns, num_channels)))
# 64 3x3 kernels
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.summary()
# _________________________________________________________________
# Layer (type) Output Shape Param #
# =================================================================
# conv2d (Conv2D) (None, 26, 26, 32) 320
# _________________________________________________________________
# conv2d_1 (Conv2D) (None, 24, 24, 64) 18496
# _________________________________________________________________
# max_pooling2d (MaxPooling2D) (None, 12, 12, 64) 0
# _________________________________________________________________
# dropout (Dropout) (None, 12, 12, 64) 0
# _________________________________________________________________
# flatten (Flatten) (None, 9216) 0
# _________________________________________________________________
# dense (Dense) (None, 128) 1179776
# _________________________________________________________________
# dropout_1 (Dropout) (None, 128) 0
# _________________________________________________________________
# dense_1 (Dense) (None, 10) 1290
# =================================================================
# Total params: 1,199,882
# Trainable params: 1,199,882
# Non-trainable params: 0
# _________________________________________________________________
model.compile(loss='sparse_categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
It is hard to obtain photographic samples of every variation. Image augmentation enables the auto-generation of new
samples from existing ones through random adjustments of rotation, shifts, zoom, brightness etc. The samples below
pertain to increasing samples when all classes are balanced.
After setting the augmentation settings, we need to decide how to “flow” the original samples into the model.
In this function, we can also resize the images automatically if necessary. Finally, to fit the model, we use the
model.fit_generator function so that for every epoch, the full set of original samples is augmented randomly on the fly.
They are not stored in memory for obvious reasons.
Essentially, there are 3 ways to do this. First, we can flow the images from memory with flow, which means we have to
load the data into memory first.
batch_size = 32
img_size = 100

# flow calls reconstructed; `aug` is the ImageDataGenerator defined below, and
# the train/validation arrays are assumed already loaded in memory
train_flow = aug.flow(X_train, y_train, batch_size=batch_size)
val_flow = aug.flow(X_val, y_val, batch_size=batch_size)

model.fit_generator(train_flow,
                    steps_per_epoch=32,
                    epochs=15,
                    verbose=1,
                    validation_data=val_flow,
                    use_multiprocessing=True,
                    workers=2)
Second, we can flow the images with flow_from_dataframe, where all classes of images are in a
single directory. This requires a dataframe which indicates which image corresponds to which class.
dir = r'/kaggle/input/plant-pathology-2020-fgvc7/images'
train_flow = train_aug.flow_from_dataframe(train_df,
                                           directory=dir,
                                           x_col='image_name',
                                           y_col='label',  # remaining arguments assumed typical
                                           class_mode='categorical',
                                           target_size=(img_size, img_size),
                                           batch_size=32)
Third, we can flow the images from a main directory with flow_from_directory, where each class of images is
in its own subdirectory.
train_flow = train_aug.flow_from_directory(directory=dir,
classes=['subdir1', 'subdir2', 'subdir3'],
class_mode='categorical',
target_size=(img_size,img_size),
batch_size=32)
We can also use Keras's ImageDataGenerator to generate new augmented images when there is class imbalance.
Imbalanced data can cause the model to predict only the class with the highest number of samples.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

img = r'/Users/Desktop/post/IMG_20200308_092140.jpg'

# augmentation settings
aug = ImageDataGenerator(rotation_range=15,
width_shift_range=0.1,
height_shift_range=0.1,
shear_range=0.01,
zoom_range=[0.9, 1.25],
horizontal_flip=True,
vertical_flip=False,
fill_mode='reflect',
data_format='channels_last',
brightness_range=[0.5, 1.5])
For CNNs, because of the huge amount of research done and the complexity of the architectures, we can use existing
ones. A recent one is EfficientNet by Google, which can achieve higher accuracy with fewer parameters.
For transfer learning for image recognition, the de facto standard weights are from imagenet, which we can specify
under the weights argument.
# wrapper & base model reconstructed around the original layers
import efficientnet.tfkeras as efn

def model(input_shape, classes):
    base = efn.EfficientNetB3(input_shape=input_shape,
                              weights='imagenet', include_top=False)
    model = Sequential()
    model.add(base)
    model.add(GlobalAveragePooling2D())
    model.add(Dense(classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
# alternatively, using the functional API (closing lines reconstructed)...
def model(input_shape, classes):
    model = efn.EfficientNetB3(input_shape=input_shape, weights='imagenet',
                               include_top=False)
    x = model.output
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(classes, activation='softmax')(x)
    return Model(inputs=model.input, outputs=x)
11.5 RNN
Recurrent Neural Network (RNN). A typical RNN looks like the diagram below, where X(t) is the input, h(t) is the output,
and A is the neural network which gains information from the previous step in a loop. The output of one unit goes into
the next one, and the information is passed on.
11.5.1 Theory
Long Short Term Memory (LSTM) is a special kind of Recurrent Neural Network (RNN) with the capability of
learning long-term dependencies. The intricacies lie within the cell, where 3 internal mechanisms called gates regulate
the flow of information. This involves 4 activation functions, 3 sigmoid and 1 tanh, instead of the typical single
activation function. This Medium article gives a good description of it. An alternative, or simplified, form of LSTM
is the Gated Recurrent Unit (GRU).
LSTM requires the input to be of shape (num_sample, time_steps, num_features) when using the tensorflow
backend. This can be produced using keras's TimeseriesGenerator.
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

X = [1,2,3,4,5,6,7,8,9,10]
y = [5,6,7,8,9,1,2,3,4,5]

time_steps = 6
stride = 1
num_sample = 4

data = TimeseriesGenerator(X, y,
                           length=time_steps,
                           stride=stride,
                           batch_size=num_sample)

data[0]
# (array([[1, 2, 3, 4, 5, 6],
#         [2, 3, 4, 5, 6, 7],
#         [3, 4, 5, 6, 7, 8],
#         [4, 5, 6, 7, 8, 9]]), array([2, 3, 4, 5]))
# note that the y-label is the next time step away

X = data[0][0]
y = data[0][1]
The code below uses LSTM for sentiment analysis in IMDB movie reviews.
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb
# embedding layer converts input data into dense vectors of fixed size of
# 20k words & 128 hidden neurons, better suited for neural network
model = Sequential()
model.add(Embedding(20000, 128)) #for nlp
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) #128 memory cells
model.add(Dense(1, activation='sigmoid')) #sigmoid for binary classification
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=1,
          validation_data=(x_test, y_test))
import numpy as np
import pandas as pd

def lstm(X_train, y_train, X_test, y_test, classes, epoch, batch, verbose, dropout):
    model = Sequential()
    # return_sequences refers to all the outputs of the memory cells;
    # True if the next layer is another LSTM
    # (the LSTM layer below is reconstructed; the original line was lost)
    model.add(LSTM(128, dropout=dropout, return_sequences=False,
                   input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, y_train,
              batch_size=batch,
              epochs=epoch,
              verbose=verbose,
              validation_data=(X_test, y_test))
    return model
df = stock('S68', 10)
# train-test split-------------
df1 = df[:2400]
df2 = df[2400:]
X_train = df1[['High','Low','Open','Close','Volume']].values
y_train = df1['change'].values
X_test = df2[['High','Low','Open','Close','Volume']].values
y_test = df2['change'].values
# normalisation-------------
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# reshape into (num_sample, time_steps, num_features)-------------
# time_steps, sampling_rate & num_sample are assumed defined as before
data = TimeseriesGenerator(X_train, y_train,
                           length=time_steps,
                           sampling_rate=sampling_rate,
                           batch_size=num_sample)
X_train = data[0][0]
y_train = data[0][1]
# model validation-------------
classes = 1
epoch = 2000
batch = 200
verbose = 0
dropout = 0.2
From the Keras documentation, it is not recommended to save the model in pickle format. Keras allows saving in the
HDF5 format via model.save, which saves the entire model architecture, weights and optimizer state.
Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in
an environment so as to maximize some notion of cumulative reward.
12.1 Concepts
Basic Elements
Term Description
Agent A model/algorithm that is tasked with learning to accomplish a task
Environment The world where the agent acts in
Action A decision the agent makes in an environment
Reward Signal A scalar indication of how well the agent is performing a task
State A description of the environment that can be perceived by the agent
Terminal State A state at which no further actions can be made by an agent
Fig. 1: https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html
Term Description
Policy (𝜋) Function that outputs the decisions the agent makes. In simple terms, it instructs what the agent should
do at each state.
Value Function Function that describes how good or bad a state is. It is the total amount of reward an agent is
predicted to accumulate over the future, starting from a state.
Model of Environment Predicts how the environment will react to the agent's actions: given a state & action, what is the
next state and reward. Such an approach is called a model-based method, in contrast with model-free methods.
Reinforcement learning helps to solve the Markov Decision Process (MDP). The core problem of MDPs is to find a
“policy” for the decision maker: a function 𝜋 that specifies the action 𝜋(s) that the decision maker will choose when
in state s. The diagram illustrates the Markov Decision Process.
12.2 Q-Learning
Q-Learning is an example of model-free reinforcement learning to solve the Markov Decision Process. It derives the
policy by directly looking at the data instead of developing a model.
We first build a Q-table with each column as the type of action possible, and then each row as the number of possible
states. And initialise the table with all zeros.
Updating the function Q uses the following Bellman equation; algorithms using such an equation as an iterative update
are called value iteration algorithms:
Q(s,a) ← Q(s,a) + 𝛼 * (R(s,a) + 𝛾 * max_a' Q(s',a') − Q(s,a))
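A sketch of this update on a zero-initialised Q-table (state/action counts and hyperparameter values are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4
qtable = np.zeros((n_states, n_actions))  # rows = states, columns = actions

alpha, gamma = 0.8, 0.95  # learning rate, discount rate

def q_update(state, action, reward, new_state):
    # Bellman update: move Q(s,a) towards reward + gamma * max_a' Q(s',a')
    qtable[state, action] += alpha * (
        reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

q_update(0, 2, 1.0, 1)
print(qtable[0, 2])
# 0.8 after a single reward of 1.0 on an all-zero table
```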
Learning Hyperparameters
• Learning Rate (𝛼): how quickly a network abandons the former value for the new. If the learning rate is 1, the
new estimate will be the new Q-value.
• Discount Rate (𝛾): how much to discount the future reward. The idea is that the later a reward comes, the less
valuable it becomes. Think inflation of money in the real world.
Exploration vs Exploitation
A central dilemma of reinforcement learning is to exploit what it has already experienced in order to obtain a reward.
But in order to do that, it has to explore in order to make better actions in the future.
This is known as the epsilon greedy strategy. In the beginning, the epsilon rate is higher: the bot explores
the environment and randomly chooses actions, the logic being that it does not yet know anything about the
environment. The more the bot explores the environment, the more the epsilon rate decreases, and the bot
starts to exploit the environment.
There are other algorithms to manage the exploration vs exploitation problem, like softmax.
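A sketch of epsilon-greedy action selection (table shape and the pre-filled Q values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
qtable = np.zeros((16, 4))
qtable[0] = [0.1, 0.9, 0.2, 0.0]  # pretend state 0 has already been learnt

def choose_action(state, epsilon):
    # explore with probability epsilon, otherwise exploit the best known action
    if rng.random() < epsilon:
        return int(rng.integers(4))
    return int(np.argmax(qtable[state]))

print(choose_action(0, 0.0))  # epsilon 0: always exploits the argmax
```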
Definitions
• argmax(x): position where the first max value occurs
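For example:

```python
import numpy as np

# argmax returns the position of the first occurrence of the maximum
print(np.argmax([3, 7, 7, 1]))
# 1
```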
Code
Start the environment and training parameters for frozen lake in AI gym.
import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")
action_size = env.action_space.n
state_size = env.observation_space.n
# Exploration parameters
epsilon = 1.0 # Exploration rate
max_epsilon = 1.0 # Exploration probability at start
min_epsilon = 0.01 # Minimum exploration probability
decay_rate = 0.005 # Exponential decay rate for exploration prob
exp_exp_tradeoff = random.uniform(0, 1)  # random draw (reconstructed; lost in extraction)

## If this number > epsilon --> exploitation (taking the biggest Q value for this state)
if exp_exp_tradeoff > epsilon:
    action = np.argmax(qtable[state,:])
# Take the action (a) and observe the outcome state(s') and reward (r)
new_state, reward, done, info = env.step(action)
total_rewards += reward
# Take the action (index) that has the maximum expected future reward given that state
action = np.argmax(qtable[state,:])
if done:
    # Here, we decide to only print the last state (to see if our agent is on
    # the goal or fell into a hole)
    env.render()
12.3 Resources
• https://towardsdatascience.com/reinforcement-learning-implement-grid-world-from-scratch-c5963765ebff
• https://medium.com/swlh/introduction-to-reinforcement-learning-coding-q-learning-part-3-9778366a41c0
• https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978
• https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-n
Evaluation
Sklearn provides a good list of evaluation metrics for classification, regression and clustering problems.
http://scikit-learn.org/stable/modules/model_evaluation.html
In addition, it is also essential to know how to analyse the features and adjust hyperparameters based on different
evaluation metrics.
13.1 Classification
Fig. 1: Wikipedia
Recall|Sensitivity: (True Positive / [True Positive + False Negative]). High recall means retrieving all the actual
positives, even at the cost of some false positives. Examples: search & extraction in legal cases, tumour detection.
These often need humans to filter the false positives.
Fig. 2: Wikipedia
Fig. 3: https://www.youtube.com/watch?v=21Igj5Pr6u4
Precision: (True Positive / [True Positive + False Positive]). High precision means it is important to filter off any
false positives. Examples: search query suggestion, document classification, customer-facing tasks.
F1-Score: the harmonic mean of precision and sensitivity, i.e., 2*((precision * recall) / (precision + recall))
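These formulas, worked through on a hypothetical set of confusion counts:

```python
# hypothetical confusion counts
tp, fp, fn = 60, 16, 40

precision = tp / (tp + fp)   # how many flagged positives are real
recall = tp / (tp + fn)      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
# 0.79 0.6 0.68
```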
1. Confusion Matrix
Plain vanilla matrix. Not very useful, as it does not show the labels. However, the matrix can be used to build a heatmap
using plotly directly.
print (sklearn.metrics.confusion_matrix(test_target,predictions))
array([[288, 64, 1, 0, 7, 3, 31],
[104, 268, 11, 0, 43, 15, 5],
[ 0, 5, 367, 15, 6, 46, 0],
[ 0, 0, 11, 416, 0, 4, 0],
[ 1, 13, 5, 0, 424, 4, 0],
[ 0, 5, 75, 22, 4, 337, 0],
[ 20, 0, 0, 0, 0, 0, 404]])
# this gives the values of each cell, but the api is unable to change the layout size
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import iplot

# x is the confusion matrix array; title is the list of class labels
layout = go.Layout(width=800, height=500)
data = ff.create_annotated_heatmap(z=x, x=title, y=title)
iplot(data)
With pandas crosstab: convert the encoding into labels and put the two pandas series into a crosstab.

def forest(x):
    if x==1:
        return 'Spruce/Fir'
    elif x==2:
        return 'Lodgepole Pine'
    elif x==3:
        return 'Ponderosa Pine'
    elif x==4:
        return 'Cottonwood/Willow'
    elif x==5:
        return 'Aspen'
    elif x==6:
        return 'Douglas-fir'
    elif x==7:
        return 'Krummholz'

# map encodings to labels, then cross-tabulate actual vs predicted
# (series names assumed from the confusion matrix example above)
pd.crosstab(pd.Series(test_target).apply(forest),
            pd.Series(predictions).apply(forest))
Using a heatmap.
2. Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# score calls reconstructed; y_test & y_predict come from a fitted binary classifier
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predict)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predict)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predict)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predict)))

# Accuracy: 0.95
# Precision: 0.79
# Recall: 0.60
# F1: 0.68
There are many other evaluation metrics, a list can be found here:
from sklearn.metrics.scorer import SCORERS

for i in sorted(list(SCORERS.keys())):
    print(i)
accuracy
adjusted_rand_score
average_precision
f1
f1_macro
3. Classification Report
The classification report shows the details of the precision, recall & f1-scores. It might be misleading to just print out
the scores of a binary classification, as sklearn's determination of which class is the positive one might differ from
ours. The report teases out the details, as shown below. We can also set average=None and compute the mean when
printing out each individual score.
print('accuracy:\t', accuracy)
print('\nf1:\t\t',f1)
print('recall\t\t',recall)
print('\nf1_avg:\t\t',f1_avg)
print('recall_avg\t',recall_avg)
print('precision_avg\t',precision_avg)
print('\nConfusion Matrix')
print(confusion)
print('\n',classification_report(y_test, y_predict))
4. Decision Function
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
[(0, -23.176682692580048),
(0, -13.541079101203881),
(0, -21.722576315155052),
(0, -18.90752748077151),
(0, -19.735941639551616),
(0, -9.7494967330877031),
(1, 5.2346395208185506),
(0, -19.307366394398947),
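The snippet above is truncated; a self-contained sketch of pairing labels with decision_function scores might look like this (synthetic data, not the dataset used above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary dataset, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# signed distance of each sample from the decision boundary:
# positive scores lean towards class 1, negative towards class 0
y_scores = lr.decision_function(X_test)
y_score_list = list(zip(y_test, y_scores))
```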
5. Probability Function
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

# note that the first column of the array indicates the probability of
# predicting the negative class
[(0, 8.5999236926158807e-11),
(0, 1.31578065170999e-06),
(0, 3.6813318939966053e-10),
(0, 6.1456121155693793e-09),
(0, 2.6840428788564424e-09),
(0, 5.8320607398268079e-05),
(1, 0.99469949997393026),
(0, 4.1201906576825675e-09),
(0, 1.2553305740618937e-11),
(0, 3.3162918920398805e-10),
(0, 3.2460530855408745e-11),
(0, 3.1472051953481208e-09),
(0, 1.5699022391384567e-10),
(0, 1.9921654858205874e-05),
(0, 6.7057057309326073e-06),
(0, 1.704597440356912e-05),
(1, 0.99998640688336282),
(0, 9.8530840165646881e-13),
(0, 2.6020404794341749e-06),
(0, 5.9441185633886803e-12)]
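A self-contained sketch of the same idea with predict_proba (synthetic data again, not the dataset used above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary dataset, purely for illustration
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# column 0 holds the probability of the negative class,
# column 1 the probability of the positive class; each row sums to 1
y_proba = lr.predict_proba(X_test)
y_proba_list = list(zip(y_test, y_proba[:, 1]))
```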
If your problem involves kind of searching a needle in the haystack; the positive class samples are very rare compared
to the negative classes, use a precision recall curve.
from sklearn.metrics import precision_recall_curve

# y_scores from the classifier's decision_function or predict_proba
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize=12, fillstyle='none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.show()
Receiver Operating Characteristic (ROC) is used to show the performance of a binary classifier. The Y-axis is the True Positive Rate (Recall) & the X-axis is the False Positive Rate (Fall-Out). The Area Under the Curve (AUC) of a ROC summarises it; the higher the AUC, the better.

The term came about in WWII, where the metric was used to determine a radar receiver operator's ability to correctly distinguish true positives from false positives in radar signals.
Some classifiers have a decision_function method while others have a probability prediction method, and some have
both. Whichever one is available works fine for an ROC curve.
from sklearn.metrics import roc_curve, auc

# y_scores from the classifier's decision_function or predict_proba
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
Logarithmic Loss, or Log Loss, is a popular Kaggle evaluation metric. It measures the performance of a classification model where the prediction input is a probability value between 0 and 1.

Log Loss quantifies the accuracy of a classifier by penalising false classifications; the catch is that Log Loss ramps up very rapidly as the predicted probability approaches 0. This article from datawookie gives a very good explanation.
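A small sketch of the ramp-up effect (the probability values are invented for illustration):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]

# confident and correct predictions -> small penalty
good = log_loss(y_true, [0.9, 0.1, 0.8, 0.95], labels=[0, 1])

# one confident but wrong prediction -> the penalty ramps up sharply
bad = log_loss(y_true, [0.9, 0.1, 0.8, 0.05], labels=[0, 1])
```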
13.2 Regression
For regression problems, where the response or y is a continuous value, it is common to use R-Squared and RMSE, or
MAE as evaluation metrics. This website gives an excellent description on all the variants of errors metrics.
R-squared: percentage of the variability in the dataset that can be explained by the model.
MSE: Mean Squared Error. Square each error (turning negatives into positives), then take the mean.
RMSE: square root of the MSE, which brings the error back to the same scale as the target (since it was initially squared).
MAE: Mean Absolute Error. Take the absolute value of each error, then take the mean.
The RMSE result will always be larger or equal to the MAE. If all of the errors have the same magnitude, then
RMSE=MAE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large
errors. This means the RMSE should be more useful when large errors are particularly undesirable.
# R2
r2_full = fullmodel.score(predictor, target)
r2_trains = model3.score(X_train, y_train)
r2_tests = model3.score(X_test, y_test)
print('\nr2 full:', r2_full)
print('r2 train:', r2_trains)
print('r2 test:', r2_tests)
# get predictions
y_predicted_total = model3.predict(predictor)
y_predicted_train = model3.predict(X_train)
y_predicted_test = model3.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error

# get MSE
MSE_total = mean_squared_error(target, y_predicted_total)
MSE_train = mean_squared_error(y_train, y_predicted_train)
MSE_test = mean_squared_error(y_test, y_predicted_test)
# get MAE
MAE_total = mean_absolute_error(target, y_predicted_total)
MAE_train = mean_absolute_error(y_train, y_predicted_train)
MAE_test = mean_absolute_error(y_test, y_predicted_test)
RMSLE (Root Mean Squared Log Error) is a very popular evaluation metric in data science competitions now. It helps to reduce the effect of outliers compared to RMSE.
More: https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a
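RMSLE is not shown in code above; a minimal sketch (sklearn also offers mean_squared_log_error, of which RMSLE is the square root):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # log1p compresses large values, so a big outlier error
    # contributes far less than it would in plain RMSE
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

perfect = rmsle([100, 200], [100, 200])          # identical -> zero error
off_by_outlier = rmsle([100, 200], [100, 2000])  # 10x outlier, dampened by the log
```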
Takes more time and computation to use k-fold, but well worth the cost. By default, sklearn uses stratified k-fold cross
validation. Another type is ‘leave one out’ cross-validation.
The mean of the final scores among each k model is the most generalised output. This output can be compared to
different model results for comparison.
More here.
cross_val_score is a compact function to obtain all the scoring values using kfold in one line.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = df[df.columns[1:-1]]
y = df['Cover_Type']

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
For greater control, like to define our own evaluation metrics etc., we can use KFold to obtain the train & test indexes
for each fold iteration.
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
model = RandomForestClassifier()
kfold_custom(X, y, model, f1_score)
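The kfold_custom helper is not defined in the excerpt above; a minimal sketch of what such a helper might look like (the function name, signature and synthetic dataset are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def kfold_custom(X, y, model, metric, n_splits=5):
    """Fit & score the model on each fold with a user-defined metric."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(metric(y[test_idx], pred))
    # the mean across folds is the most generalised output
    return np.mean(scores)

X, y = make_classification(n_samples=200, random_state=0)
score = kfold_custom(X, y, RandomForestClassifier(random_state=0),
                     lambda t, p: f1_score(t, p, average='macro'))
```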
There are generally 3 methods of hyperparameters tuning, i.e., Grid-Search, Random-Search, or the more automated
Bayesian tuning.
13.4.1 Grid-Search
From Stackoverflow: systematically working through multiple combinations of parameter tunes, cross-validating each to determine which one gives the best performance. You can work through many combinations, only changing the parameters a bit each time.
Print out the best_params_ and rebuild the model with these optimal parameters.
Simple example.
model = RandomForestClassifier()
grid_values = {'n_estimators':[150,175,200,225]}
grid = GridSearchCV(model, param_grid = grid_values, cv=5)
grid.fit(predictor, target)
print(grid.best_params_)
print(grid.best_score_)
# {'n_estimators': 200}
# 0.786044973545
Others.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# choose a classifier
clf = SVC(kernel='rbf')

# grid search over gamma, optimising the default metric (accuracy);
# the gamma range here is illustrative
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_clf_acc = GridSearchCV(clf, param_grid=grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test)

# the same grid search, optimising AUC instead
grid_clf_auc = GridSearchCV(clf, param_grid=grid_values, scoring='roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test)
# results 1
('Grid best parameter (max. accuracy): ', {'gamma': 0.001})
('Grid best score (accuracy): ', 0.99628804751299183)
# results 2
('Test set AUC: ', 0.99982858122393004)
('Grid best parameter (max. AUC): ', {'gamma': 0.001})
('Grid best score (AUC): ', 0.99987412783021423)
13.4.2 Auto-Tuning
Bayesian Optimization, as the name implies, uses Bayesian optimization with Gaussian processes for auto-tuning. It is one of the most popular packages now for auto-tuning. pip install bayesian-optimization
More: https://github.com/fmfn/BayesianOptimization
# CV scores
scores = catboost.cv(cv_dataset, params, fold_count=3)
# 4) Start optimizing
# init_points: no. steps of random exploration. Helps to diversify random space
# n_iter: no. steps for bayesian optimization. Helps to exploit learnt parameters
optimizer.maximize(init_points=2, n_iter=3)
# objective for the optimizer to maximize; since RMSLE should be
# minimised, return its negative (the wrapper name is illustrative)
def black_box(n_estimators, max_depth):
    params = {'n_estimators': int(n_estimators),
              'max_depth': int(max_depth)}
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    score = rmsle(y_test, y_predict)
    return -score

# Search space
pbounds = {'n_estimators': (1, 5),
           'max_depth': (10, 50)}
Bayesian Tuning and Bandits (BTB) is a package used for auto-tuning ML models hyperparameters. It similarly
uses Gaussian Process to do this, though there is an option for Uniform. It was born from a Master thesis by Laura
Gustafson in 2018. Because it is lower level than the above package, it has better flexibility, e.g., defining a k-fold
cross-validation.
https://github.com/HDI-Project/BTB
from btb.tuning import GP
from btb import HyperParameter, ParamTypes

# the helper name & hyperparameter ranges below are illustrative
def tune_rf(X_train, X_test, y_train, y_test, epoch=20, verbose=1):
    tunables = [('n_estimators', HyperParameter(ParamTypes.INT, [10, 500])),
                ('max_depth', HyperParameter(ParamTypes.INT, [3, 20]))]
    tuner = GP(tunables)

    score_list = []
    param_list = []

    for i in range(epoch):
        parameters = tuner.propose()

        # ** unpacks the dict as keyword arguments
        model = RandomForestClassifier(**parameters, n_jobs=-1)
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        score = accuracy_score(y_test, y_predict)

        # feed the score back so the Gaussian Process can learn
        tuner.add(parameters, score)
        score_list.append(score)
        param_list.append(parameters)

        if verbose==0:
            pass
        elif verbose==1:
            print('epoch: {}, accuracy: {}'.format(i+1, score))
        elif verbose==2:
            print('epoch: {}, accuracy: {}, param: {}'.format(i+1, score, parameters))

    best_s = tuner._best_score
    best_score_index = score_list.index(best_s)
    best_param = param_list[best_score_index]

    print('\nbest accuracy: {}'.format(best_s))
    print('best parameters: {}'.format(best_param))
    return best_param
For regression models, we have to make some slight modifications, since the optimization of hyperparameters is tuned
towards a higher evaluation score.
def tune_rf_reg(X_train, X_test, y_train, y_test, epoch=20, verbose=1):
    # tuner setup (GP with regression-appropriate ranges) as before

    score_list = []
    param_list = []

    for i in range(epoch):
        parameters = tuner.propose()
        model = RandomForestRegressor(**parameters, n_jobs=10, verbose=3)
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        score = np.sqrt(mean_squared_error(y_test, y_predict))

        if verbose==0:
            pass
        elif verbose==1:
            print('epoch: {}, rmse: {}, param: {}'.format(i+1, score, parameters))

        # negate the score before feeding it back, since the tuner maximises
        score = -score
        tuner.add(parameters, score)
        score_list.append(score)
        param_list.append(parameters)

    best_s = tuner._best_score
    best_score_index = score_list.index(best_s)
    best_param = param_list[best_score_index]

    print('\nbest rmse: {}'.format(best_s))
    print('best parameters: {}'.format(best_param))
    return best_param
Auto-Sklearn is another auto-ml package that automatically selects both the model and its hyperparameters.
https://automl.github.io/auto-sklearn/master/
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
Auto-Keras uses neural networks for training, similar to Google's AutoML approach.
import autokeras as ak
clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)
Explainability

While sklearn's supervised models can seem like black boxes, we can derive certain plots and metrics to interpret the outcome and the model better.
Decision trees and other tree ensemble models, by default, allow us to obtain the importance of features. These are
known as impurity-based feature importances.
While powerful, we need to understand its limitations, as described by sklearn.
• they are biased towards high-cardinality (numerical) features
• they are computed on training set statistics and therefore do not reflect the ability of a feature to be useful for making predictions that generalize to the test set (when the model has enough capacity)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
model = rf.fit(X_train, y_train)

# assemble the impurity-based importances into a sorted dataframe
# (the helper body is a minimal reconstruction matching the call below)
def feature_impt(model):
    f_impt = pd.DataFrame(model.feature_importances_,
                          index=X_train.columns,
                          columns=['importance'])
    f_impt = f_impt.sort_values(by='importance', ascending=False)
    return f_impt

f_impt = feature_impt(model)
To overcome the limitations of feature importance, a variant known as permutation importance is available. It also has the benefit of being usable with any model. This Kaggle article provides a good, clear explanation.

It works by shuffling individual features and observing how that affects model accuracy. If a feature is important, the model accuracy will drop noticeably; if it is not important, the accuracy should be affected a lot less.
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
sorted_idx = result.importances_mean.argsort()

plt.figure(figsize=(12,10))
plt.boxplot(result.importances[sorted_idx].T,
            vert=False, labels=X.columns[sorted_idx]);
A third-party package, eli5, provides a similar API: pip install eli5.

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())

The output is as below. +/- refers to the randomness that the shuffling resulted in. The higher the weight, the more important the feature is. Negative values are possible, but effectively mean zero importance; random chance made the predictions on shuffled data slightly more accurate.
While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions. Using the fitted model to predict our outcome, and by repeatedly altering the value of just one variable, we can trace the predicted outcomes in a plot to show its dependence on the variable and where it plateaus.
https://www.kaggle.com/dansbecker/partial-plots
from pdpbox import pdp

# isolate the effect of a single feature on the predictions
pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test,
                            model_features=X_test.columns, feature='Goal Scored')

# plot it
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()
2D Partial Dependence Plots are also useful for interactions between features.
# the feature pair below is illustrative
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=model, dataset=X_test,
                          model_features=X_test.columns,
                          features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1,
                      feature_names=features_to_plot,
                      plot_type='contour')
plt.show()
14.4 SHAP
SHapley Additive exPlanations (SHAP) break down a prediction to show the impact of each feature.
https://www.kaggle.com/dansbecker/shap-values
The explainer differs with the model type:
• shap.TreeExplainer(my_model) for tree models
• shap.DeepExplainer(my_model) for neural networks
• shap.KernelExplainer(my_model) for all models, but slower, and gives approximate SHAP val-
ues
Utilities
This page lists some useful functions and tips to make your datascience journey smoother.
15.1 Persistence

Data, models and scalers are examples of objects that benefit greatly from pickling. For data, it allows loading many times faster than from other sources, since it is saved in a python-native format. For models and scalers, there is often no other way of saving them, as they are natively python objects.
Saving dataframes.
import pandas as pd
df.to_pickle('df.pkl')
df = pd.read_pickle('df.pkl')
Saving models.

import pickle

pickle.dump(model, open('model_rf.pkl', 'wb'))
model = pickle.load(open('model_rf.pkl', 'rb'))
From sklearn’s documentation, it is said that in the specific case of scikit-learn, it may be better to use joblib’s replace-
ment of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often
the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string.
More: https://scikit-learn.org/stable/modules/model_persistence.html
import joblib
joblib.dump(clf, 'model.joblib')
clf = joblib.load('model.joblib')
If the dataset is huge, it can be a problem storing the dataframe in memory. However, we can reduce the dataset size significantly by analysing the data values for each column, and changing the datatype to the smallest that can fit the range of values. Below is a function created by arjanso in Kaggle that can be used plug-and-play.
import pandas as pd
import numpy as np
def reduce_mem_usage(df):
    '''
    Source: https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
    Reduce size of dataframe significantly using the following process
    1. Iterate over every column
    2. Determine if the column is numeric
    3. Determine if the column can be represented by an integer
    4. Find the min and the max value
    5. Determine and apply the smallest datatype that can fit the range of values
    '''
    start_mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings
            # Print current column type
            print("******************************")
            print("Column: ", col)
            print("dtype before: ", df[col].dtype)
            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            print("min for this col: ", mn)
            print("max for this col: ", mx)
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col].fillna(mn-1, inplace=True)
            # test if the column can be represented by an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint).sum()
            if -0.01 < result < 0.01:
                IsInt = True
            # apply the smallest datatype that fits the value range
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    else:
                        df[col] = df[col].astype(np.int64)
            else:
                df[col] = df[col].astype(np.float32)
            print("dtype after: ", df[col].dtype)
    mem_usg = df.memory_usage().sum() / 1024**2
    print("Final memory usage is: ", mem_usg, " MB")
    return df, NAlist
Pandas is fast but that is dependent on the dataset too. We can use multiprocessing to make processing in pandas
multitudes faster by
• splitting a column into partitions
• spin off processes to run a specific function in parallel
• union the partitions together back into a Pandas dataframe
Note that this only works for huge datasets, as it also takes time to spin off processes, and union back partitions
together.
# from http://blog.adeel.io/2016/11/06/parallelize-pandas-map-or-apply/
import numpy as np
import multiprocessing as mp
def func(x):
return x * 10
Jupyter Notebook is the go-to IDE for data science. However, it can be further enhanced using jupyter extensions: pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install
Some of my favourite extensions are:
• Table of Contents: sidebar showing a TOC generated from the notebook headings
• ExecuteTime: Time to execute script for each cell
• Variable Inspector: Overview of all variables saved in memory. Allow deletion of variables to save mem-
ory.
More: https://towardsdatascience.com/jupyter-notebook-extensions-517fa69d2231
Flask
Flask is a micro web framework written in Python. It is easy and fast to implement with knowledge of basic web development and REST APIs. How is it relevant to model building? Sometimes it is necessary to run models on a server or in the cloud, and the only way is to wrap the model in a web application. Flask is the most popular library for such a task.
16.1 Basics
This gives a basic overview of how to run flask, with the debugger on, displaying a static index.html file. A browser can then be navigated to http://127.0.0.1:5000/ to view the index page.
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
There is a default directory structure to adhere to. First, HTML files are placed under /templates; second, Javascript, CSS or other static files like images, models or logs are placed under /static.

app.py
config.py
templates/
static/
Flask by default comes with a configuration dictionary which can be called as below.
print(app.config)
{'APPLICATION_ROOT': '/',
'DEBUG': True,
'ENV': 'development',
'EXPLAIN_TEMPLATE_LOADING': False,
'JSONIFY_MIMETYPE': 'application/json',
'JSONIFY_PRETTYPRINT_REGULAR': False,
'JSON_AS_ASCII': True,
'JSON_SORT_KEYS': True,
'MAX_CONTENT_LENGTH': None,
'MAX_COOKIE_SIZE': 4093,
'PERMANENT_SESSION_LIFETIME': datetime.timedelta(days=31),
'PREFERRED_URL_SCHEME': 'http',
'PRESERVE_CONTEXT_ON_EXCEPTION': None,
'PROPAGATE_EXCEPTIONS': None,
'SECRET_KEY': None,
'SEND_FILE_MAX_AGE_DEFAULT': datetime.timedelta(seconds=43200),
'SERVER_NAME': None,
'SESSION_COOKIE_DOMAIN': None,
'SESSION_COOKIE_HTTPONLY': True,
'SESSION_COOKIE_NAME': 'session',
'SESSION_COOKIE_PATH': None,
'SESSION_COOKIE_SAMESITE': None,
'SESSION_COOKIE_SECURE': False,
'SESSION_REFRESH_EACH_REQUEST': True,
'TEMPLATES_AUTO_RELOAD': None,
'TESTING': False,
'TRAP_BAD_REQUEST_ERRORS': None,
'TRAP_HTTP_EXCEPTIONS': False,
'USE_X_SENDFILE': False}
However, for a large project with multiple environments, each with a different set of config values, we can create a configuration file. Refer to the links below for more.
• https://pythonise.com/series/learning-flask/flask-configuration-files
• https://flask.palletsprojects.com/en/0.12.x/config/#configuring-from-files
There are various ways to pass variables into or manipulate html using flask.
We can use the double curly brackets {{ variable_name }} in html, and within flask define a route. Within the
render_template, we pass in the variable.
In Python
@app.route('/upload', methods=["POST"])
def upload_file():
    img_path = 'static/img'
    img_name = 'img_{}.png'.format(int(time()))  # e.g., a timestamped name
    img = os.path.join(img_path, img_name)

    file = request.files['image_upload']
    file.save(img)

    # pass the saved image path into the template as img_show
    return render_template('index.html', img_show=img)
In HTML
<div class="row">
<img class="img-thumbnail" src={{img_show}} alt="">
</div>
In JavaScript
<script>
image_path = "{{ img_show }}";
</script>
We can implement python code in the html using the syntax {% if something %}. However, note that we need to close it with the same syntax also, i.e., {% endif %}.
In Python
@app.route('/upload', methods=["POST"])
def upload_file():
img_path = 'static/img'
In HTML
{% if img_show %}
<div class="row">
<img class="img-thumbnail" src={{img_show}} alt="">
</div>
{% endif %}
16.5 Testing
There are a number of HTTP request methods. Below are the two commonly used ones.
• GET: sends data in unencrypted form to the server, e.g., the ? values in the URL
• POST: used to send HTML form data to the server; data received is not cached by the server
16.5.1 Postman
Postman is a free software that makes it easy to test your APIs. After launching the flask application, we can send a
JSON request by specifying the method (POST), and see the JSON response at the bottom panel.
16.5.2 Python
Similarly, we can also send a request using the Python “requests” package.
import requests
# send request
res = requests.post('http://localhost:5000/api', json={'key':'value'})
# receive response
print(res.content)
16.5.3 CURL
We can use curl (Client URL) through the terminal as an easy way to test our API too. Here's a simple test to see that the API works, without sending any data.
<div class="row">
<form action="/upload" method="post" enctype="multipart/form-data">
<input type="file" name="image_upload" accept=".jpg,.jpeg,.gif,.png" />
<button type="submit" class="btn btn-primary">Submit</button>
</form>
</div>
In Python
import os
from time import time
@app.route('/upload', methods=["POST"])
def upload_file():
img_path = 'static/img'
return render_template('index.html')
To upload multiple files, add the multiple attribute to the html input tag.
16.7 Logging
We can use the in-built Python logging package for storing logs. Note that there are 5 levels of logging: DEBUG, INFO, WARNING, ERROR and CRITICAL. If the initial configuration is set at a high level, e.g., WARNING, lower levels of logs, i.e., DEBUG and INFO, will not be logged.
Below is a basic logger.
import logging

logging.basicConfig(level=logging.INFO,
                    filename='../logfile.log',
                    format='%(asctime)s :: %(levelname)s :: %(message)s')
logger = logging.getLogger(__name__)

# some script
logger.warning('This took x sec for model to complete')
We can use the RotatingFileHandler class to limit the file size via maxBytes and the number of log files to keep via backupCount. Note that the latter argument must be at least 1.
import logging
from logging.handlers import RotatingFileHandler
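A minimal sketch of wiring up the handler (the file name and size limits are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger('app')
logger.setLevel(logging.INFO)

# rotate once the log file exceeds maxBytes, keeping up to
# backupCount old files (backupCount must be at least 1)
handler = RotatingFileHandler('app.log', maxBytes=1_000_000, backupCount=3)
handler.setFormatter(logging.Formatter('%(asctime)s :: %(levelname)s :: %(message)s'))
logger.addHandler(handler)

logger.info('model training started')
```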
16.8 Docker
If the flask app is to be packaged in Docker, we need to set the host to 0.0.0.0 (so the app is reachable from outside the container), and expose the port during docker run.

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0')
If we run docker ps, under PORTS we should be able to see that the Docker host 0.0.0.0:5000 is mapped to port 5000 in the container.
We can and should set environment variables, i.e., variables stored in the OS, especially for passwords and keys, rather than keeping them in python scripts. This is because you don't want to upload them to github or other version control platforms. It also removes the need to copy/paste the keys into the script every time you launch the app.
To do this, in Mac/Linux, we can store the environment variable in a .bash_profile.
# open/create bash_profile
nano ~/.bash_profile

# add the variable inside, e.g. (the value here is a placeholder)
export SECRET_KEY="some-secret-value"

# restart bash_profile
source ~/.bash_profile
We can also add this to the .bashrc file so that the variable will not be lost each time you launch/restart the bash
terminal.
if [ -f ~/.bash_profile ]; then
. ~/.bash_profile
fi
In the flask script, we can then obtain the variable by using the os package.
import os
SECRET_KEY = os.environ.get("SECRET_KEY")
For flask apps in docker containers, we can add an -e to include the environment variable into the container.
Sometimes certain configurations differ between the local development and server production environments. We can
set a condition like the below.
We try not to interfere with the FLASK_ENV variable, which by default uses production, but instead create a new one.
if os.environ['ENV'] == 'production':
    UPLOAD_URL = 'url/in/production/server'
elif os.environ['ENV'] == 'development':
    UPLOAD_URL = '/upload'
We can then set the flask environment in docker as the below. Or if we are not using docker, we can export
ENV=development; python app.py.
We can use multi-processing or multi-threading to run parallel processing. Note that we should not end with thread.join() or p.join(), or the app will hang.
def prediction(json_input):
# prediction
pred_json = predict_single(save_img_path,
json_input,
display=False, ensemble=False,
save_dir=os.path.join(ABS_PATH, LOCAL_RESULT_FOLDER))
# post request
@app.route('/api', methods=["POST"])
def process_img():
json_input = request.json
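A stripped-down sketch of the threading pattern (the model call is faked with a sleep; in the actual route you would start the thread and return without joining):

```python
import threading
import time

results = {}

def prediction(json_input):
    # stand-in for a slow model call, purely for illustration
    time.sleep(0.1)
    results['pred'] = {'label': 'cat', 'input': json_input}

# inside the route: start the thread and return immediately,
# WITHOUT thread.join(), so the request is not blocked
thread = threading.Thread(target=prediction, args=({'key': 'value'},))
thread.start()

# joined here only so this standalone demo can inspect the result
thread.join()
```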
Flask's built-in server is meant for development, as it reminds you every time you launch it. One reason is that it is not built to handle multiple concurrent requests, which almost always occur in real life.
The way to patch this deficiency is to first, set up a WSGI (web server gateway interface), and then a web server.
The former is a connector to interface the python flask app to an established web server, which is built to handle
concurrency and queues.
For WSGI, there are a number of different ones, including gunicorn, mod_wsgi, uWSGI, CherryPy and Bjoern. The example below shows how to configure a WSGI file; we give it the example name flask.wsgi. The flask app must also be renamed as application.
#! /usr/bin/python
import sys
import os

sys.path.insert(0, "/var/www/app")
sys.path.insert(0, '/usr/local/lib/python3.6/site-packages')
sys.path.insert(0, "/usr/local/bin/")
os.environ['PYTHONPATH'] = '/usr/local/bin/python3.6'

# the web server looks for a callable named "application"
from app import app as application
For web servers, the two popular ones are Apache and Nginx. The example below shows how to set up Apache, as well as configuring WSGI in the Dockerfile. Note that all configurations of WSGI are actually set in Apache's httpd.conf file.
FROM python:3.6
EXPOSE 5000
COPY requirements.txt .
RUN pip install -r requirements.txt
# enable full read/write/delete in static folder if files are to have full access
RUN chmod 777 -R /var/www/app/static
Gunicorn is another popular, and extremely easy to use, WSGI server. We can install it with pip install gunicorn.
# gunicorn -w 2 pythonScriptName:flaskAppName
# it uses port 8000 by default, but we can change it
gunicorn --bind 0.0.0.0:5000 -w 2 app:app
sudo apt-get install nginx

# ubuntu firewall
sudo ufw status
sudo ufw enable
sudo ufw allow 'Nginx HTTP'
sudo ufw status
sudo ufw allow ssh

systemctl status nginx
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
• https://www.appdynamics.com/blog/engineering/a-performance-analysis-of-python-wsgi-servers-part-2/
16.13 OpenAPI
OpenAPI specification is a description format for documenting Rest APIs. Swagger is an open-source set of tools to
build this OpenAPI standard. There are a number of python packages that integrate both flask & swagger together.
• https://github.com/flasgger/flasgger
Also known as throttling, rate limiting controls the number of requests each IP address can make in a given time. This can be set using a library called Flask-Limiter: pip install Flask-Limiter.
More settings from this article https://medium.com/analytics-vidhya/how-to-rate-limit-routes-in-flask-61c6c791961b
Flask is an old but well-supported framework. However, asynchronous frameworks and ASGI (the asynchronous successor to WSGI) have resulted in numerous alternatives, like FastAPI, Quart and Vibora.
• https://geekflare.com/python-asynchronous-web-frameworks/
FastAPI
FastAPI is one of the next-generation python web frameworks that uses ASGI (asynchronous server gateway interface) instead of the traditional WSGI. It also includes a number of useful functions to make API creation easier.
17.1 Uvicorn
FastAPI uses Uvicorn as its ASGI server. We can configure its settings as described at https://www.uvicorn.org/settings/, either in the fastapi python app script, or at the terminal when we launch uvicorn. For the former, with the below specification, we can just execute python app.py to start the application.
import uvicorn
from fastapi import FastAPI

app = FastAPI()

if __name__ == "__main__":
    uvicorn.run('app:app', host='0.0.0.0', port=5000)
The documentation recommends that we use gunicorn, which has richer features for better control over the worker processes.
FastAPI uses the pydantic library to define the schema of the request & response APIs. This allows auto-generation of the OpenAPI documentation, and, for the request schema, validation when a request is received.
For example, given the json:
{
"ANIMAL_DETECTION": {
"boundingPoly": {
"normalizedVertices": [
{
"x": 0.406767,
"y": 0.874573,
"width": 0.357321,
"height": 0.452179,
"score": 0.972167
},
{
"x": 0.56781,
"y": 0.874173,
"width": 0.457373,
"height": 0.452121,
"score": 0.982109
}
]
},
"name": "Cat"
}
}
We can define in pydantic as below, using multiple basemodels for each level in the JSON.
• If there is no value, like y: float, it will be listed as a required field
• If we add a value, like y: float = 0.8369, it will be an optional field, with the value also listed as a default and example value
• If we add a value like x: float = Field(..., example=0.82379), it will be a required field, with the value listed as an example value
More attributes can be added in Field, that will be populated in OpenAPI docs.
from typing import List
from pydantic import BaseModel, Field

class lvl3_list(BaseModel):
    x: float = Field(..., example=0.82379, description="X-coordinates")
    y: float = 0.8369
    width: float
    height: float
    score: float

class lvl2_item(BaseModel):
    normalizedVertices: List[lvl3_list]

class lvl1_item(BaseModel):
    boundingPoly: lvl2_item
    name: str = "Human"

class response_item(BaseModel):
    HUMAN_DETECTION: lvl1_item

RESPONSE_SCHEMA = response_item
We do the same for the request schema and place them in the routing function.
import json
import base64

import cv2
import numpy as np

# REQUEST_SCHEMA / RESPONSE_SCHEMA are the pydantic models defined earlier;
# the route decorator & signature here are a minimal reconstruction
@app.post('/api', response_model=RESPONSE_SCHEMA)
def process_request(request: REQUEST_SCHEMA):
    JScontent = json.loads(request.json())
    encodedImage = JScontent['requests'][0]['image']['content']
    npArr = np.fromstring(base64.b64decode(encodedImage), np.uint8)
    imgArr = cv2.imdecode(npArr, cv2.IMREAD_ANYCOLOR)
    pred_output = model(imgArr)
    return pred_output
We can render templates like html, and pass variables into the html, using the below. Like flask, in html the variables are called with double curly brackets {{variable_name}}.
from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

@app.get('/')
def index(request: Request):
    UPLOAD_URL = '/upload/url'
    MODULE = 'name of module'
    # Jinja2Templates requires the request object in the context
    return templates.TemplateResponse('index.html',
                {"request": request, "upload_url": UPLOAD_URL, "module": MODULE})
17.4 OpenAPI
OpenAPI documentations of Swagger UI or Redoc are automatically generated. You can access it at the endpoints of
/docs and /redoc.
First, the title, description and version can be specified when initialising fastapi (the title & description strings below are placeholders).

app = FastAPI(title="App title",
              description="App description",
              version="1.0.0")
The request-response schemas and examples will be added to the docs after their inclusion in a post/get request routing function, with the schemas defined using pydantic.
17.5 Asynchronous
• https://medium.com/@esfoobar/python-asyncio-for-beginners-c181ab226598
Docker
Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. They allow a modular construction of an application (a microservice, in short) and are OS agnostic. Docker is a popular tool designed to make it easier to create, deploy, and run applications by using containers. Its images are developed using Linux.
Preprocessing scripts and models can be created as a docker image snapshot, and launched as one or multiple
containers in production. For models that need to be updated regularly, we should use volume mapping so that the
data is not removed when the container stops running.

A connection to read the features and output the predictions is also needed. This can be done via a REST API using a
Flask web server, or through a message broker like RabbitMQ or Kafka.
To start off a new project, create a new folder. This should only contain your docker file and related python files.
18.1.1 Dockerfile
A Dockerfile, named exactly as such, is a file without an extension. It contains the commands that tell docker the
steps to take to create an image, and it consists of instructions & arguments.

The commands run sequentially when building the image, also known as a layered architecture. Each layer is cached,
such that when any layer fails and is fixed, rebuilding starts from the last successfully built layer. This is why, as you
see below, we install the python packages first before copying the local files: if any of the local files change, there is
no need to rebuild the python packages again.
• FROM tells Docker which image you base your image on (eg, Python 3 or continuumio/miniconda3).
• RUN tells Docker which additional commands to execute.
• CMD tells Docker to execute the command when the image loads.
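Putting the three instructions together, a minimal Dockerfile might look like the sketch below; the image tag and file names are assumptions, not taken from the text:

```dockerfile
# base image to build on
FROM python:3.7-slim
# install dependencies first, to take advantage of layer caching
COPY requirements.txt .
RUN pip install -r requirements.txt
# copy the application code
COPY . /app
WORKDIR /app
# command executed when a container starts from this image
CMD ["python", "-u", "main.py"]
```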
To pass environment variables from docker run to the python code, we can use two methods.
1) Using os.environ.get in python script
import os
ip_address = os.environ.get('webcam_ip')
Then specify in docker run the variable for user input, followed by the image name
# in Dockerfile
CMD python -u main.py
# in bash
docker run -e webcam_ip=192.168.133.1 image_name
2) Using sys.argv in the python script, passing the value as a command-line argument

# in python script
import sys
webcam_ip = str(sys.argv[1])
# in Dockerfile
ENTRYPOINT [ "python", "-u", "main.py" ]
# in bash
docker run image_name 192.168.133.1
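The two methods can also be combined, preferring the command-line argument and falling back to the environment variable; this is only a sketch, and the function and variable names are illustrative:

```python
import os
import sys

def get_webcam_ip(argv, default="127.0.0.1"):
    """Prefer a CLI argument (docker run image_name <ip>), then fall
    back to the -e environment variable, then to a default."""
    if len(argv) > 1:
        return argv[1]
    return os.environ.get("webcam_ip", default)

# simulate `docker run -e webcam_ip=192.168.133.1 image_name`
os.environ["webcam_ip"] = "192.168.133.1"
print(get_webcam_ip(["main.py"]))  # → 192.168.133.1
```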
You do not want to include any files that are not required in the image, to keep its size at a minimum. A file named
.dockerignore, similar in function and syntax to .gitignore, can be used. It should be placed at the root,
together with the Dockerfile. Below are some standard files/folders to ignore.
# macos
**/.DS_Store
# python cache
**/__pycache__
.git
docker build -t imageName . — (-t = tag the image as) build and name the image, with “.” as the current directory
to look for the Dockerfile.

Note that every time you rebuild an image with the same name, the previous image will have its image name & tag
displayed as <none>.
Dockerhub is similar to Github, whereby it is a repository for your images to be shared with the community. Note that
a free Dockerhub account only allows a single image to be made private.

docker login — log into dockerhub, before you can push your image to the server.

docker push account/image_name — account refers to your dockerhub account name; this tag needs to be
created during the docker build command when building the image.
In a production environment, a docker compose file can be used to run all separate docker containers together. It
consists of all necessary configurations that a docker run command provides in a yaml file.
So, instead of entering multiple docker run image commands, we can just run one docker-compose.yml file to
start all the images, together with all the configurations like ports, volumes, depends_on, etc.
For Linux, we will need to first install docker compose: https://docs.docker.com/compose/install/. For Mac, it comes
preinstalled with docker.
Run the docker-compose up command to launch, or docker-compose up -d for detached mode. If some
images are not built yet, we can add a build specification in the docker compose file, e.g. build:
./directory_name.
version: '3'
services:
  facedetection:
    build: ./face
    container_name: facedetection
    ports:
      - 5001:5000
    restart: always
  calibration:
    build: ./calibration
    container_name: calibration
    ports:
      - 5002:5000  # host ports must be unique across services
    restart: always
• https://www.docker.com/blog/containerized-python-development-part-2/
Docker Swarm allows management of multiple docker containers as clones in a cluster to ensure high availability in
case of failure. This is similar to Apache Spark whereby there is a Cluster Manager (Swarm Manager), and worker
nodes.
web:
  image: "webapp"
  deploy:
    replicas: 5
database:
  image: "mysql"
Use the command docker stack deploy -c docker-compose.yml stack_name to launch the swarm.
18.4 Networking
The Bridge Network is a private internal network created by Docker. All containers are attached to this network by
default and get an internal IP in the 172.17.x.x range, so they are able to communicate with each other internally.
However, to access these containers from the outside world, we need to
• map the ports of these containers to the docker host, or
• associate the containers with the host network, meaning a container uses the same ports as the host network
There will come an instance when we need to communicate between containers. There are a few ways to go about it.

First, we can use the docker container IP address. However, this is not ideal as the IP can change; to obtain it, use
docker inspect.

The recommended way is to create a network and specify the containers to run within that network. Note that the name
of the container is also its hostname, while the port to use is the internal container port, not the published host port.
If we need to connect from a docker container to some application running outside in localhost, we cannot use the
usual http://localhost. Instead, we need to call http://host.docker.internal.
18.5 Commands
Help
Create Image
docker build -t image_name . — (-t = tag the image as) build and name the image, with “.” as the location of
the Dockerfile

docker run ubuntu:17.04 — the colon specifies the version (known as tags, as listed in Dockerhub), else the
latest is pulled

docker run ubuntu vs docker run mmumshad/ubuntu — the first is an official image; the second, with
the “/”, is created by the community

docker run -d image_name — (-d = detach) docker runs in the background, and you can continue typing other
commands in the bash; else you need to open another terminal

docker run -v /local/storage/folder:/image/data/folder mysql — (-v = volume mapping) map
a local folder into the container, else all data will be destroyed when the container is removed

docker run -p 5000:5000 --restart always image_name — (-p = publish port) map host port 5000 to
container port 5000; --restart always auto restarts the container if it crashes

docker run --name containerName imageName — give a name to the container
Fig. 4: Running docker with a command. Each container has a unique container ID, a container name, and their base
image name.
docker image prune — delete intermediate images tagged as <none> after recreating images from some changes

docker container prune — delete stopped containers

docker system prune — delete all unused/stopped containers/images/ports/etc.

docker run -it image_name sh — explore directories in a specific image; “exit” to get out of sh
Start/Stop Containers
Remove Containers/Images
Inside the docker container, if there is a need to view any files, we have to install an editor first: apt-get update
&& apt-get install nano. To exit the container, type exit.
Console Log
Any console prints will be added to the docker log, and it will grow without limit unless you assign one to it. The
logs are stored in /var/lib/docker/containers/[container-id]/[container-id]-json.log.
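With the default json-file logging driver, limits can be assigned at docker run time; a sketch with illustrative values:

```shell
# rotate the log at 10MB and keep at most 3 files
docker run --log-opt max-size=10m --log-opt max-file=3 image_name

# follow a container's console output without digging into /var/lib/docker
docker logs -f container_name
```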
Statistics
Sometimes we need to check the CPU or RAM for leakage or utilisation rates.
docker stats — check memory and CPU utilisation for all containers; add a container name to be specific

docker run -p 5000:5000 --memory 1000M --cpus="2.0" image_name — assign a limit of 1GB of
RAM and 2 CPUs; it will force the container to release the memory without causing a memory error
Docker images can get ridiculously large if you do not manage them properly. Luckily, there are various easy ways to
go about this.

1. Build a Proper Requirements.txt

Using the pipreqs library, it will scan through your scripts and generate a clean requirements.txt, without any
sub-dependencies or redundant libraries. Some manual intervention is needed if a library is not installed from pip but
from external links, or if the library does not auto-install its dependencies.
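A typical pipreqs invocation looks like the following; the project path is illustrative:

```shell
pip install pipreqs
# scan the project folder for imports and write a clean requirements.txt there
pipreqs ./project --force
```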
2. Use Alpine or Slim Python

The base python image, e.g. FROM python:3.7, is a whopping ~900MB. Using the Alpine Linux version FROM
python:3.7-alpine only takes up about 100MB. However, some libraries might face errors during installation
on this light-weight version.

Alternatively, the Slim version FROM python:3.7-slim takes about 500MB, which is a middle ground between
alpine and the base version.
3. Install Libraries First
A logical sequential way of writing a Dockerfile is to copy all files, and then install the libraries.
FROM python:3.7-alpine
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["gunicorn", "-w 4", "main:app"]
However, a more efficient way is to utilise layer caching, i.e., installing the libraries from requirements.txt before
copying the files over. This is because we will more than likely change our code more frequently than update our
libraries. Given that installation of libraries takes much longer too, putting it first allows subsequent file-only changes
to skip this step.
FROM python:3.7-alpine
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
WORKDIR /app
CMD ["gunicorn", "-w 4", "main:app"]
4. Multi-Stage Builds

Lastly, we can also use what is called a multi-stage build. During pip installation, caches of the libraries are stored
elsewhere, and the resulting image is bigger than it should be.

What we can do is copy the dependencies after building them, and paste them into a fresh base python image.
FROM python:3.7-slim as base
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.7-slim
# copy the installed libraries from the build stage
COPY --from=base /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY . /app
WORKDIR /app
CMD ["gunicorn", "-w 4", "main:app"]
* https://blog.realkinetic.com/building-minimal-docker-containers-for-python-applications-37d0272c52f3
* https://www.docker.com/blog/containerized-python-development-part-1/
* https://medium.com/swlh/alpine-slim-stretch-buster-jessie-bullseye-bookworm-what-are-the-differences-in-docker-62171ed4531d