ML and Deploying It Using Flask and Docker.
The data and answers refer to the features and target of the data and the rules
produced refer to the trained model. Machine learning is used in a variety of
industries like Agriculture, Software, Automobiles, Hospitals, and Healthcare.
Table of Contents
• Types of learning
o Supervised learning
o Unsupervised learning
o Reinforcement learning
• Approaching a Supervised Machine Learning Problem
• Data Preprocessing
o Handling Imbalanced Data
o Data Cleaning
o Data Transformation
o Data Reduction
• Data Modeling
o Regression algorithms
o Classification algorithms
• Performance Metrics
o Classification metrics
o Regression metrics
• Solving a Classification problem using Kaggle’s Titanic data
• Deployment
o Flask
o Docker
Types of Learning
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
1. Supervised Learning
Supervised learning is a type of learning that requires both features and a target in the data. For
example, to predict a house price, the target would be the house price column and the features would
be columns such as the number of bedrooms, square feet, and location.
The supervised model attempts to learn the relationship between the features and the target and
leverage the model to predict the unseen data. The target is necessary to learn the X->Y mapping in
the data.
1.1 Regression
1.2 Classification
1.1 Regression
Regression is a type of supervised machine learning in which the target values contain continuous
data. Examples of continuous data include 1.2 and 3.1.
Linear regression fits a linear relationship between the features and the target, typically by
choosing the parameters that minimize the sum of squared residuals. The following are the
assumptions of linear regression:
• Linearity
• No Multicollinearity
• Normality
• Homoscedasticity
• No autocorrelation of the residuals
Commonly used regression algorithms include:
• Linear Regression
• Random Forest Regressor
• Gradient Boosting Regressor
• Polynomial Regression
• XGBoost Regressor
1.2 Classification
Classification is a type of supervised machine learning in which the target values contain discrete
or categorical data. Examples of categorical data include red and green. The target classes can be
binary or multi-class.
A binary target has only two unique classes in the target column, whereas a multi-class target has
more than two unique classes.
Commonly used classification algorithms include:
• Logistic Regression
• Decision Trees
• Random Forest
• SVM
• XGBoost Classifier
2. Unsupervised Learning
Unsupervised learning is a type of learning that finds the underlying pattern in the provided data
without needing a target. Clustering techniques such as K-Means attempt to find similar data points
and group them together around centroids.
Clustering is very useful for segmenting customers and identifying similar customer behaviors.
Clustering techniques are also employed to detect anomalies in a dataset.
Commonly used clustering algorithms include:
• K-Means
• K-Medoids
• The DBSCAN algorithm
3. Reinforcement Learning
Commonly used reinforcement learning techniques include:
• Q-Learning
• Multi-armed bandits
In this post, we are going to focus on supervised learning, the techniques to
approach a supervised problem, and how to deploy the result using Flask and Docker.
Before feeding the data to the model, it is very important to preprocess the data.
The model follows a simple rule: "Garbage in, garbage out". The quality of the
model depends on the quality of the data that you feed in, so data preprocessing
plays an important role in approaching a machine learning problem.
Data Preprocessing
Data preprocessing is a process that needs to be carried out in order to produce high-quality
solutions. A large share of the work in a machine learning pipeline (often quoted as around 80%) is
data preprocessing.
Real-world data is noisy, inconsistent, and incomplete, and a model cannot cope
with these problems on its own. Noisy data refers to random errors in the data,
inconsistent data refers to entries that contradict each other or follow different
conventions, and incomplete data refers to values that are missing.
import pandas as pd

TRAIN_DATA_PATH = 'https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv'
TEST_DATA_PATH = 'https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv'
TARGET_NAME = 'Survived'

train_data = pd.read_csv(TRAIN_DATA_PATH)
test_data = pd.read_csv(TEST_DATA_PATH)

# x_train = features, y_train = target
x_train, y_train = (
    train_data.drop([TARGET_NAME, 'PassengerId'], axis=1),
    train_data[TARGET_NAME]
)
x_test = test_data
1. Handling Imbalanced Data
A dataset is said to be imbalanced if the classes of the target are not represented in roughly equal
proportions. There are many ways to handle imbalanced data. One of them is to evaluate the model with
metrics such as the F1 score or the ROC-AUC score instead of accuracy.
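To see why accuracy can mislead on an imbalanced target, here is a minimal sketch. The toy labels and the "always predict the majority class" baseline are assumptions made up for illustration; they are not the Titanic data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Toy target (hypothetical): nine negatives and one positive
y_true = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Share of each class in the target
class_share = y_true.value_counts(normalize=True)

# A "model" that always predicts the majority class
y_pred = pd.Series([0] * len(y_true))

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive is ever predicted
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(class_share)
print(baseline_accuracy, baseline_f1)
```

The baseline looks good on accuracy although it never finds the positive class, which is exactly what the F1 score exposes.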
2. Data Cleaning
Missing value imputation is a part of data cleaning. Missing value imputation is a process of filling in
the missing values in the data. Missing value imputation for numeric data differs from categoric
data.
For numeric data, we can use techniques like mean imputation and median imputation,
and for categoric data, we can use techniques like mode imputation. Other
options are model-based imputers such as KNN, MissForest, and linear regression,
which predict the missing values from the other columns.
Median imputation
The median is the middle value of the sorted data. Mean imputation is strongly
affected by outliers, so median imputation is usually the safer choice for
numeric columns that contain outliers.
Mode imputation
The mode is the value that occurs most often in the data. Mode imputation can be
used to impute categorical data.
from sklearn.impute import SimpleImputer

si_median = SimpleImputer(strategy="median")
si_mode = SimpleImputer(strategy="most_frequent")

numeric_data = x_train.select_dtypes('number')
categoric_data = x_train.select_dtypes(exclude='number')

# median imputation
numeric_data_imputed = pd.DataFrame(
    si_median.fit_transform(numeric_data.copy()),
    columns=numeric_data.columns,
    index=numeric_data.index
)

# mode imputation
categoric_data_imputed = pd.DataFrame(
    si_mode.fit_transform(categoric_data.copy()),
    columns=categoric_data.columns,
    index=categoric_data.index
)

# concatenating numeric and categoric data
X_train_imputed = pd.concat(
    [numeric_data_imputed, categoric_data_imputed],
    axis=1
)
In the above code, we used SimpleImputer to impute the numeric columns with
the median of those columns and the categoric columns with the mode of
those columns. The method select_dtypes selects the columns with
the provided data type. In our case we passed 'number', which selects all
the numerical columns.
3. Data Transformation
Data transformation is the process of transforming the data in order to increase the model
performance. Widely used data transformation techniques are categorical encoding,
standardization, and normalization.
Representing the data on a different scale might reduce the effect of outliers in the
data. For example, let us say that a column in the data is highly right-skewed. To
reduce the skewness, we can transform the data using a log
transformation. By doing so, the data becomes less skewed and
closer to a normal distribution.
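The effect of a log transformation on a right-skewed column can be sketched as follows. The column here is synthetic (drawn from a log-normal distribution) and the name fare is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed column, drawn from a log-normal distribution
fare = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# log1p computes log(1 + x), so it also copes with zeros
fare_log = np.log1p(fare)

skew_before = fare.skew()
skew_after = fare_log.skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The log transformation only makes sense for non-negative values, so a column with negative entries would need a different transformation.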
1. Categorical encoding
Categorical encoding is the process of converting categorical data into numeric data. The purpose of
categorical encoding is that the models can’t handle string data directly so we need to represent the
string data in a numerical format. Even though we are representing the data in the numerical format
the model handles it as a discrete variable.
There are many types of categorical encoding available, like label encoding, target
encoding, frequency encoding, m-estimate encoding, and one-hot encoding. In this
post, we are going to look at label encoding, frequency encoding, and target
encoding.
Installing category_encoders,
pip install category_encoders
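# frequency encoding (the encoder setup is an assumption; it is not shown in the source)
from category_encoders import CountEncoder
count_encoder = CountEncoder(normalize=True)
c_data = count_encoder.fit_transform(categoric_data.copy())
c_data.head()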
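# target encoding (the encoder setup is an assumption; it is not shown in the source)
from category_encoders import TargetEncoder
target_encoder = TargetEncoder()
c_data = target_encoder.fit_transform(categoric_data, y_train)
c_data.head()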
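To make clear what frequency and target encoding compute, here is a minimal sketch with plain pandas. The toy DataFrame and its column names are assumptions for illustration, and the target encoding shown is the simple unsmoothed mean; the TargetEncoder in category_encoders additionally applies smoothing, so its numbers will differ.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q", "S", "C"],
    "survived": [0, 1, 1, 0, 0, 1],
})

# Frequency encoding: replace each category with its share of the rows
frequencies = df["embarked"].value_counts(normalize=True)
df["embarked_freq"] = df["embarked"].map(frequencies)

# Target encoding (unsmoothed): replace each category with the mean target
target_means = df.groupby("embarked")["survived"].mean()
df["embarked_target"] = df["embarked"].map(target_means)

print(df)
```

Because target encoding uses the target, it should be fitted on the training data only, otherwise information leaks into the evaluation.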
2. Standardization
Standardization is the process of scaling the features such that the mean is 0 and
the standard deviation is 1. The purpose of standardization is to bring the
features onto similar scales, which makes training faster when we use
optimization techniques such as gradient descent. For models that are sensitive
to feature scale, standardization can also improve accuracy.
2.1 Z-Score
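The z-score of a value is its distance from the mean measured in standard deviations. Writing x for a value of the feature, μ for the mean of the feature, and σ for its standard deviation (these symbols are our notation, not the source's), the standard definition is:

```latex
z = \frac{x - \mu}{\sigma}
```

A feature transformed this way has mean 0 and standard deviation 1, which is the property described above.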
3. Normalization
Normalization is the process of scaling data such that it falls within a certain
interval. For example, we can scale a feature such that the data it contains falls
between 0 and 1.
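The two scaling approaches described above can be sketched with scikit-learn as follows. The five-value feature is made up for illustration; StandardScaler and MinMaxScaler are one common way to do this, not necessarily what the rest of this post uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with five values
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation
x_standardized = StandardScaler().fit_transform(x)

# Normalization: rescale linearly so that min -> 0 and max -> 1
x_normalized = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

print(x_standardized.ravel())
print(x_normalized.ravel())
```

In both cases the scaler should be fitted on the training data and then reused to transform the test data.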
4. Data Reduction
Data reduction is the process of reducing the data along the column axis. We perform data reduction
techniques to avoid the overfitting phenomenon. Overfitting is a phenomenon in which the model would
show high performance in training but fails to generalize to the new unseen data.
One of the dimensionality reduction techniques is feature selection, which discards a feature after looking
at its statistical importance. A wide variety of feature selection techniques are available, such as
RFE (Recursive Feature Elimination) and RFECV (Recursive Feature Elimination with Cross-Validation).
Recursive Feature Elimination is a type of wrapper feature selection that requires a model to find the
importance of each feature. As the name says, it recursively eliminates the least important
features.
The drawback of RFE is that we have to provide the number of features to select. To
overcome this we can use RFECV, which uses cross-validation to determine the number of
features to select. In either case, x_train must be preprocessed before it is used.
Recursive Feature Elimination with Cross Validation is a type of wrapper feature selection that is
very similar to RFE but eliminates the hyperparameter n_features_to_select.
It recursively eliminates the features and uses cross-validation to determine the number of
features to select.
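The difference between RFE and RFECV can be sketched as follows. The data is synthetic and stands in for a preprocessed x_train, and LogisticRegression is just one possible choice of estimator; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic, already-numeric data standing in for a preprocessed x_train
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

estimator = LogisticRegression(max_iter=1000)

# RFE: the number of features to keep must be given up front
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)

# RFECV: cross-validation picks the number of features
rfecv = RFECV(estimator, cv=5, scoring="accuracy").fit(X, y)

selected_by_rfe = rfe.support_            # boolean mask over the 8 columns
n_selected_by_rfecv = rfecv.n_features_   # chosen by cross-validation

print(selected_by_rfe, n_selected_by_rfecv)
```

Both selectors expose a transform method, so the selected columns can be obtained with rfe.transform(X) or rfecv.transform(X).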
Data Modeling
Data modeling is the process of training a model that learns the relationship between the features
and the target. A large amount of data is preferable because the more data there is, the more
patterns there are to learn. The trained model is then used to predict new, unseen data. A highly
complex and flexible model like a decision tree is more prone to overfitting the data.
To avoid overfitting we can use ensemble methods. Bagging trains several models independently
on bootstrap samples of the data and combines their predictions into a stronger model, whereas
boosting trains several weak models in sequence, feeding the output of each weak model to the
next one so that it can learn from the previous errors.
Performance Metrics
Performance metrics are used to assess the quality of the trained model. For a regression model,
training means learning the parameters of the model, so measuring the performance
means measuring how well the model has learned those parameters. To know the performance we use cost
functions.
Regression and classification techniques require separate sets of metrics to analyze the
performance of the model. Gradient descent, RMSProp, and Adam are optimization
techniques that minimize the cost function. The gradient descent algorithm searches for the optimal
parameters of the model by following the gradients of the error. We can run the algorithm for many
cycles until it converges to a minimum of the cost function, which is the global minimum only when
the cost function is convex.
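A minimal sketch of gradient descent for a one-feature linear model is shown below. The data is synthetic, and the learning rate and number of cycles are arbitrary choices for illustration rather than recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 2x + 1 plus a little noise
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.01, size=100)

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the cost
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

mse = np.mean(((w * x + b) - y) ** 2)
print(w, b, mse)
```

The mean squared error of a linear model is convex, so in this particular case the minimum that gradient descent reaches is the global one.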
Commonly used classification metrics include:
• Accuracy score
• F1 score
• ROC-AUC score
• Log loss
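The metrics listed above can be computed with scikit-learn as in the following sketch. The labels and predicted probabilities are made up for illustration, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

# Hard labels from a 0.5 threshold
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)   # uses hard labels
f1 = f1_score(y_true, y_pred)               # uses hard labels
roc_auc = roc_auc_score(y_true, y_prob)     # uses probabilities
loss = log_loss(y_true, y_prob)             # uses probabilities

print(accuracy, f1, roc_auc, loss)
```

Accuracy and F1 depend on the chosen threshold, while ROC-AUC and log loss are computed from the probabilities directly.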
Before getting into solving the problem, let us have a look at what Flask is.
What is Flask
FlaskAndDocker
|---> training.py
|---> utils.py
|---> requirements.txt
|---> Dockerfile
The following code is in the module utils.py. It implements frequency encoding,
median and mode imputation, and robust scaling, all of which are used for
preprocessing.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import RobustScaler

model_path = "/FlaskAndDocker/model.pkl"
pipeline_path = '/FlaskAndDocker/pipeline.pkl'
TRAIN_DATA_PATH = 'https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv'
TEST_DATA_PATH = 'https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv'
TARGET_NAME = 'Survived'
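class FrequencyEncoder(TransformerMixin):
    """
    Frequency Encoder that handles ties in
    encoding.
    """

    def __init__(self, handle_unknown=0.0):
        # Reconstructed from the call site: value used for unseen categories
        self.handle_unknown = handle_unknown

    @staticmethod
    def _get_categoric_columns(data):
        """Return categoric column names"""
        data_dtypes_dict = dict(data.dtypes)
        categorical_columns = [
            k for k, v in data_dtypes_dict.items()
            if v == 'object' or pd.api.types.is_categorical_dtype(v)
        ]
        return categorical_columns

    def fit(self, X, y=None):
        """Learn the relative frequency of every category."""
        self.categorical_columns = \
            FrequencyEncoder._get_categoric_columns(X)
        if len(self.categorical_columns) == 0:
            return self
        self.frequency_dict_ = {}
        data_len = len(X)
        # Assumption: the loop body is not shown in the source, and the
        # tie handling of the original is not reconstructed here
        for col in self.categorical_columns:
            counts = X[col].value_counts()
            self.frequency_dict_[col] = (counts / data_len).to_dict()
        return self

    def transform(self, X):
        """
        Returns
        -------
        ranked_data : pandas.DataFrame
        """
        if len(self.categorical_columns) == 0:
            return X
        X = X.copy()
        # Assumption: map each category to its frequency learned in fit
        for col in self.categorical_columns:
            encoded = X[col].astype(object).map(self.frequency_dict_[col])
            X[col] = encoded.fillna(self.handle_unknown).astype(float)
        return X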
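class Imputer(TransformerMixin):
    """
    Median and Mode imputer
    """

    def __init__(self, strategy='median'):
        # Reconstructed from the call sites Imputer('median'), Imputer('mode')
        self.strategy = strategy

    @staticmethod
    def to_lowercase(x):
        if pd.isnull(x):
            return x
        return x.lower()

    def fit(self, X, y=None):
        """
        Parameters
        ----------
        X : pandas.DataFrame
        y : ignored

        Returns
        -------
        self : object
        """
        if self.strategy == 'median':
            self.numerical_columns = list(X.select_dtypes('number').columns)
            self.median_dict_ = {}
            if len(self.numerical_columns) == 0:
                return self
            # Assumption: the loop body is not shown in the source
            for col in self.numerical_columns:
                self.median_dict_[col] = X[col].median()
        else:
            self.categorical_columns = list(X.select_dtypes('object').columns)
            self.mode_dict_ = {}
            if len(self.categorical_columns) == 0:
                return self
            # Assumption: the loop body is not shown in the source
            for col in self.categorical_columns:
                self.mode_dict_[col] = X[col].mode().iloc[0]
        return self

    def transform(self, X):
        """
        Parameters
        ----------
        X : pandas.DataFrame

        Returns
        -------
        X : pandas.DataFrame
            Imputed data
        """
        if self.strategy == 'median':
            if len(self.numerical_columns) == 0:
                return X
            # Assumption: fill each numeric column with its stored median
            return X.fillna(self.median_dict_)
        if len(self.categorical_columns) == 0:
            return X
        # Assumption: fill each categoric column with its stored mode
        return X.fillna(self.mode_dict_)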
class CustomRobustScaler(RobustScaler):
    """RobustScaler that scales only the numeric columns of a DataFrame."""

    def _merge_scaled(self, X, scaled_data):
        # Helper name and signature are assumptions; only the body
        # below appears in the source
        X_numeric_data = pd.DataFrame(
            scaled_data,
            columns=self.numerical_columns
        )
        X_remnant_data = X.drop(self.numerical_columns, axis=1)
        X_original = pd.concat(
            [
                X_numeric_data.reset_index(drop=True),
                X_remnant_data.reset_index(drop=True)
            ],
            axis=1
        )
        X_original = X_original.sort_index(axis=1)
        return X_original

    def fit(self, X, y=None):
        """
        Returns
        -------
        self : object
        """
        self.numerical_columns = list(X.select_dtypes(['number']).columns)
        print(f"Numeric columns: {self.numerical_columns}")
        if len(self.numerical_columns) == 0:
            # No numerical columns detected
            return self
        X_numeric = X[self.numerical_columns]
        super().fit(X_numeric)
        return self

    def transform(self, X):
        """
        Returns
        -------
        X_original : scaled data
        """
        if len(self.numerical_columns) == 0:
            return X
        X_numeric = X[self.numerical_columns]
        scaled_data = super().transform(X_numeric)
        X_original = self._merge_scaled(X, scaled_data)
        return X_original
requirements.txt contents,
Flask==2.0.1
numpy==1.21.2
pandas==1.2.4
scikit_learn==1.0
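# training.py (imports and app setup are not shown in the source; reconstructed)
import pickle
import pandas as pd
from flask import Flask
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from utils import (
    TARGET_NAME, TEST_DATA_PATH, TRAIN_DATA_PATH, model_path, pipeline_path,
    CustomRobustScaler, FrequencyEncoder, Imputer,
)

app = Flask(__name__)

@app.route("/train_titanic")
def train_titanic():
    train_data = pd.read_csv(TRAIN_DATA_PATH)
    x_train, y_train = (
        train_data.drop([TARGET_NAME, 'PassengerId'], axis=1),
        train_data[TARGET_NAME]
    )
    data_len = x_train.shape[0]
    median_imputer = Imputer('median')
    mode_imputer = Imputer('mode')
    robust_scaler = CustomRobustScaler()
    frequency_encoder = FrequencyEncoder(handle_unknown=1 / data_len)
    preprocessing_steps = [
        'median_imputer', 'mode_imputer',
        'robust_scaler', 'frequency_encoder',
    ]
    preprocessing_instances = [
        median_imputer, mode_imputer,
        robust_scaler, frequency_encoder
    ]
    # Assumption: chain the steps and preprocess the training data
    pipeline = Pipeline(list(zip(preprocessing_steps, preprocessing_instances)))
    x_train_preprocessed = pipeline.fit_transform(x_train)
    accuracy_list = []
    models = [
        LogisticRegression(),
        GradientBoostingClassifier(),
        RandomForestClassifier()
    ]
    # Assumption: fit every model and score it on the training data
    for candidate in models:
        candidate.fit(x_train_preprocessed, y_train)
        predictions = candidate.predict(x_train_preprocessed)
        accuracy_list.append(accuracy_score(y_train, predictions))
    accuracy = max(accuracy_list)
    model = models[accuracy_list.index(accuracy)]
    # Persist the best model and the fitted pipeline for /test_titanic
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    with open(pipeline_path, 'wb') as f:
        pickle.dump(pipeline, f)
    return {
        "status": "success",
        "selected_model": f"{model.__class__.__name__}",
        "accuracy": f"{accuracy}",
    }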
@app.route("/test_titanic")
def test_titanic():
    test_data = pd.read_csv(TEST_DATA_PATH)
    x_test = test_data.drop("PassengerId", axis=1)
    try:
        # Load the model and the fitted pipeline saved by /train_titanic
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        with open(pipeline_path, 'rb') as f:
            pipeline = pickle.load(f)
    except Exception as e:
        print(f"Exception: {e}")
        return {
            'status': "failure",
            "message": "Train the model first"
        }
    x_test_preprocessed = pipeline.transform(x_test)
    predictions = model.predict(x_test_preprocessed)
    return {
        "status": "success",
        "predictions": predictions.tolist()
    }


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8088)
First of all, the code defines a preprocessing pipeline. In our case, the pipeline consists of
median imputation, mode imputation, robust scaling, and frequency encoding:
• The median imputer fills the missing values in the numerical columns with the median of those
columns.
• The mode imputer fills the missing values in the categorical columns with the mode (the most
frequent category) of those columns.
• The robust scaler centers each numerical column on its median and scales it by its interquartile
range, so the result is far less influenced by outliers than scaling with the mean and the
standard deviation.
• The frequency encoder encodes the categorical data by mapping each category to the frequency of
that category.
When the fit_transform method is invoked, the above pipeline is executed. Each step picks out the
numeric or categoric columns it is responsible for and applies the appropriate transformation to
them. The fitted pipeline object is kept so that it can be used to transform the test data. The
methodology used here to write preprocessors is known as custom transformers, which can be chained
in a pipeline in the same way as the built-in ones.
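The custom transformer idea can be sketched in a few lines. The class ColumnMedianImputer and the toy DataFrames below are hypothetical and much simpler than the classes in utils.py; the sketch only shows the fit/transform contract and why the fitted pipeline is reused on the test data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fills numeric NaNs with training medians."""

    def fit(self, X, y=None):
        # Learn the medians from the training data only
        self.medians_ = X.select_dtypes("number").median()
        return self

    def transform(self, X):
        # Reuse the stored medians, so test data is filled consistently
        return X.fillna(self.medians_)


train = pd.DataFrame({"age": [20.0, float("nan"), 40.0], "sex": ["m", "f", "f"]})
test = pd.DataFrame({"age": [float("nan")], "sex": ["m"]})

pipeline = Pipeline([("median_imputer", ColumnMedianImputer())])
train_out = pipeline.fit_transform(train)   # fit on train, then transform it
test_out = pipeline.transform(test)         # transform only, no refitting

print(train_out)
print(test_out)
```

The missing age in the test data is filled with the median learned from the training data, not with a statistic of the test data itself.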
After running the pipeline and preprocessing the data, we train different models on the preprocessed
data and check the accuracy. We have used Logistic Regression, a Gradient Boosting classifier, and a
Random Forest classifier. In our case, the Random Forest classifier performed best, with an
accuracy of about 98%. Note that this figure is measured on the training data itself (it corresponds
to 876 of the 891 training rows), so it overstates how well the model will do on unseen data.
After training the model, we have to pickle it so that it can be used to predict on future data.
Pickling is a serialization technique that persists the model's state and behavior.
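A minimal sketch of pickling and restoring a model is shown below. The synthetic data, the LogisticRegression model, and the temporary file path are assumptions for illustration and are not the ones used in training.py.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the preprocessed training set
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical location for the pickled model
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the fitted model to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, for example in another request, restore it and predict
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

A pickle file should only be loaded from a trusted source, and ideally with the same library versions that were used to create it.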
python3 training.py
{
"accuracy": "0.9831649831649831",
"selected_model": "RandomForestClassifier",
"status": "success"
}
{
"status": "success",
"predictions": [0, 1, ......, 0]
}
Docker
• Dockerfile
• Docker image
• Docker container
A Dockerfile is the blueprint of a Docker image. It consists of
instructions, applied layer by layer, for building an image. A Docker image is a read-only
template for a container. We can build even more layers on top of an image.
A running instance of a Docker image is a Docker container. You can create many containers based
on a single Docker image.
Dockerfile contents,
FROM ubuntu
ADD . /FlaskAndDocker
WORKDIR /FlaskAndDocker
ENV TZ=Asia/Kolkata
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update && apt-get install python3.6 -y && apt-get install python3-pip -y
RUN apt-get install vim -y
RUN pip3 install -r requirements.txt
CMD ["/bin/bash"]
• The instruction 'FROM ubuntu' means that the ubuntu image is the base image. If
the image is not found locally, Docker pulls it from Docker Hub.
• The instruction 'ADD . /FlaskAndDocker' adds all the files in the current
directory of the host machine to the /FlaskAndDocker directory of the image.
• The instruction 'WORKDIR /FlaskAndDocker' changes the working directory to
/FlaskAndDocker.
• After that, we install the software our project needs: Python, pip, vim,
and the Python packages listed in requirements.txt.
• The CMD instruction sets the default command that is executed when a container starts.
To build the image, run the following command,
docker build -t flask_and_docker .
The -t flag is used to provide a name to the Docker image. Here we have provided
the name flask_and_docker. The '.' symbol denotes that the Dockerfile is
in the current directory.
To create a container from the image, run the following command (the 8088:8088 port
mapping is assumed from the port used in app.run),
docker run -it -p 8088:8088 flask_and_docker python3 training.py
The -it flags attach an interactive pseudo-terminal to the container. The -p
flag maps a port of the host machine to a port of the Docker container.
flask_and_docker is the name of the image that we have just created. The command
python3 training.py runs the Flask application; it overrides
the CMD instruction in the Dockerfile.
Now, to train the model, open the following URL in the browser,
http://0.0.0.0:8088/train_titanic
And to test the model, open the following URL in the browser,
http://0.0.0.0:8088/test_titanic
In this way, you can preprocess, train, and deploy models using Flask and Docker.