12/06/2020 Pytorch [Tabular] — Regression - Towards Data Science
Photo by Simon Basler on Unsplash [Image [0]]
HOW TO TRAIN YOUR NEURAL NET
Pytorch [Tabular] — Regression
This blog post takes you through an implementation of regression on tabular data using
PyTorch.
Akshaj Verma
Mar 28 · 8 min read
We will use the red wine quality dataset available on Kaggle. This dataset has 12
columns where the first 11 are the features and the last column is the target column.
The data set has 1599 rows.
Import Libraries
We’re using tqdm to enable progress bars for training and testing loops.
https://towardsdatascience.com/pytorch-tabular-regression-428e9c9ac93 1/13
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Read Data
df = pd.read_csv("data/tabular/classification/winequality-red.csv")
df.head()
Input dataframe [Image [2]]
EDA and Preprocessing
First off, we plot the output column to observe the class distribution. There’s a lot of
imbalance here: classes 3, 4, and 8 have very few samples.
We will not treat the output variable as a class here because we’re performing
regression. We will convert the output column, which is all integers, to float values.
sns.countplot(x = 'quality', data=df)
Output Distribution [image [3]]
Create Input and Output Data
In order to split our data into train, validation, and test sets, we need to separate out
our inputs and outputs.
Input X is all but the last column. Output y is the last column.
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
Train — Validation — Test
To create the train-val-test split, we’ll use train_test_split() from Sklearn.
First, we’ll split our data into train+val and test sets. Then, we'll further split our
train+val set to create our train and val sets.
Because there’s a “class” imbalance, we want the distribution of output classes to be
the same in our train, validation, and test sets.
To do that, we use the stratify option in the function train_test_split() .
Remember that stratification needs discrete class labels, not continuous numbers. So, in
general, we can bin a continuous target into classes using quartiles, deciles, a
histogram ( np.histogram() ), and so on. You would then create a new dataframe that
contains the output and its "class", where the "class" is obtained using one of the
above-mentioned methods.
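As a rough illustration of the binning idea, here is a minimal sketch that stratifies on quartile bins of a continuous target. The data is synthetic and the names X_demo , y_demo , and y_bins are made up for this example; they are not part of the wine dataset workflow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical continuous target (synthetic data, not the wine dataset)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(loc=5.0, scale=1.0, size=200)

# Bin the target into quartiles; the integer bin index acts as the "class"
y_bins = pd.qcut(y_demo, q=4, labels=False)

# Stratify on the bins, but keep the continuous target for regression
X_tr, X_te, y_tr, y_te, bins_tr, bins_te = train_test_split(
    X_demo, y_demo, y_bins, test_size=0.2, stratify=y_bins, random_state=0
)

print(np.bincount(bins_te))  # number of test samples per quartile bin
```

The model still trains on the continuous y_tr ; the bins are only used to decide which rows go into which split.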
In our case, let’s use the numbers as is because they are already like classes. After we
split our data, we can convert the output to float (because we are performing regression).
# Train - Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y,
test_size=0.2, stratify=y, random_state=69)
# Split train into train-val
X_train, X_val, y_train, y_val = train_test_split(X_trainval,
y_trainval, test_size=0.1, stratify=y_trainval, random_state=21)
Normalize Input
Neural networks generally train better when the input features share a similar, small
scale, such as the range (0,1). There’s a ton of material available online on why we
need to do it.
To scale our values, we’ll use the MinMaxScaler() from Sklearn. The MinMaxScaler
transforms features by scaling each feature to a given range which is (0,1) in our case.
x_scaled = (x - min(x)) / (max(x) - min(x))
Notice that we use .fit_transform() on X_train while we use .transform() on X_val
and X_test .
We do this because we want to scale the validation and test sets with the same
parameters as the train set to avoid data leakage. .fit_transform() calculates the
scaling values and applies them, while .transform() only applies the already calculated
values.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
X_train, y_train = np.array(X_train), np.array(y_train)
X_val, y_val = np.array(X_val), np.array(y_val)
X_test, y_test = np.array(X_test), np.array(y_test)
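To make the fit/transform distinction concrete, the sketch below reproduces the min-max formula by hand on a tiny made-up array and compares it with MinMaxScaler . The arrays train and val are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up arrays: 3 training rows, 1 validation row, 2 features
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
val = np.array([[2.0, 40.0]])

# "fit": compute per-column min and max on the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the training statistics to any split
train_manual = (train - col_min) / (col_max - col_min)
val_manual = (val - col_min) / (col_max - col_min)

# The same thing with scikit-learn
scaler_demo = MinMaxScaler()
train_sk = scaler_demo.fit_transform(train)
val_sk = scaler_demo.transform(val)

print(val_manual)  # the second feature is 1.5, i.e. outside (0,1)
```

Note that validation or test values can land outside (0,1) when they exceed the range seen in training. That is expected when the scaler is fitted on the train set only.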
Visualize Class Distribution in Train, Val, and Test
Once we’ve split our data into train, validation, and test sets, let’s make sure the
distribution of classes is equal in all three sets.
To do that, let’s create a function called get_class_distribution() . This function takes
as input the object y , i.e. y_train , y_val , or y_test . Inside the function, we
initialize a dictionary which contains the output classes as keys and their counts as
values. The counts are all initialized to 0.
We then loop through our y object and update our dictionary.
def get_class_distribution(obj):
    count_dict = {
        "rating_3": 0,
        "rating_4": 0,
        "rating_5": 0,
        "rating_6": 0,
        "rating_7": 0,
        "rating_8": 0,
    }
    for i in obj:
        if i == 3:
            count_dict['rating_3'] += 1
        elif i == 4:
            count_dict['rating_4'] += 1
        elif i == 5:
            count_dict['rating_5'] += 1
        elif i == 6:
            count_dict['rating_6'] += 1
        elif i == 7:
            count_dict['rating_7'] += 1
        elif i == 8:
            count_dict['rating_8'] += 1
        else:
            print("Check classes.")
    return count_dict
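If you prefer something shorter, the same counts can be obtained with collections.Counter . This is an alternative sketch, not the function used in the rest of the post; the name get_class_distribution_compact and the ratings parameter are made up here, and unlike the version above it silently ignores unexpected values instead of printing a warning.

```python
from collections import Counter

def get_class_distribution_compact(obj, ratings=(3, 4, 5, 6, 7, 8)):
    # Count every value once, then report the expected ratings (0 if absent)
    counts = Counter(int(v) for v in obj)
    return {f"rating_{r}": counts.get(r, 0) for r in ratings}

example = get_class_distribution_compact([5, 6, 5, 7, 3, 6, 5])
print(example)
```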
Once we have the dictionary of counts, we use the Seaborn library to plot the bar charts.
To make the plot, we first convert our dictionary to a dataframe using
pd.DataFrame.from_dict([get_class_distribution(y_train)]) .
Subsequently, we .melt() our dataframe to convert it into the long format and finally
use sns.barplot() to build the plots.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25,7))
# Train
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[0]).set_title('Class Distribution in Train Set')
# Val
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_val)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[1]).set_title('Class Distribution in Val Set')
# Test
sns.barplot(data=pd.DataFrame.from_dict([get_class_distribution(y_test)]).melt(),
            x="variable", y="value", hue="variable",
            ax=axes[2]).set_title('Class Distribution in Test Set')
Output distribution after train-val-test split [Image [4]]
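If the .melt() step in the plotting code is unfamiliar, this small sketch shows what it does to a one-row dataframe of counts. The counts dictionary is made up for illustration.

```python
import pandas as pd

# Made-up counts, shaped like the output of get_class_distribution()
counts = {"rating_5": 3, "rating_6": 2}

wide = pd.DataFrame([counts])   # one row, one column per rating
long = wide.melt()              # one row per rating

print(wide)
print(long)
```

The long format has a "variable" column holding the old column names and a "value" column holding the counts, which is why the plotting calls use x="variable" and y="value" .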
Convert Output Variable to Float
y_train, y_test, y_val = (y_train.astype(float),
                          y_test.astype(float), y_val.astype(float))
Neural Network
Initialize Dataset
class RegressionDataset(Dataset):
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return len(self.X_data)
train_dataset = RegressionDataset(torch.from_numpy(X_train).float(),
torch.from_numpy(y_train).float())
val_dataset = RegressionDataset(torch.from_numpy(X_val).float(),
torch.from_numpy(y_val).float())
test_dataset = RegressionDataset(torch.from_numpy(X_test).float(),
torch.from_numpy(y_test).float())
Model Params
EPOCHS = 150
BATCH_SIZE = 64
LEARNING_RATE = 0.001
NUM_FEATURES = len(X.columns)
Initialize Dataloader
train_loader = DataLoader(dataset=train_dataset,
batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1)
test_loader = DataLoader(dataset=test_dataset, batch_size=1)
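To get a feel for what a DataLoader yields, here is a minimal sketch on random toy tensors. It uses PyTorch's built-in TensorDataset instead of RegressionDataset purely to keep the example short; X_toy , y_toy , and toy_loader are hypothetical names.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 11 features each
X_toy = torch.randn(10, 11)
y_toy = torch.randn(10)

toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=4, shuffle=False)

# Collect the shape of every (features, target) batch
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in toy_loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

With 10 samples and a batch size of 4, the loader produces three batches and the last one is smaller. len(toy_loader) returns the number of batches, not the number of samples.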
Define Neural Network Architecture
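We have a simple feedforward neural net here with three hidden layers and a single-unit
output layer. We use ReLU as the activation after every hidden layer; the output layer
has no activation because we are predicting a real number.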
class MultipleRegression(nn.Module):
    def __init__(self, num_features):
        super(MultipleRegression, self).__init__()
        self.layer_1 = nn.Linear(num_features, 16)
        self.layer_2 = nn.Linear(16, 32)
        self.layer_3 = nn.Linear(32, 16)
        self.layer_out = nn.Linear(16, 1)
        self.relu = nn.ReLU()

    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.relu(self.layer_2(x))
        x = self.relu(self.layer_3(x))
        x = self.layer_out(x)
        return x

    def predict(self, test_inputs):
        x = self.relu(self.layer_1(test_inputs))
        x = self.relu(self.layer_2(x))
        x = self.relu(self.layer_3(x))
        x = self.layer_out(x)
        return x
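A quick sanity check on shapes can save debugging time later. The sketch below rebuilds the same layer sizes with nn.Sequential (only for brevity, it is not the class used in this post) and pushes a random batch through it; sketch_model and dummy_batch are hypothetical names.

```python
import torch
import torch.nn as nn

# Same layer sizes as MultipleRegression with 11 input features
sketch_model = nn.Sequential(
    nn.Linear(11, 16), nn.ReLU(),
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)

dummy_batch = torch.randn(8, 11)   # 8 samples, 11 features
out = sketch_model(dummy_batch)

# Total number of trainable weights and biases
n_params = sum(p.numel() for p in sketch_model.parameters())
print(out.shape, n_params)
```

The output has shape (batch_size, 1), while the targets coming out of the dataloader have shape (batch_size,). That mismatch is why the training loop calls .unsqueeze(1) on the target batch before computing the loss.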
Check for GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else
"cpu")
print(device)
###################### OUTPUT ######################
cuda:0
Initialize the model, optimizer, and loss function. Transfer the model to GPU.
We are using the Mean Squared Error loss.
model = MultipleRegression(NUM_FEATURES)
model.to(device)
print(model)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
###################### OUTPUT ######################
MultipleRegression(
(layer_1): Linear(in_features=11, out_features=16, bias=True)
(layer_2): Linear(in_features=16, out_features=32, bias=True)
(layer_3): Linear(in_features=32, out_features=16, bias=True)
(layer_out): Linear(in_features=16, out_features=1, bias=True)
(relu): ReLU()
)
Train Model
Before we start our training, let’s define a dictionary which will store the loss/epoch
for both train and validation sets.
loss_stats = {
'train': [],
"val": []
}
Let the training begin.
print("Begin training.")
for e in tqdm(range(1, EPOCHS+1)):
    # TRAINING
    train_epoch_loss = 0
    model.train()
    for X_train_batch, y_train_batch in train_loader:
        X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
        optimizer.zero_grad()
        y_train_pred = model(X_train_batch)
        train_loss = criterion(y_train_pred, y_train_batch.unsqueeze(1))
        train_loss.backward()
        optimizer.step()
        train_epoch_loss += train_loss.item()

    # VALIDATION
    with torch.no_grad():
        val_epoch_loss = 0
        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            y_val_pred = model(X_val_batch)
            val_loss = criterion(y_val_pred, y_val_batch.unsqueeze(1))
            val_epoch_loss += val_loss.item()

    loss_stats['train'].append(train_epoch_loss/len(train_loader))
    loss_stats['val'].append(val_epoch_loss/len(val_loader))
    print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | '
          f'Val Loss: {val_epoch_loss/len(val_loader):.5f}')
###################### OUTPUT ######################
Epoch 001: | Train Loss: 31.22514 | Val Loss: 30.50931
Epoch 002: | Train Loss: 30.02529 | Val Loss: 28.97327
.
.
.
Epoch 149: | Train Loss: 0.42277 | Val Loss: 0.37748
Epoch 150: | Train Loss: 0.42012 | Val Loss: 0.37028
You can see we’ve put a model.train() before the inner training loop. model.train()
tells PyTorch that you’re in training mode.
Well, why do we need to do that? If you’re using layers such as Dropout or BatchNorm,
which behave differently during training and evaluation (for example, dropout is not
applied during evaluation), you need to tell PyTorch to act accordingly.
Similarly, we’ll call model.eval() when we test our model. We’ll see that below.
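The model in this post has no Dropout or BatchNorm layers, so the mode switch has no visible effect here. To see the difference on a layer where it matters, here is a small standalone sketch with nn.Dropout ; the names drop , train_out , and eval_out are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()             # training mode: elements are randomly zeroed
train_out = drop(x)      # survivors are scaled by 1 / (1 - p) = 2

drop.eval()              # evaluation mode: dropout is a no-op
eval_out = drop(x)

print(train_out.unique(), eval_out.unique())
```

Calling model.train() or model.eval() on a full model simply flips this flag on every submodule.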
Back to training; we start a for-loop. At the top of this for-loop, we initialize our loss per
epoch to 0. After every epoch, we’ll print out the loss and reset it back to 0.
Then we have another for-loop. This for-loop is used to get our data in batches from the
train_loader .
We do optimizer.zero_grad() before we make any predictions. Since the backward()
function accumulates gradients, we need to reset them to 0 manually for every mini-batch.
From our defined model, we then obtain a prediction, get the loss for that mini-batch,
and perform back-propagation using train_loss.backward() followed by a parameter update
with optimizer.step() .
Finally, we add up the losses of all the mini-batches and divide the sum by the number
of mini-batches, i.e. the length of train_loader , to obtain the average loss per epoch.
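One detail worth knowing: averaging the per-batch losses is not exactly the same as averaging over all samples when the last batch is smaller than the others. The numbers below are made up to show the arithmetic.

```python
# Made-up mean losses for three mini-batches and their sizes
batch_losses = [1.0, 2.0, 6.0]
batch_sizes = [4, 4, 2]

# What the training loop in this post computes: mean of the batch means
mean_of_batch_means = sum(batch_losses) / len(batch_losses)

# Per-sample mean: weight every batch loss by its batch size
per_sample_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(mean_of_batch_means, per_sample_mean)
```

For monitoring a loss curve the difference is usually small, so the simpler mean of batch means is a reasonable choice.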
The procedure we follow for validation is exactly the same as for training, except that
we wrap it in torch.no_grad() and do not perform any back-propagation.
torch.no_grad() tells PyTorch not to track gradients, which reduces memory usage and
speeds up computation.
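A tiny sketch of what torch.no_grad() changes: operations inside the block are not recorded for back-propagation. The variable names are illustrative.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

tracked = w * 3              # recorded in the autograd graph

with torch.no_grad():
    untracked = w * 3        # same value, but no graph is built

print(tracked.requires_grad, untracked.requires_grad)
```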
Visualize Loss
To plot the loss curves, we again create a dataframe from the loss_stats dictionary.
train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(
    id_vars=['index']).rename(columns={"index":"epochs"})
plt.figure(figsize=(15,8))
sns.lineplot(data=train_val_loss_df, x="epochs", y="value",
             hue="variable").set_title('Train-Val Loss/Epoch')
Train-Val loss curve [Image [6]]
Test Model
After training is done, we need to test how our model fared. Note that we call
model.eval() before we run our testing code. Because we do not need gradients during
inference, we use torch.no_grad() , just like we did for the validation loop above.
y_pred_list = []
with torch.no_grad():
    model.eval()
    for X_batch, _ in test_loader:
        X_batch = X_batch.to(device)
        y_test_pred = model(X_batch)
        y_pred_list.append(y_test_pred.cpu().numpy())
y_pred_list = [a.squeeze().tolist() for a in y_pred_list]
Let’s check the MSE and R-squared metrics.
mse = mean_squared_error(y_test, y_pred_list)
r_square = r2_score(y_test, y_pred_list)
print("Mean Squared Error :",mse)
print("R^2 :",r_square)
###################### OUTPUT ######################
Mean Squared Error : 0.40861496703609534
R^2 : 0.36675687655886924
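For reference, both metrics are simple to compute by hand. The sketch below uses three made-up values (not the model's predictions) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 7.5])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.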
. . .
Thank you for reading. Suggestions and constructive criticism are welcome. :)
This blog post is part of the column “How to train your Neural Net”. You can find
the column here.
You can find me on LinkedIn and Twitter. If you liked this, check out my other
blogposts.