Showing posts with label regression. Show all posts

Wednesday, April 7, 2021

A Simple model that earned a Silver medal in predicting the results of the NCAAW tournament

This year I decided to join the March Machine Learning Mania 2021 - NCAAW challenge on Kaggle. The challenge is to predict the outcome of each game in the NCAAW basketball tournament, the college-level tournament for women. Participants assign a probability to each outcome and are ranked on the leaderboard according to the accuracy of their predictions. One of the most attractive elements of the challenge is that the leaderboard is updated after each game throughout the tournament.

Since I have limited knowledge of basketball I decided to use a minimalistic model:
  • It uses three features that are easy to interpret: seed, percentage of victories, and the average score of each team.
  • It is based on Linear Regression, and it's tuned to predict extreme probability values only for games that are easy to predict (a minimal sketch of the idea is shown below).
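
To give the flavour of the approach, here is a minimal sketch (with made-up numbers, not the actual competition code): fit a linear regression on the differences between the two teams' seed, win percentage and average score, then clip the output so that extreme probabilities are produced only for clear-cut games. The clipping thresholds here are an assumption for illustration, not necessarily what the submission did.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one row per past game, with the differences
# between the two teams' seed, win percentage and average score.
X_train = np.array([[-8,  0.30,   9.0],
                    [ 5, -0.15,  -4.0],
                    [-2,  0.05,   1.5],
                    [10, -0.40, -12.0]])
y_train = np.array([1, 0, 1, 0])  # 1 if the first team won, 0 otherwise

model = LinearRegression()
model.fit(X_train, y_train)

# Clip the prediction so that extreme probabilities are assigned
# only to games that are easy to predict (illustrative thresholds).
new_game = np.array([[-6, 0.20, 7.0]])
prob = np.clip(model.predict(new_game), 0.05, 0.95)
print('P(first team wins) = %0.2f' % prob[0])
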
The visualizations in the notebook give insight into how the model estimates the winning probability in a game between two teams.

Surprisingly, this model ranked 46th out of 451 submissions, placing it in the top 11% of the leaderboard and earning a silver medal!

The notebook with the solution and some more charts can be found here.

Saturday, May 21, 2016

An intro to Regression Analysis with Decision Trees

It's been a while since the last post on this blog, but the Glowing Python is still active and strong! I just decided to publish some of my posts on the Cambridge Coding Academy blog. Here are the links to a series of two posts about Regression Analysis with Decision Trees. In this introduction to Regression Analysis we will see how to use scikit-learn to train Decision Trees to solve a specific problem: "How to predict the number of bikes hired in a bike sharing system on a given day?"

In the first post, we will see how to train a simple Decision Tree to exploit the relation between temperature and bikes hired; this tree will then be analysed to explain the result of the training process and to gain insights about the data. In the second, we will see how to learn more complex decision trees and how to assess the accuracy of the prediction using cross validation.
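
To give a taste of what the posts cover, here is a minimal sketch (on synthetic data, not the bike sharing dataset used in the posts) of training a Decision Tree regressor on a single feature with scikit-learn:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike sharing data: hires roughly
# increase with temperature, plus some noise.
rng = np.random.RandomState(0)
temperature = rng.uniform(0, 35, 200).reshape(-1, 1)
hires = 50 + 12*temperature.ravel() + rng.normal(0, 40, 200)

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(temperature, hires)
print(tree.predict([[25.0]]))  # estimated hires for a 25 degree day
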

Here's a sneak peek of the figures that we will generate:

Tuesday, January 27, 2015

Forecasting beer consumption with sklearn

In this post we will see how to implement a straightforward forecasting model based on the linear regression object of sklearn. The model that we are going to build is based on the idea that past observations are good predictors of a future value. Using some symbols, given x_{n−k}, ..., x_{n−2}, x_{n−1}, we want to estimate x_{n+h}, where h is the forecast horizon, using just the given values. The estimation that we are going to apply is the following:

x̂_{n+h} = w_k·x_{n−k} + ... + w_2·x_{n−2} + w_1·x_{n−1} + w_0

where x_{n−k} and x_{n−1} are respectively the oldest and the newest observations we consider for the forecast. The weights w_k, ..., w_1, w_0 are chosen so as to minimize the sum of squared errors

∑ (x̂_{n+h} − x_{n+h})²

over the m periods available to train our model. This model is often referred to as a regression model with lagged explanatory variables, and k is called the lag order.

Before implementing the model let's load a time series to forecast:
import pandas as pd
df = pd.read_csv('NZAlcoholConsumption.csv')
to_forecast = df.TotalBeer.values  # the series to forecast: quarterly beer consumption
dates = df.DATE.values             # quarter labels, used later for plotting
The time series represents the total alcohol consumed per quarter, in millions of litres, from the 1st quarter of 2000 to the 3rd quarter of 2012. The data comes from the New Zealand government and can be downloaded in csv from here. We will focus on forecasting beer consumption.
First, we need to organize our data into windows that contain the previous observations:
import numpy as np

def organize_data(to_forecast, window, horizon):
    """
     Input:
      to_forecast, univariate time series organized as numpy array
      window, number of items to use in the forecast window
      horizon, horizon of the forecast
     Output:
      X, a matrix where each row contains a forecast window
      y, the target values for each row of X
    """
    # build a sliding-window view of the series (no data is copied)
    shape = to_forecast.shape[:-1] + \
            (to_forecast.shape[-1] - window + 1, window)
    strides = to_forecast.strides + (to_forecast.strides[-1],)
    X = np.lib.stride_tricks.as_strided(to_forecast, 
                                        shape=shape, 
                                        strides=strides)
    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])
    return X[:-horizon], y

k = 4   # number of previous observations to use
h = 1   # forecast horizon
X,y = organize_data(to_forecast, k, h)
Now, X is a matrix where the i-th row contains the lagged variables x_{n−k}, ..., x_{n−2}, x_{n−1} and y[i] contains the i-th target value.
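A quick sanity check on a toy series makes the windowing easier to see:

demo = np.arange(8)  # toy series: 0, 1, ..., 7
Xd, yd = organize_data(demo, window=3, horizon=1)
print(Xd[0], yd[0])  # [0 1 2] 3, the first window and its target

We are ready to train our forecasting model: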
from sklearn.linear_model import LinearRegression
 
m = 10 # number of samples to take into account
regressor = LinearRegression(normalize=True)
regressor.fit(X[:m], y[:m])
We trained our model using the first 10 observations, which means that we used the data from the 1st quarter of 2000 to the 2nd quarter of 2002. We use a lag order of one year (k=4) and a forecast horizon of one quarter (h=1). To estimate the error of the model we will use the mean absolute percentage error (MAPE). Computing this metric to compare the forecasts on the remaining observations of the time series with the actual values, we have:
def mape(ypred, ytrue):
    """ returns the mean absolute percentage error """
    idx = ytrue != 0.0
    return 100*np.mean(np.abs(ypred[idx]-ytrue[idx])/ytrue[idx])

print('The error is %0.2f%%' % mape(regressor.predict(X[m:]), y[m:]))
The error is 6.15%
This means that, on average, the forecast provided by our model differs from the target value by only 6.15%. Let's compare the forecast and the observed values visually:
from pylab import *

figure(figsize=(8,6))
plot(y, label='True demand', color='#377EB8', linewidth=2)
plot(regressor.predict(X), 
     '--', color='#EB3737', linewidth=3, label='Prediction')
plot(y[:m], label='Train data', color='#3700B8', linewidth=2)
xticks(arange(len(dates))[1::4],dates[1::4], rotation=45)
legend(loc='upper right')
ylabel('beer consumed (millions of litres)')
show()

We note that the forecast is very close to the target values and that the model was able to learn the trends and anticipate them in many cases.