Ashwin Final Paper
Ashwin Siripurapu
Stanford University Department of Computer Science
353 Serra Mall, Stanford, CA 94305
ashwin@cs.stanford.edu
Abstract

Convolutional neural networks have revolutionized the field of computer vision. In this paper, we explore a particular application of CNNs: namely, using convolutional networks to predict movements in stock prices from a picture of a time series of past price fluctuations, with the ultimate goal of using them to buy and sell shares of stock in order to make a profit.

1. Introduction

At a high level, we will train a convolutional neural network to take in an image of a graph of time series data for past prices of a given asset (in our case, SPY contracts traded on the NYSE). Then, we will predict the movement of the price in the next few minutes. If the CNN correctly predicts price movements, we can make money by buying when the CNN says the price will go up in the future, and then selling at the higher price a few minutes later.

We evaluate the trained network both using traditional statistical performance measures (viz., R^2) and with a paper-trade simulator that lets us see what would have happened if we had bought and sold contracts according to the CNN's predictions; in particular, we can see how profitable the strategy of following the trained CNN would be. Naturally, this methodology is subject to the vulnerability that it is impossible to tell how other participants in the market would have reacted to the presence of the CNN's buying and selling, but it does give us at least some measure of confidence as to the CNN's abilities as a trader.

2. Problem Statement and Technical Approach

2.1. Gathering Data

The first step in the process of training a CNN to pick stocks is to gather some historical data. [1] provides minute-by-minute ticker data on the S&P 500 ETF Trust (symbol: SPY), traded on the NYSE. Specifically, for each minute of each trading day, we have the data listed in Table 1.

2.2. Initial Choice of Features

Since the project requires us to use pixel data, I had to convert this price data into images. This presents an interesting challenge in its own right, but a very obvious starting point is to take as our inputs (features) a graph of the price of the contract for some period of time into the past (say, 30 minutes back) and then use that to predict the price at some time in the future (say, 5 minutes ahead). Then, if we predict that the price will go up (down), we will buy (sell) in the present and sell (buy) in 5 minutes to realize a profit.

Firstly, what do we mean by the price of the contract? Recall from above that Google Finance provides us with four separate prices for each minute of the trading day. For the time being, I have elected to use only the high and low prices within a given minute, since these implicitly bound the other two prices (open and close). Moreover, the high and low intuitively contain more information than the open and close prices, because the open and close prices are in a sense statistical artifacts: they are the prices that the market happened to be at at the time that the price series was sampled by Google (or whoever was collecting the data).

Secondly, how far in the past should our time series graph go? This is in principle another hyperparameter that should be tweaked once the convolutional network has been set up, but for now, I have gone with a 30-minute window into the past.

In conclusion, the inputs to the model are images of the graph of high and low prices for 30-minute windows of time. These images are drawn using the numpy and matplotlib libraries and are saved as RGB images. An example input is shown below in Figure 2.2.

Later on, I experimented with using slightly different features; see Section 6.
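To make this feature construction concrete, below is a minimal sketch of how one of these window images could be produced with numpy and matplotlib. The figure size, exact colors, and file name are assumptions (the colors follow the convention used in the figures later in the paper: high prices in blue, low prices in green).

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # render off-screen; only the saved image matters
    import matplotlib.pyplot as plt

    def save_window_image(high, low, out_path):
        """Draw one 30-minute window of high/low prices and save it as an image.

        high, low: 1-D arrays of per-minute prices (length 30 here).
        Axes and labels are stripped so only the price curves remain as pixel data.
        """
        minutes = np.arange(len(high))
        fig, ax = plt.subplots(figsize=(8, 6), dpi=100)  # 600x800 pixels (height x width)
        ax.plot(minutes, high, color="blue")   # high prices in blue
        ax.plot(minutes, low, color="green")   # low prices in green
        ax.axis("off")                         # no ticks or labels; pure pixel features
        fig.savefig(out_path, format="png")
        plt.close(fig)

    # Example with a fake 30-minute window of prices.
    rng = np.random.default_rng(0)
    low = 210 + np.cumsum(rng.normal(0, 0.02, 30))
    high = low + np.abs(rng.normal(0.05, 0.02, 30))
    save_window_image(high, low, "window_0000.png")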
Column Name Meaning
DATE Time (which minute of the day)
CLOSE Closing price (price at the end of the minute)
HIGH High price (maximum price during the minute)
LOW Low price (minimum price during the minute)
OPEN Opening price (price at the beginning of the minute)
VOLUME How many contracts were offered to be bought/sold in the minute
Table 1. Minute-by-minute data provided by [1]
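As a concrete illustration, a minimal sketch of loading these minute bars is given below. It assumes the data have been exported to a CSV file with one row per minute and the columns of Table 1; the file name and exact format are assumptions.

    import numpy as np
    import pandas as pd

    # Hypothetical CSV export of the minute-by-minute SPY bars described in Table 1.
    bars = pd.read_csv(
        "spy_minute_bars.csv",
        names=["DATE", "CLOSE", "HIGH", "LOW", "OPEN", "VOLUME"],
        header=0,
    )

    # Per-minute high and low prices, from which the image features are drawn.
    highs = bars["HIGH"].to_numpy()
    lows = bars["LOW"].to_numpy()

    # 5-minute-ahead log return of the closing price, used later as the regression target.
    log_close = np.log(bars["CLOSE"].to_numpy())
    future_log_return = log_close[5:] - log_close[:-5]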
2.4. Choice of Loss Function

I am going to use an ℓ2 loss function when training the convolutional network. In the future, we can consider different choices of loss function, but the ℓ2 loss is very standard in regression problems in finance. Moreover, Caffe readily supports ℓ2 regression with its EUCLIDEAN_LOSS layer.

It is important to note that, unlike the length of the input window, the choice of loss function is not a hyperparameter to be tuned. This is because different loss functions are different problems entirely, not merely different solutions to the same problem. Different loss functions correspond to different notions of the displeasure or dissatisfaction with our predictions that we are trying to minimize. It makes no sense to argue that one setting of parameters is better than another when the comparison is across different loss functions.

That said, in trading, the ultimate test of how good a strategy or model is is how much money it makes. In that sense, and in that sense alone, it may make sense to experiment with different loss functions to derive different optimization problems, and then see which optimization problem yields the most profitable strategy.

The most basic financial model is ordinary least-squares regression (OLS). For purposes of establishing a baseline for performance, I used this model on a very simple set of features.

Concretely, I took the 600 x 800 time series graph images and scaled each one down to a 32 x 54 thumbnail image. In addition, I converted the images from four channels (RGBA) to one (grayscale). The thumbnails then corresponded to points in the space R^1728.

Treating each grayscale thumbnail and its corresponding log return as a training pair (x_i, y_i), I then fit a linear model to a training data set of 4000 points and tested it on a data set of 996 points.

The within-sample R^2 of the linear model was 0.428, which is fairly impressive for such noisy data. However, the ultimate test of any statistical model is how it performs out of sample. The out-of-sample R^2 for this linear model on the test set was an embarrassing -12.2. Clearly no one should use this model to trade on the market, unless he wants to lose a lot of money!

It should be possible for the final convolutional network to beat these results easily. In the first place, the baseline model used (OLS) was extremely simple. Secondly, the features (pixel data) bore little linear structure that could have been exploited to predict log returns well. A convolutional network with many nonlinearities can rectify this (no pun intended). Finally, the feature space used in this OLS baseline was heavily reduced: we shrunk the images to thumbnails and removed all color information. Given the full input data, a CNN should be able to do significantly better.

Ideally, we should be able to get R^2 > 0 on an out-of-sample test set. This means that we are doing better than the naive strategy of always guessing that the log return in the next 5 minutes will be the mean log return in the test set (usually around 0). If we can do this regularly, then, provided we have good execution (the ability to buy and sell reasonably quickly), we have the makings of a profitable trading strategy.

4. Workflow

In the following sections, I describe how I systematically made changes to the network architecture, to the hyperparameters, and to the features (images) that were put into the model. Concretely, my workflow was as follows:

2. Convert image features and log return response into HDF5 using hdf5_convert.py.

3. Generate network architecture file using [4], a script provided by a fellow student on Piazza.

4. Tune hyperparameters by modifying solver.prototxt.

5. Train network using Caffe.

6. Visualize weights in trained network using visualize_weights.py.

7. Evaluate network by computing out-of-sample R^2 with caffe_compute_r2.py.

5. Hyperparameter Tuning

The first thing that I did to achieve lower loss (hence higher R^2) was to tweak the optimization hyperparameters, as specified in the solver.prototxt file. This includes the starting learning rate, the learning rate update scheme and parameters, and the type of solver (SGD, Adagrad, or NAG [Nesterov accelerated gradient]). I started out with 10,000 training iterations, with momentum SGD. The learning rate α started out at 0.01 and was cut down by a factor of γ = 0.1 every 5,000 iterations (i.e., the step size was set to 5,000). In addition, the momentum term was set to μ = 0.9.
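For reference, a solver configuration consistent with these initial settings might look like the sketch below. The field names follow Caffe's solver format; the net path, snapshot settings, and anything not stated in the text above are placeholders, not the exact file used in this project.

    # solver.prototxt (sketch) -- initial settings described above
    net: "train_val.prototxt"      # placeholder path to the network definition
    solver_type: SGD               # later changed to NESTEROV (see below)
    base_lr: 0.01                  # starting learning rate (alpha)
    lr_policy: "step"              # anneal the learning rate in discrete steps
    gamma: 0.1                     # multiply the learning rate by 0.1 ...
    stepsize: 5000                 # ... every 5,000 iterations
    momentum: 0.9                  # momentum term (mu)
    max_iter: 10000                # 10,000 training iterations
    snapshot: 5000                 # placeholder snapshot schedule
    snapshot_prefix: "snapshots/convnet"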
Figure 2. Training and validation loss with SGD, initial α = 0.2, μ = 0.9, γ = 0.5, step size = 2000.

Figure 3. Training and validation loss with NAG.

This was far too low a learning rate, and too low a rate of annealing. As a result, training loss hardly moved from its initial value, and validation loss remained fairly flat, too.

I decided to increase the mobility of the optimization hyperparameters by increasing the initial learning rate, increasing the value of γ, and decreasing the step size (so that α would be updated more frequently). Concretely, I set the initial learning rate to 0.2, γ to 0.5, and the step size to 2000. μ remained at the original value of 0.9. This resulted in the training and validation loss plot shown in Figure 2.

Following this, I decided to experiment with Nesterov's accelerated gradient. To do this, I simply added the line solver_type: NESTEROV to the solver file. This resulted in the training and validation loss depicted in Figure 3. This did not significantly improve over momentum SGD loss.

When I switched to using different network architectures and different features (see below), I had to update the hyperparameters in solver.prototxt appropriately. Nonetheless, the same basic approach (come up with some hyperparameters, run the network, plot the training and validation loss curves) proved useful.

Figure 4. An example image input. As before, high prices in blue, low prices in green. Volume (right axis) in red.

6. Feature Engineering

Recall from Figure 2.2 what a typical input price window image looks like. After the poster session, some commenters suggested a better choice of inputs. In particular, my image inputs did not use the red channel to encode any data at all. The red channel could have been put to better use, for example, by using it to store data about the average of the low and high prices, or the volume at each minute of the trading day¹. Others suggested that I use a different visualization of the image data: rather than plotting the absolute price at each time for a short window, I could instead plot a spectrogram and visualize the price data in the frequency domain.

¹ Recall that volume is the total quantity of contracts available to be bought or sold in a given minute. In actual trading scenarios, this is usually expressed as two numbers (the number of contracts available for sale, and the number available for purchase), but Google Finance's data added the two together and expressed them as a single sum.

Ultimately, I experimented with two more kinds of inputs. The first one was similar to the original image data in that it used a time-domain representation of the price series, except that I also used volume data, which was plotted in red on a separate set of axes. An example of this kind of input is shown in Figure 4; a small plotting sketch is given below, after Table 2.

The other kind of representation that I tried was the so-called correlation features. Recall that the S&P 500 is a weighted basket of 500 different individual stocks (equities). That is, owning a single unit (share) of SPY is equivalent to owning some number of shares of each of the 500 constituent corporations. The ten companies which comprise the biggest share of the S&P 500 basket are shown in Table 2.
Company Symbol % Assets
Apple Inc. AAPL 4.03
Exxon Mobil Corporation Common XOM 2.01
Microsoft Corporation MSFT 1.93
Johnson & Johnson Common Stock JNJ 1.54
Berkshire Hathaway Inc Class B BRK.B 1.44
General Electric Company Common GE 1.40
Wells Fargo & Company Common St WFC 1.38
Procter & Gamble Company (The) PG 1.23
JP Morgan Chase & Co. Common St JPM 1.23
Pfizer, Inc. Common Stock PFE 1.16
Table 2. Top 10 components of the S&P 500. Data from [2]
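Returning to the price-and-volume inputs illustrated in Figure 4, a minimal sketch of how such an image could be drawn is given below. The figure size and exact arrangement are assumptions, chosen to be consistent with the caption of Figure 4 (high prices in blue, low prices in green, volume in red on a second, right-hand axis).

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    def save_price_volume_image(high, low, volume, out_path):
        """Draw a 30-minute window with prices on the left axis and volume on the right axis."""
        minutes = np.arange(len(high))
        fig, price_ax = plt.subplots(figsize=(8, 6), dpi=100)
        volume_ax = price_ax.twinx()                  # second y-axis (right) for volume

        price_ax.plot(minutes, high, color="blue")    # high prices in blue
        price_ax.plot(minutes, low, color="green")    # low prices in green
        volume_ax.plot(minutes, volume, color="red")  # volume in red on the right axis

        # Strip ticks and labels so the saved picture is pure pixel data.
        price_ax.axis("off")
        volume_ax.axis("off")
        fig.savefig(out_path, format="png")
        plt.close(fig)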
Figure 6. The weights of the first layer of the reduced architecture network after training on price and volume features.

Figure 8. The weights of the last convolution layer of the reduced architecture network after training on heatmap features.
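For completeness, a minimal sketch of the kind of first-layer weight visualization behind figures such as Figure 6 is shown below, assuming pycaffe is available. The file names and the layer name "conv1" are placeholders, not necessarily those used by visualize_weights.py.

    import caffe
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    # Load the trained network in test mode (paths are placeholders).
    net = caffe.Net("train_val.prototxt", "convnet_iter_10000.caffemodel", caffe.TEST)

    # First convolution layer weights: shape (num_filters, channels, height, width).
    weights = net.params["conv1"][0].data

    # Tile the filters into one grid image, normalizing each filter to [0, 1] for display.
    num_filters = weights.shape[0]
    cols = int(np.ceil(np.sqrt(num_filters)))
    rows = int(np.ceil(num_filters / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(np.ravel(axes)):
        ax.axis("off")
        if i < num_filters:
            f = weights[i].transpose(1, 2, 0)             # channels-last for imshow
            f = (f - f.min()) / (f.max() - f.min() + 1e-8)
            ax.imshow(f.squeeze())                        # squeeze handles single-channel filters
    fig.savefig("conv1_weights.png")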