
8/27/2021 Generating Alpha with Vectorspace AI NLP/NLU Correlation Matrix Datasets: Equities vs The Periodic Table of Elements | by Gaëtan Rickter | Hacker…

Generating Alpha with Vectorspace AI NLP/NLU Correlation Matrix Datasets: Equities vs The Periodic Table of Elements
Gaëtan Rickter
May 28, 2017 · 8 min read

I finally beat the S&P 500 by 10%. This might not sound like much but when we’re
dealing with large amounts of capital and with good liquidity, the profits are pretty
sweet for a hedge fund. More aggressive approaches have resulted in much higher
returns.

https://medium.com/hackernoon/unsupervised-machine-learning-for-fun-profit-with-basket-clusters-17a1161e7aa1

It all started after I read a paper by Gur Huberman titled “Contagious Speculation and a
Cure for Cancer: A Non-Event that Made Stock Prices Soar,” (with Tomer Regev, Journal
of Finance, February 2001, Vol. 56, No. 1, pp. 387–396). The research described an event
that occurred in 1998 with a public company called EntreMed (ENMD was the symbol at
the time):

“A Sunday New York Times article on a potential development of new cancer-curing
drugs caused EntreMed’s stock price to rise from 12.063 at the Friday close, to open at
85 and close near 52 on Monday. It closed above 30 in the three following weeks. The
enthusiasm spilled over to other biotechnology stocks. The potential breakthrough in
cancer research already had been reported, however, in the journal Nature, and in
various popular newspapers (including the Times) more than five months earlier. Thus,
enthusiastic public attention induced a permanent rise in share prices, even though no
genuinely new information had been presented.”

Among the many insightful observations made by the researchers, one stood out in the
conclusion:

“[Price] movements may be concentrated in stocks that have some things in common,
but these need not be economic fundamentals.”

I wondered if it was possible to cluster stocks based on something other than the
criteria that are usually used. I started digging around for datasets and after a few
weeks I found one, designed by Vectorspace AI, that included scores describing the
strength of “known and hidden relationships” between stocks and elements of the
Periodic Table.

Having a background in computational genomics, this also reminded me of how
relatively unknown the relationships are between genes and their cell signaling
networks. However, when we analyze the data, we begin to see new connections and
correlations we may not have been able to predict previously:


Figure: Expression patterns of selected genes involved in signaling pathways for cell plasticity, growth and
differentiation (source: https://www.researchgate.net/figure/263706740_fig4_Expression-patterns-of-selected-genes-involved-signaling-pathways-for-cell-plasticity)

Equities, like genes, are influenced via a massive network of strong and weak hidden
relationships shared between one another. Some of these influences and relationships
can be predicted.

One of my goals was to create long and short clusters of stocks or “basket clusters” I
could use to hedge or just profit from. This would require an unsupervised machine
learning approach to create clusters of stocks that would share strong and weak
relationships with one another. These clusters would double as “baskets” of stocks my
firm could trade.

I started by downloading the dataset here. The dataset is based on relationships between
elements in the periodic table and public companies. In the future I’d like to work with
cryptocurrencies and create baskets similar to what these guys are doing here but that’s
a future project.

Then, using Python and a subset of the usual machine learning suspects (scikit-learn,
numpy, pandas, matplotlib and seaborn), I set out to understand the shape of the dataset
I was dealing with. (To do some of this I looked to a Kaggle Kernel titled “Principal
Component Analysis with KMeans visuals”.)

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sb

# Silence NumPy warnings for divide-by-zero and invalid operations
np.seterr(divide='ignore', invalid='ignore')

# Quick way to test just a few column features
# stocks = pd.read_csv('supercolumns-elements-nasdaq-nyse-otcbb-general-UPDATE-2017-03-0…

# Load the full dataset (the file name is cut off in the source listing)
stocks = pd.read_csv('supercolumns-elements-nasdaq-nyse-otcbb-general-UPDATE-2017-03-01.…')

print(stocks.head())

# Collect the names of the columns that hold strings (the symbol column)
str_list = []
for colname, colvalue in stocks.items():  # iteritems() was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str):
        str_list.append(colname)

# Get to the numeric columns by inversion
num_list = stocks.columns.difference(str_list)

stocks_num = stocks[num_list]

print(stocks_num.head())


Output: a quick view of the first 5 rows:

zack@twosigma-Dell-Precision-M3800:/home/zack/hedge_pool/baskets/hcluster$ ./hidden_rela
  Symbol_update-2017-04-01  Hydrogen   Helium  Lithium  Beryllium  Boron  \
0                        A       0.0  0.00000      0.0        0.0    0.0
1                       AA       0.0  0.00000      0.0        0.0    0.0
2                     AAAP       0.0  0.00461      0.0        0.0    0.0
3                      AAC       0.0  0.00081      0.0        0.0    0.0
4                    AACAY       0.0  0.00000      0.0        0.0    0.0

     Carbon  Nitrogen    Oxygen  Fluorine  ...   Fermium  Mendelevium  \
0  0.006632       0.0  0.007576       0.0  ...  0.000000     0.079188
1  0.000000       0.0  0.000000       0.0  ...  0.000000     0.000000
2  0.000000       0.0  0.000000       0.0  ...  0.135962     0.098090
3  0.000000       0.0  0.018409       0.0  ...  0.000000     0.000000
4  0.000000       0.0  0.000000       0.0  ...  0.000000     0.000000

   Nobelium  Lawrencium  Rutherfordium  Dubnium  Seaborgium  Bohrium  Hassium  \
0  0.197030      0.1990         0.1990      0.0         0.0      0.0      0.0
1  0.000000      0.0000         0.0000      0.0         0.0      0.0      0.0
2  0.244059      0.2465         0.2465      0.0         0.0      0.0      0.0
3  0.000000      0.0000         0.0000      0.0         0.0      0.0      0.0
4  0.000000      0.0000         0.0000      0.0         0.0      0.0      0.0

   Meitnerium
0         0.0
1         0.0
2         0.0
3         0.0
4         0.0

[5 rows x 110 columns]
   Actinium  Aluminum  Americium  Antimony     Argon   Arsenic  Astatine  \
0  0.000000       0.0        0.0  0.002379  0.047402  0.018913       0.0
1  0.000000       0.0        0.0  0.000000  0.000000  0.000000       0.0
2  0.004242       0.0        0.0  0.001299  0.000000  0.000000       0.0
3  0.000986       0.0        0.0  0.003378  0.000000  0.000000       0.0
4  0.000000       0.0        0.0  0.000000  0.000000  0.000000       0.0

   Barium  Berkelium  Beryllium  ...  Tin  Titanium  Tungsten   Uranium  \
0     0.0   0.000000        0.0  ...  0.0  0.002676       0.0  0.000000
1     0.0   0.000000        0.0  ...  0.0  0.000000       0.0  0.000000
2     0.0   0.141018        0.0  ...  0.0  0.000000       0.0  0.004226
3     0.0   0.000000        0.0  ...  0.0  0.000000       0.0  0.004086
4     0.0   0.000000        0.0  ...  0.0  0.000000       0.0  0.000000

   Vanadium  Xenon  Ytterbium   Yttrium      Zinc  Zirconium
0  0.000000    0.0        0.0  0.000000  0.000000        0.0
1  0.000000    0.0        0.0  0.000000  0.000000        0.0
2  0.002448    0.0        0.0  0.018806  0.008758        0.0
3  0.001019    0.0        0.0  0.000000  0.007933        0.0
4  0.000000    0.0        0.0  0.000000  0.000000        0.0

[5 rows x 109 columns]
zack@twosigma-Dell-Precision-M3800:/home/zack/hedge_pool/baskets/hcluster$

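The split between the symbol column and the numeric feature columns can also be sketched with pandas' select_dtypes. The tiny frame below is a made-up stand-in for the dataset (the "Symbol" column name is shortened and only a few values are copied from the output above), so treat it as an illustration rather than the exact code I ran:

```python
import pandas as pd

# Hypothetical miniature of the dataset: one symbol column plus element scores.
stocks_demo = pd.DataFrame({
    "Symbol": ["A", "AA", "AAAP"],
    "Helium": [0.0, 0.0, 0.00461],
    "Carbon": [0.006632, 0.0, 0.0],
})

# Keep only the numeric feature columns; the string column is left out.
stocks_demo_num = stocks_demo.select_dtypes(include="number")

print(list(stocks_demo_num.columns))
```

This avoids inspecting a single row to guess each column's type, which can misfire if that row happens to hold a missing value.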



Next, a Pearson correlation of the concept features, in this case minerals and elements
from the periodic table:

# Replace missing scores with zero
stocks_num = stocks_num.fillna(value=0)

X = stocks_num.values

# Standardize each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

f, ax = plt.subplots(figsize=(12, 10))
plt.title('Pearson Correlation of Concept Features (Elements & Minerals)')

# Draw the heatmap using seaborn
# (the colormap name and any later arguments are cut off in the source listing)
sb.heatmap(stocks_num.astype(float).corr(), linewidths=0.25, vmax=1.0, square=True, cmap="…")
# sb.plt is no longer available in seaborn, so call matplotlib directly
plt.show()


Output (run against the first 16 samples for this visualization example). It’s also
interesting to see how elements in the periodic table correlate with public companies. At
some point, I’d like to use the data to predict breakthroughs a company might make
based on its correlation with interesting elements or materials.
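To make the heatmap values concrete, here is a minimal sketch of what a single cell means: the Pearson coefficient for one pair of feature columns, computed by hand and then with pandas. The two columns and their values are invented for illustration and do not come from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up scores for two hypothetical feature columns across five companies.
demo = pd.DataFrame({
    "Cobalt": [0.00, 0.10, 0.20, 0.05, 0.30],
    "Copper": [0.01, 0.12, 0.18, 0.02, 0.33],
})

# Pearson r by hand: co-deviation divided by the product of the spreads.
x = demo["Cobalt"].to_numpy()
y = demo["Copper"].to_numpy()
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# pandas computes the same quantity for every pair of columns.
r_pandas = demo.corr(method="pearson").loc["Cobalt", "Copper"]

print(round(r_manual, 4), round(r_pandas, 4))
```

A value near 1 means the two concepts tend to score high for the same companies, which is what the brighter cells in a heatmap like this indicate.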


Measuring ‘Explained Variance’ & Principal Component Analysis (PCA)
Explained variance = (total variance - residual variance). The explained variance
measure can guide how many PCA projection components are worth looking at. It is
also nicely described in Sebastian Raschka’s article on Principal Component Analysis:
http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)  # Cumulative explained variance

# Variances plot
max_cols = len(stocks.columns) - 1  # number of numeric feature columns
plt.figure(figsize=(10, 5))
# (the label text on the next line is cut off in the source listing)
plt.bar(range(max_cols), var_exp, alpha=0.3333, align='center', label='individual explai…')
plt.step(range(max_cols), cum_var_exp, where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()


Output:

From this chart we can see that a large amount of variance comes from the first 85% of
the predicted Principal Components. That’s a high number, so let’s start at the low end
and model just a handful of Principal Components. More information on choosing a
reasonable number of Principal Components can be found here.
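One way to turn the cumulative curve into a concrete number of components is to pick the smallest count that reaches a target share of the variance. The sketch below does that with plain NumPy on synthetic data. The 0.85 target and all the `*_demo` names are assumptions made for this example, not values I used on the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for X_std: 200 samples, 6 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigenvalues of the covariance matrix, largest first
# (eigvalsh is meant for symmetric matrices and returns real values).
eig_vals_demo = np.linalg.eigvalsh(np.cov(X_demo.T))[::-1]

# Share of total variance per component, and the running total.
ratio = eig_vals_demo / eig_vals_demo.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches the target.
target = 0.85  # assumed threshold, chosen only for illustration
n_keep = int(np.searchsorted(cumulative, target) + 1)

print(n_keep, np.round(cumulative, 3))
```

On the real data the same idea would apply to `cum_var_exp`, keeping in mind that it is expressed in percent rather than as a fraction.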

Using scikit-learn’s PCA module, let’s set n_components = 9. The second line of the code
calls the “fit_transform” method, which fits the PCA model with the standardized stock
data X_std and applies the dimensionality reduction on this dataset.

pca = PCA(n_components=9)
x_9d = pca.fit_transform(X_std)

# Scatter the first two principal components against each other
plt.figure(figsize=(9, 7))
plt.scatter(x_9d[:, 0], x_9d[:, 1], c='goldenrod', alpha=0.5)
plt.ylim(-10, 30)
plt.show()


Output:

We don’t really observe even faint outlines of clusters here, so we should likely continue
adjusting the n_components value until we see something we like. This relates to the
“art” part of data science.

Now let’s try K-Means in the next section to see if we are able to visualize any distinct
clusters.

K-Means Clustering
A simple K-Means will now be applied using the PCA projection data.


Using scikit-learn’s KMeans() call and the “fit_predict” method, we compute cluster
centers and predict cluster indices on the PCA projections, then plot the first and third
projections against each other (to see if we can observe any appreciable clusters). We
define our own color scheme and plot the scatter diagram as follows:

# Set a 3 KMeans clustering
kmeans = KMeans(n_clusters=3)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(x_9d)

# Define our own color map
LABEL_COLOR_MAP = {0: 'r', 1: 'g', 2: 'b'}
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize=(7, 7))
plt.scatter(x_9d[:, 0], x_9d[:, 2], c=label_color, alpha=0.5)
plt.show()


Output:


This K-Means plot looks more promising. If our simple clustering assumption turns out
to be right, we can observe 3 distinguishable clusters via this color
visualization scheme.
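Whether 3 is a sensible number of clusters can be checked rather than eyeballed. One common option, which I did not use in the walkthrough above, is to compare silhouette scores across several values of k. The sketch below runs on synthetic blobs standing in for the PCA projections, so the data and the range of k are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the PCA projections: three well-separated blobs in 2-D.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for several cluster counts and record the mean silhouette score.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    silhouettes[k] = silhouette_score(points, labels)

# The k with the highest score gives the most compact, best-separated clusters.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

On real, messier data the scores are usually much closer together, so this is a guide for choosing k rather than a verdict.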

Of course, there are many different ways to cluster and visualize a dataset like this as
shown here.

Using seaborn’s convenient pairplot function I can automatically plot all the features in
the dataframe in a pairwise manner. We can pairplot the first 3 projections against one
another and visualize:

# Create a temp dataframe from our PCA projection data "x_9d"
df = pd.DataFrame(x_9d)
df = df[[0, 1, 2]]
df['X_cluster'] = X_clustered

# Call Seaborn's pairplot to visualize our KMeans clustering on the PCA projected data
# (recent seaborn versions use "height" in place of the older "size" argument)
sb.pairplot(df, hue='X_cluster', palette='Dark2', diag_kind='kde', height=1.85)
plt.show()


Output:


Building Basket Clusters


How you fine-tune your clusters is up to you. There’s no silver bullet for this, and much of
it depends on the context in which you’re operating: in this case, stocks, equities and
the financial markets as defined by hidden relationships.

Once you’re satisfied with your clusters and have set scoring thresholds to control
whether certain stocks qualify for a cluster, you can extract the stocks for a given
cluster and trade them as baskets or use the baskets as signals. What you can do with
this kind of approach depends largely on your creativity and on how well you can
optimize the returns of each cluster, for example with deep learning variants, by
choosing which concepts to cluster on or by adding data points such as the size of a
company’s short interest or float (available shares on the open market).
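As a rough sketch of the extraction step, the snippet below maps K-Means labels back to symbols and keeps only the members that sit close to their cluster center. The symbols, the synthetic projections and the distance-based rule with `max_distance` are all assumptions for illustration. My own scoring thresholds stay black-boxed:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Hypothetical symbols and synthetic projections; not real tickers or scores.
symbols = [f"SYM{i:02d}" for i in range(30)]
projections = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(6, 6), scale=0.5, size=(10, 2)),
    rng.normal(loc=(-6, 6), scale=0.5, size=(10, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(projections)

# Distance from each point to the center of the cluster it was assigned to.
dist_to_center = km.transform(projections)[np.arange(len(symbols)), km.labels_]

members = pd.DataFrame({
    "symbol": symbols,
    "cluster": km.labels_,
    "distance": dist_to_center,
})

# Assumed qualification rule: keep members within max_distance of their center.
max_distance = 1.0
baskets = (
    members[members["distance"] <= max_distance]
    .groupby("cluster")["symbol"]
    .apply(list)
    .to_dict()
)
print(baskets)
```

Each value in `baskets` is a list of symbols that could be traded together or watched as a signal. Tightening `max_distance` yields smaller, more homogeneous baskets.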

You might notice a few interesting traits in the way these clusters trade as baskets.
Sometimes there’s divergence from the S&P or the general market. This can offer
opportunities for arbitrage based essentially on ‘information arbitrage’. Some clusters
can correlate to Google search trends.

It might be interesting to see clusters related to materials and their supply chain as
mentioned in this article: “Zooming in on 10 materials and their supply chains”. Using
the dataset, I only operated on the feature column labels: ‘Cobalt’, ‘Copper’, ‘Gallium’
and ‘Graphene’ just to see if I might uncover any interesting hidden connections between
public companies working in this area or exposed to risk in this area. These baskets are
also compared against the returns of the S&P (SPY).
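Restricting the dataset to a handful of concept columns is a one-liner in pandas. In the sketch below the four labels are the ones named above, while the symbols, the extra "Helium" column and every score are invented for illustration:

```python
import pandas as pd

# Made-up rows; only the four selected column labels come from the article.
element_scores = pd.DataFrame(
    {
        "Cobalt":   [0.00, 0.12, 0.03, 0.00],
        "Copper":   [0.20, 0.10, 0.00, 0.00],
        "Gallium":  [0.00, 0.05, 0.00, 0.00],
        "Graphene": [0.01, 0.08, 0.00, 0.00],
        "Helium":   [0.30, 0.00, 0.00, 0.40],
    },
    index=["SYM_A", "SYM_B", "SYM_C", "SYM_D"],  # hypothetical symbols
)

feature_labels = ["Cobalt", "Copper", "Gallium", "Graphene"]
subset = element_scores[feature_labels]

# Drop companies with no relationship score for any of the selected concepts.
subset = subset[(subset > 0).any(axis=1)]
print(list(subset.index))
```

The filtered frame can then go through the same standardize, PCA and K-Means steps as the full dataset.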

By using historical price data, which is readily available at outlets like Quantopian,
Numerai, Quandl or Yahoo Finance, you can then aggregate price data to generate
projected returns visualized using HighCharts:
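A minimal sketch of that aggregation step, using synthetic prices in place of a real data feed: the symbols, the "BENCH" benchmark column, the drift values and the equal weighting are all assumptions for illustration, and nothing here reproduces my actual returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.bdate_range("2017-01-02", periods=60)

def fake_prices(drift):
    """Synthetic daily closes: a random walk with the given average daily return."""
    daily = rng.normal(loc=drift, scale=0.01, size=len(dates))
    return 100 * np.cumprod(1 + daily)

# Three hypothetical basket members and a stand-in benchmark.
prices = pd.DataFrame(
    {
        "SYM_A": fake_prices(0.001),
        "SYM_B": fake_prices(0.0005),
        "SYM_C": fake_prices(0.0),
        "BENCH": fake_prices(0.0004),
    },
    index=dates,
)

daily_returns = prices.pct_change().dropna()

# Equal-weight basket, rebalanced daily: average the members' daily returns,
# then compound them into a cumulative return series.
basket_daily = daily_returns[["SYM_A", "SYM_B", "SYM_C"]].mean(axis=1)
basket_cum = (1 + basket_daily).cumprod() - 1
bench_cum = (1 + daily_returns["BENCH"]).cumprod() - 1

excess = basket_cum.iloc[-1] - bench_cum.iloc[-1]
print(f"basket {basket_cum.iloc[-1]:.2%}, benchmark {bench_cum.iloc[-1]:.2%}, excess {excess:.2%}")
```

The two cumulative series are what you would hand to a charting library to compare a basket against the S&P.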


The returns I gained from the cluster above beat the S&P by a nice margin, which means
you would have approximately an extra 10% over the S&P annually. I’ve seen more
aggressive approaches net close to 70% annually. Now I have to admit that I do a few
other things that I have to keep black-boxed due to the nature of my work, but from what
I’ve observed so far, at least exploring or wrapping new quantitative models around this
approach could turn out to be well worth it, with the only downside being that you end
up with a different kind of signal you could pipe into another system.

Generating short basket clusters could be more profitable than long basket clusters. This
approach deserves its own article, ideally before the next Black Swan event.

Getting ahead of parasitic, symbiotic and sympathetic relationships between public
companies that share known and hidden relationships can be fun and profitable if you’re
into machine learning. In the end, one’s ability to profit seems to be all about how clever
they can get in coming up with powerful combinations of feature labels, or “concepts”,
when generating these kinds of datasets.

My next iteration on this kind of model should probably include a separate algorithm for
auto-generating feature combinations or unique lists, perhaps based on near real-time
events that might affect groups of stocks with hidden relationships that only humans,
outfitted with unsupervised machine learning algorithms, can predict.
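The simplest starting point for auto-generating feature combinations is plain enumeration. The sketch below lists every pair and triple from a small pool of concept labels. The pool and the sizes are assumptions for illustration, and a real version would need some way to score or prune the candidates:

```python
from itertools import combinations

# Concept labels taken from the materials example; any list of features would do.
concepts = ["Cobalt", "Copper", "Gallium", "Graphene"]

# Every unordered combination of two or three concepts.
feature_sets = [
    combo
    for size in (2, 3)
    for combo in combinations(concepts, size)
]

print(len(feature_sets))  # 6 pairs + 4 triples = 10
```

The count grows quickly with the size of the pool, which is why a separate algorithm for choosing promising combinations would be needed for the full set of element columns.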

Gaëtan R., Financial Data Consultant, Geneva, Switzerland

August 27, 2021
