Qlib: An AI-oriented Quantitative Investment Platform
Xiao Yang, Weiqing Liu, Dong Zhou, Jiang Bian and Tie-Yan Liu
Microsoft Research
{Xiao.Yang, Weiqing.Liu, Zhou.Dong, Jiang.Bian, Tie-Yan.Liu}@microsoft.com
[…]searcher to realize the great potential of AI technologies in quantitative investment.

[Figure 1: Modules and a typical workflow built with Qlib — the Data Layer (Data Server, Data Enhancement), the Interday Model modules (Model Creator, Model Manager, Model Ensemble), the Interday Strategy (Portfolio Generator) producing portfolios and orders, Intraday Trading (Order Executor), and the Alpha/Portfolio/Execution Analysers with their return/risk reports, grouped into the Static Workflow, Dynamic Modeling, and Analysis parts with a feedback loop to the Model Creator.]

3 AI-oriented Quantitative Investment Platform

3.1 Overall Design
[…] of hands-on experience in the financial market, we've encountered all of the above problems and explored all kinds of solutions. Motivated by current circumstances, we implement Qlib to apply AI technologies in quantitative investment.
AI-oriented framework. Qlib is designed in a modularized way based on the modern research workflow to provide the maximum flexibility to accommodate AI technologies. Quantitative researchers could extend the modules and build a workflow to try their ideas efficiently. In each module, Qlib provides several default implementation choices that work very well in practical investment. With these off-the-shelf modules, quantitative researchers could focus on the problem they are interested in within a specific module without being distracted by other trivial details. Besides code, computation and data can also be shared in some modules, so Qlib is designed to serve users as a platform rather than a toolbox.
High-performance infrastructure. The performance of data processing is important to data-driven methods like AI technologies. As an AI-oriented platform, Qlib provides a high-performance data infrastructure: a time-series flat-file database[3] dedicated to scientific computing on finance data. It greatly outperforms current popular storage solutions, such as general-purpose databases and time-series databases, on typical data processing tasks in quantitative investment research. Furthermore, the database provides an expression engine, which accelerates the implementation and computation of factors/features and makes research topics that rely on expression computation feasible.

[3] https://en.wikipedia.org/wiki/Flat-file_database
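To make the expression engine concrete, the sketch below queries a handful of price-derived features through Qlib's public data API. The provider_uri, instrument code, date range, and the particular expressions are illustrative and assume Qlib's sample data has already been downloaded locally.

```python
import qlib
from qlib.data import D

# Illustrative only: assumes Qlib's sample data is available at this path.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")

fields = [
    "$close",            # raw stored attribute
    "Ref($close, 1)",    # close price one trading day earlier
    "Mean($close, 5)",   # 5-day moving average computed by the expression engine
    "$high/$close",      # cross-attribute expression
]
# Retrieve the expressions for one (illustrative) instrument and date range.
df = D.features(["SH600000"], fields,
                start_time="2019-01-01", end_time="2019-12-31")
print(df.head())
```

Because the expressions are evaluated inside the data infrastructure rather than in user scripts, new factors can be prototyped by changing only the expression strings.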
Guidance for machine learning. Qlib has been integrated with some typical datasets for quantitative investment, on which typical machine learning algorithms could successfully learn patterns with generalization ability. Qlib provides some basic guidance for machine learning users and integrates reasonable tasks consisting of a reasonable feature space and target label. Some typical hyperparameter optimization tools are also provided. With this guidance and these reasonable settings, machine learning models could learn patterns with better generalization ability instead of just over-fitting the noise.
3.2 AI-oriented Framework
Figure 1 shows the overall framework of Qlib. This framework aims to 1) accommodate modern AI technologies, 2) help quantitative researchers build a whole research workflow with minimal effort, and 3) leave them the maximal flexibility to explore the problems they are interested in without getting distracted by other parts.
Such a target leads to a modularized design from the perspective of system design. The system is split into several individual modules based on the modern practical research workflow. Most quantitative investment research directions, no matter traditional or AI-based, could be regarded as implementations of one or multiple modules' interfaces. Qlib provides several typical implementations in each module that work well in practical investment. Moreover, the modules give researchers the flexibility to override existing methods to explore new ideas. With such a framework, researchers could try new ideas and test the overall performance together with the other modules at minimal cost.
The modules of Qlib are listed in Figure 1 and connected in a typical workflow. Each module corresponds to a typical sub-task in quantitative investment, and an implementation in a module can be regarded as a solution for that task. We'll introduce each module and give some related examples of existing quantitative research to show how Qlib accommodates them.
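To picture how a research idea maps onto "one or multiple modules' interfaces", the sketch below defines a few hypothetical abstract interfaces named after the modules in Figure 1; the class and method names are ours for illustration and do not mirror Qlib's actual class hierarchy.

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces, named after the modules in Figure 1 purely for
# illustration; they are not Qlib's real classes.
class DataEnhancement(ABC):
    @abstractmethod
    def build_dataset(self, raw_data):
        """Turn raw data from the Data Server into features and labels."""

class ModelCreator(ABC):
    @abstractmethod
    def fit(self, dataset):
        """Learn a model from the dataset."""

    @abstractmethod
    def predict(self, dataset):
        """Produce forecasting/alpha signals."""

class PortfolioGenerator(ABC):
    @abstractmethod
    def generate(self, signals):
        """Turn trading signals into a target portfolio."""

# A researcher exploring, say, a new forecasting model would override only
# ModelCreator and reuse the default implementations of the other modules.
```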
It starts with the Data Server module in the bottom-left corner, which provides a data engine to query and process raw data. With the retrieved data, researchers could build their own datasets in the Data Enhancement module. Researchers have tried a lot of solutions to build better datasets by exploring and constructing effective factors/features [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]. Generating datasets for training [Feng et al., 2019] is another research direction that provides dataset solutions. The Model Creator module learns models based on datasets. In recent years, numerous researchers have explored all kinds of models to mine trading signals from financial datasets [Sezer et al., 2019]. Moreover, meta-learning [Vilalta and Drissi, 2002], which tries to learn to learn, provides a new learning paradigm for the Model Creator module. Given plenty of methods to model financial data in a modern research workflow, the model management system has become a necessary part of the workflow; the Model Manager module is designed to handle such problems for modern quantitative researchers. With diverse models, ensemble learning is quite an effective way to enhance the performance and robustness of machine learning models, and it is frequently used in the financial area [Qiu et al., 2014; Yang et al., 2017; Zhao et al., 2017]; it is supported by the Model Ensemble module. The Portfolio Generator module aims to generate a portfolio from the trading signals output by models, which is known as portfolio management [Qian et al., 2007]; Barra [Sheikh, 1996] provides the most popular solution for this task. With the target portfolio, we provide a high-fidelity trading simulator, the Order Executor module, to examine the performance of a strategy, and Analyser modules to automatically analyze the trading signals, portfolio and execution results. The Order Executor module is designed as a responsive simulator rather than a back-testing function, so it can provide the infrastructure for learning paradigms (e.g., RL) that require feedback from the environment produced by the Analyser modules.
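The "responsive simulator" design can be pictured as an environment-style interaction loop; the wrapper below is a hypothetical sketch of that idea, and its class and method names are not Qlib's actual executor API.

```python
# Hypothetical sketch of an order-execution environment that returns feedback
# step by step, so a learning agent (e.g., RL) can adapt its orders intraday.
class ExecutionEnv:  # not a real Qlib class
    def reset(self, target_portfolio):
        """Start a new trading session for the given target portfolio."""
        raise NotImplementedError

    def step(self, orders):
        """Execute the orders and return (state, reward, done) feedback."""
        raise NotImplementedError

def run_episode(env, agent, target_portfolio):
    state = env.reset(target_portfolio)
    done = False
    while not done:
        orders = agent.act(state)               # decide orders from feedback
        state, reward, done = env.step(orders)  # simulator responds immediately
        agent.learn(state, reward)              # analysis results feed back in
```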
The data in quantitative investment are in time-series format and are updated over time, so the size of the in-sample dataset keeps increasing. A typical practice to leverage the new data is to update the models regularly [Wang et al., 2019b]. Besides better utilization of the increasing in-sample data, dynamically updating models [Yang et al., 2019] and trading strategies [Wang et al., 2019a] will improve performance further due to the dynamic nature of the stock market [Adam et al., 2016]. Therefore, it is obviously not optimal to use a fixed set of static models and trading strategies, as in the Static Workflow. Dynamic updating of models and strategies is an important research direction in quantitative investment, and the modules in the Dynamic Modeling part provide interfaces and infrastructure to accommodate such solutions.
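A minimal sketch of the "update models regularly" practice, assuming a generic model factory with fit/predict methods and a pandas feature table indexed by (instrument, datetime); the retraining frequency and window choice are illustrative, not Qlib defaults.

```python
import pandas as pd

def rolling_update(make_model, features: pd.DataFrame, labels: pd.Series,
                   retrain_every: int = 20) -> pd.Series:
    """Refit a fresh model on an expanding in-sample window every
    `retrain_every` trading days and predict the following period."""
    dates = features.index.get_level_values("datetime").unique().sort_values()
    preds = []
    for start in range(retrain_every, len(dates), retrain_every):
        in_dates = dates[:start]                        # all data seen so far
        out_dates = dates[start:start + retrain_every]  # next period to predict
        in_mask = features.index.get_level_values("datetime").isin(in_dates)
        out_mask = features.index.get_level_values("datetime").isin(out_dates)
        model = make_model()                            # fresh model each refit
        model.fit(features[in_mask], labels[in_mask])
        preds.append(pd.Series(model.predict(features[out_mask]),
                               index=features[out_mask].index))
    return pd.concat(preds) if preds else pd.Series(dtype=float)
```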
3.3 High Performance Infrastructure
Financial data
We'll summarize the data requirements of quantitative research in this section. In quantitative research, the most frequently used data follow the format

\[
BasicData_T = \{x_{i,t,a}\}, \quad i \in Inst,\ t \in Time,\ a \in Attr
\]
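As an illustration of this layout (not Qlib's internal storage format), the same $BasicData_T$ can be held in a pandas DataFrame indexed by (instrument, timestamp) with one column per attribute; all values below are made up.

```python
import pandas as pd

# Toy BasicData_T: x[i, t, a] indexed by instrument i and timestamp t,
# with one column per attribute a (values fabricated for illustration).
index = pd.MultiIndex.from_product(
    [["MSFT", "GOOGL"], pd.to_datetime(["2020-09-21", "2020-09-22"])],
    names=["instrument", "datetime"])
basic_data = pd.DataFrame(
    {"open":  [202.0, 203.1, 1430.0, 1436.5],
     "close": [202.5, 207.4, 1431.2, 1465.0]},
    index=index)

# x_{MSFT, 2020-09-22, close}
print(basic_data.loc[("MSFT", pd.Timestamp("2020-09-22")), "close"])
```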
where $x_{i,t,a}$ is a value of a basic type (e.g., float, int), $Inst$ denotes the set of financial instruments (e.g., stocks, options), $Time$ denotes the set of timestamps (e.g., the trading days of the stock market), $Attr$ denotes the set of possible attributes of an instrument (e.g., open price, volume, market value), and $T$ denotes the latest timestamp of the data (e.g., the latest trading date). $x_{i,t,a}$ denotes the value of attribute $a$ of instrument $i$ at time $t$.
Besides, instrument pools are necessary information to specify a set of financial instruments which change over time […]

[…] new data is necessary. The formalized update operation is

\[
\begin{aligned}
BasicData_T &= OldBasicData_T \cup \{x^{new}_{i,t,a}\} \\
BasicData_{T+1} &= BasicData_T \cup \{x_{i,T+1,a}\} \\
Pool_{T+1} &= Pool_T \cup \{pool_{T+1}\}
\end{aligned}
\]
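The three update cases map directly onto ordinary DataFrame operations; the sketch below, using the same toy layout and made-up values as before, is only meant to mirror the set unions above.

```python
import pandas as pd

# Toy frame with the (instrument, datetime) x attribute layout used above.
idx = pd.MultiIndex.from_product(
    [["MSFT", "GOOGL"], [pd.Timestamp("2020-09-22")]],
    names=["instrument", "datetime"])
basic_data = pd.DataFrame({"close": [207.4, 1465.0]}, index=idx)

# OldBasicData_T -> BasicData_T: revise a value at an existing timestamp.
basic_data.loc[("MSFT", pd.Timestamp("2020-09-22")), "close"] = 207.42

# BasicData_T -> BasicData_{T+1}: append the rows of the new timestamp T+1.
new_rows = pd.DataFrame(
    {"close": [205.9, 1459.1]},
    index=pd.MultiIndex.from_product(
        [["MSFT", "GOOGL"], [pd.Timestamp("2020-09-23")]],
        names=["instrument", "datetime"]))
basic_data = pd.concat([basic_data, new_rows]).sort_index()

# Pool_T -> Pool_{T+1}: record the (possibly changed) pool for the new date.
pool = {pd.Timestamp("2020-09-22"): {"MSFT", "GOOGL"},
        pd.Timestamp("2020-09-23"): {"MSFT", "GOOGL", "AMZN"}}
```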
User queries can be formalized as

\[
\begin{aligned}
DataQuery = \{x_{i,t,a} \mid\ & i \in pool_t,\ pool_t \in Pool_{query}, \\
& a \in Attr_{query},\ time_{start} \le t \le time_{end}\}
\end{aligned}
\]

which represents a query for some attributes of the instruments in a specific pool over a specific time range.
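Against the same toy layout, this query is just a boolean filter over the index plus a column selection; the pool dictionary convention follows the sketches above and is purely illustrative.

```python
import pandas as pd

def data_query(basic_data: pd.DataFrame, pool: dict, attrs: list,
               time_start, time_end) -> pd.DataFrame:
    """Illustrative DataQuery: keep x_{i,t,a} where instrument i belongs to
    pool_t, a is a requested attribute, and time_start <= t <= time_end."""
    t0, t1 = pd.Timestamp(time_start), pd.Timestamp(time_end)
    inst = basic_data.index.get_level_values("instrument")
    ts = basic_data.index.get_level_values("datetime")
    mask = [(t0 <= t <= t1) and (i in pool.get(t, set()))
            for i, t in zip(inst, ts)]
    return basic_data.loc[mask, attrs]
```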
Such requirements are quite simple, and many off-the-shelf open-source solutions support such operations. We classify them into three categories and list popular implementations in each category.
• General-purpose database: MySQL [MySQL, 2001], MongoDB [Chodorow, 2013]
• Time-series database: InfluxDB [Naqvi et al., 2017]
• Data file for scientific computing: data organized as numpy [Oliphant, 2006] arrays or pandas [McKinney, 2011] dataframes
The general-purpose database supports data with diverse formats and structures. Besides, it provides lots of sophisticated mechanisms, such as indexing, transactions, and the entity-relationship model. Most of them add heavy dependencies and unnecessary complexity to a specific task rather than solving the key problems in a specific scenario. The time-series database optimizes its data structures and queries for time-series data. But neither kind is designed for quantitative research, where the data are usually kept in a compact array-based format for scientific computation to take advantage of hardware acceleration. It will save a great amount of time if the data keep the compact array-based format from the disk to the end client without format transformation. However, both general-purpose and time-series databases store and transfer the data in a different, general-purpose format, which is inefficient for scientific computation.
Due to the inefficiency of databases, array-based data have gained popularity in the scientific community. Numpy arrays and pandas dataframes are the mainstream implementations in scientific computation, and they are often stored as HDF5 or pickle files on the disk. Data in such formats have light dependencies and are very efficient for scientific computing. However, such data are stored in a single file and are hard to update or query.
After an investigation of the above storage solutions, we find that none fits the quantitative research scenario very well. It is necessary to design a customized solution for quantitative research.
[Figure 2: The description of the flat-file database; the left part is the structure of the files (calendar.txt with the shared timeline, instrument pool files such as sp500.txt listing each instrument's start timestamp, and per-attribute fixed-width binary files such as open.bin and close.bin), and the right part is the content of the files.]

[Figure 3: The disk cache system of Qlib — original binary data, an expression cache for saving expression computation time (e.g., high/close, open/close), and a dataset cache for saving data combination time, together with cache updates.]
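The fixed-width layout sketched in Figure 2 can be read with plain numpy. The file names, the float32 element width, and the convention that an instrument's series starts at the index of its first timestamp on the shared calendar are assumptions for illustration, not the exact on-disk format.

```python
import numpy as np

def load_attribute(bin_path: str, calendar_length: int,
                   start_index: int) -> np.ndarray:
    """Read one instrument's fixed-width binary attribute file (e.g. open.bin)
    and align it to the shared calendar (illustrative of Figure 2's layout)."""
    values = np.fromfile(bin_path, dtype=np.float32)       # compact array on disk
    series = np.full(calendar_length, np.nan, dtype=np.float32)
    end = min(start_index + len(values), calendar_length)
    series[start_index:end] = values[:end - start_index]
    return series  # position i corresponds to the i-th timestamp in calendar.txt
```

Keeping each attribute in its own compact binary file means the data reach scientific-computing clients without format transformation, which is the property the text above argues general-purpose and time-series databases lack.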