Qlib: An AI-oriented Quantitative Investment Platform
Xiao Yang, Weiqing Liu, Dong Zhou, Jiang Bian and Tie-Yan Liu
Microsoft Research
{Xiao.Yang, Weiqing.Liu, Zhou.Dong, Jiang.Bian, Tie-Yan.Liu}@microsoft.com
[…]searcher to realize the great potential of AI technologies in quantitative investment.

[Figure 1: Modules and a typical workflow built with Qlib — the Data Layer (Data Server, Data Enhancement), the Interday Model modules (Model Creator, Model Manager, Model Ensemble), the Interday Strategy (Portfolio Generator) producing portfolios and orders, Intraday Trading (Order Executor), and the Alpha/Portfolio/Execution Analysers with their return/risk reports, grouped into the Static Workflow, Dynamic Modeling, and Analysis parts with a feedback loop to the Model Creator.]

3 AI-oriented Quantitative Investment Platform

3.1 Overall Design
[…] of hands-on experience in the financial market, we've encountered all of the above problems and explored all kinds of solutions. Motivated by current circumstances, we implement Qlib to apply AI technologies in quantitative investment.
AI-oriented framework. Qlib is designed in a modularized way based on the modern research workflow to provide the maximum flexibility to accommodate AI technologies. Quantitative researchers could extend the modules and build a workflow to try their ideas efficiently. In each module, Qlib provides several default implementation choices that work very well in practical investment. With these off-the-shelf modules, quantitative researchers could focus on the problem they are interested in within a specific module without being distracted by other trivial details. Besides code, computation and data can also be shared in some modules, so Qlib is designed to serve users as a platform rather than a toolbox.
High-performance infrastructure. The performance of data processing is important to data-driven methods like AI technologies. As an AI-oriented platform, Qlib provides a high-performance data infrastructure: a time-series flat-file database[3] dedicated to scientific computing on finance data. It greatly outperforms current popular storage solutions, such as general-purpose databases and time-series databases, on typical data processing tasks in quantitative investment research. Furthermore, the database provides an expression engine, which accelerates the implementation and computation of factors/features and makes research topics that rely on expression computation feasible.

[3] https://en.wikipedia.org/wiki/Flat-file_database
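To make the expression engine concrete, the sketch below queries a handful of price-derived features through Qlib's public data API. The provider_uri, instrument code, date range, and the particular expressions are illustrative and assume Qlib's sample data has already been downloaded locally.

```python
import qlib
from qlib.data import D

# Illustrative only: assumes Qlib's sample data is available at this path.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")

fields = [
    "$close",            # raw stored attribute
    "Ref($close, 1)",    # close price one trading day earlier
    "Mean($close, 5)",   # 5-day moving average computed by the expression engine
    "$high/$close",      # cross-attribute expression
]
# Retrieve the expressions for one (illustrative) instrument and date range.
df = D.features(["SH600000"], fields,
                start_time="2019-01-01", end_time="2019-12-31")
print(df.head())
```

Because the expressions are evaluated inside the data infrastructure rather than in user scripts, new factors can be prototyped by changing only the expression strings.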
Guidance for machine learning. Qlib has been integrated with some typical datasets for quantitative investment, on which typical machine learning algorithms could successfully learn patterns with generalization ability. Qlib provides some basic guidance for machine learning users and integrates reasonable tasks consisting of a reasonable feature space and target label. Some typical hyperparameter optimization tools are also provided. With this guidance and these reasonable settings, machine learning models could learn patterns with better generalization ability instead of just over-fitting the noise.
3.2 AI-oriented Framework
Figure 1 shows the overall framework of Qlib. This framework aims to 1) accommodate modern AI technologies, 2) help quantitative researchers build a whole research workflow with minimal effort, and 3) leave them the maximal flexibility to explore the problems they are interested in without getting distracted by other parts.
Such a target leads to a modularized design from the perspective of system design. The system is split into several individual modules based on the modern practical research workflow. Most quantitative investment research directions, no matter traditional or AI-based, could be regarded as implementations of one or multiple modules' interfaces. Qlib provides several typical implementations in each module that work well in practical investment. Moreover, the modules give researchers the flexibility to override existing methods to explore new ideas. With such a framework, researchers could try new ideas and test the overall performance together with the other modules at minimal cost.
The modules of Qlib are listed in Figure 1 and connected in a typical workflow. Each module corresponds to a typical sub-task in quantitative investment, and an implementation in a module can be regarded as a solution for that task. We'll introduce each module and give some related examples of existing quantitative research to show how Qlib accommodates them.
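To picture how a research idea maps onto "one or multiple modules' interfaces", the sketch below defines a few hypothetical abstract interfaces named after the modules in Figure 1; the class and method names are ours for illustration and do not mirror Qlib's actual class hierarchy.

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces, named after the modules in Figure 1 purely for
# illustration; they are not Qlib's real classes.
class DataEnhancement(ABC):
    @abstractmethod
    def build_dataset(self, raw_data):
        """Turn raw data from the Data Server into features and labels."""

class ModelCreator(ABC):
    @abstractmethod
    def fit(self, dataset):
        """Learn a model from the dataset."""

    @abstractmethod
    def predict(self, dataset):
        """Produce forecasting/alpha signals."""

class PortfolioGenerator(ABC):
    @abstractmethod
    def generate(self, signals):
        """Turn trading signals into a target portfolio."""

# A researcher exploring, say, a new forecasting model would override only
# ModelCreator and reuse the default implementations of the other modules.
```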
It starts with the Data Server module in the bottom-left corner, which provides a data engine to query and process raw data. With the retrieved data, researchers could build their own datasets in the Data Enhancement module. Researchers have tried a lot of solutions to build better datasets by exploring and constructing effective factors/features [Potvin et al., 2004; Neely et al., 1997; Allen and Karjalainen, 1999; Kakushadze, 2016]. Generating datasets for training [Feng et al., 2019] is another research direction that provides dataset solutions. The Model Creator module learns models based on datasets. In recent years, numerous researchers have explored all kinds of models to mine trading signals from financial datasets [Sezer et al., 2019]. Moreover, meta-learning [Vilalta and Drissi, 2002], which tries to learn to learn, provides a new learning paradigm for the Model Creator module. Given plenty of methods to model financial data in a modern research workflow, the model management system has become a necessary part of the workflow; the Model Manager module is designed to handle such problems for modern quantitative researchers. With diverse models, ensemble learning is quite an effective way to enhance the performance and robustness of machine learning models, and it is frequently used in the financial area [Qiu et al., 2014; Yang et al., 2017; Zhao et al., 2017]; it is supported by the Model Ensemble module. The Portfolio Generator module aims to generate a portfolio from the trading signals output by models, which is known as portfolio management [Qian et al., 2007]; Barra [Sheikh, 1996] provides the most popular solution for this task. With the target portfolio, we provide a high-fidelity trading simulator, the Order Executor module, to examine the performance of a strategy, and Analyser modules to automatically analyze the trading signals, portfolio and execution results. The Order Executor module is designed as a responsive simulator rather than a back-testing function, so it can provide the infrastructure for learning paradigms (e.g., RL) that require feedback from the environment produced by the Analyser modules.
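The "responsive simulator" design can be pictured as an environment-style interaction loop; the wrapper below is a hypothetical sketch of that idea, and its class and method names are not Qlib's actual executor API.

```python
# Hypothetical sketch of an order-execution environment that returns feedback
# step by step, so a learning agent (e.g., RL) can adapt its orders intraday.
class ExecutionEnv:  # not a real Qlib class
    def reset(self, target_portfolio):
        """Start a new trading session for the given target portfolio."""
        raise NotImplementedError

    def step(self, orders):
        """Execute the orders and return (state, reward, done) feedback."""
        raise NotImplementedError

def run_episode(env, agent, target_portfolio):
    state = env.reset(target_portfolio)
    done = False
    while not done:
        orders = agent.act(state)               # decide orders from feedback
        state, reward, done = env.step(orders)  # simulator responds immediately
        agent.learn(state, reward)              # analysis results feed back in
```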
The data in quantitative investment are in time-series format and are updated over time, so the size of the in-sample dataset keeps increasing. A typical practice to leverage the new data is to update the models regularly [Wang et al., 2019b]. Besides better utilization of the increasing in-sample data, dynamically updating models [Yang et al., 2019] and trading strategies [Wang et al., 2019a] will improve performance further due to the dynamic nature of the stock market [Adam et al., 2016]. Therefore, it is obviously not optimal to use a fixed set of static models and trading strategies, as in the Static Workflow. Dynamic updating of models and strategies is an important research direction in quantitative investment, and the modules in the Dynamic Modeling part provide interfaces and infrastructure to accommodate such solutions.
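A minimal sketch of the "update models regularly" practice, assuming a generic model factory with fit/predict methods and a pandas feature table indexed by (instrument, datetime); the retraining frequency and window choice are illustrative, not Qlib defaults.

```python
import pandas as pd

def rolling_update(make_model, features: pd.DataFrame, labels: pd.Series,
                   retrain_every: int = 20) -> pd.Series:
    """Refit a fresh model on an expanding in-sample window every
    `retrain_every` trading days and predict the following period."""
    dates = features.index.get_level_values("datetime").unique().sort_values()
    preds = []
    for start in range(retrain_every, len(dates), retrain_every):
        in_dates = dates[:start]                        # all data seen so far
        out_dates = dates[start:start + retrain_every]  # next period to predict
        in_mask = features.index.get_level_values("datetime").isin(in_dates)
        out_mask = features.index.get_level_values("datetime").isin(out_dates)
        model = make_model()                            # fresh model each refit
        model.fit(features[in_mask], labels[in_mask])
        preds.append(pd.Series(model.predict(features[out_mask]),
                               index=features[out_mask].index))
    return pd.concat(preds) if preds else pd.Series(dtype=float)
```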
3.3 High Performance Infrastructure
Financial data
We'll summarize the data requirements of quantitative research in this section. In quantitative research, the most frequently used data follow the format

\[
BasicData_T = \{x_{i,t,a}\}, \quad i \in Inst,\ t \in Time,\ a \in Attr
\]
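As an illustration of this layout (not Qlib's internal storage format), the same $BasicData_T$ can be held in a pandas DataFrame indexed by (instrument, timestamp) with one column per attribute; all values below are made up.

```python
import pandas as pd

# Toy BasicData_T: x[i, t, a] indexed by instrument i and timestamp t,
# with one column per attribute a (values fabricated for illustration).
index = pd.MultiIndex.from_product(
    [["MSFT", "GOOGL"], pd.to_datetime(["2020-09-21", "2020-09-22"])],
    names=["instrument", "datetime"])
basic_data = pd.DataFrame(
    {"open":  [202.0, 203.1, 1430.0, 1436.5],
     "close": [202.5, 207.4, 1431.2, 1465.0]},
    index=index)

# x_{MSFT, 2020-09-22, close}
print(basic_data.loc[("MSFT", pd.Timestamp("2020-09-22")), "close"])
```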
where $x_{i,t,a}$ is a value of a basic type (e.g., float, int), $Inst$ denotes the set of financial instruments (e.g., stocks, options), $Time$ denotes the set of timestamps (e.g., the trading days of the stock market), $Attr$ denotes the set of possible attributes of an instrument (e.g., open price, volume, market value), and $T$ denotes the latest timestamp of the data (e.g., the latest trading date). $x_{i,t,a}$ denotes the value of attribute $a$ of instrument $i$ at time $t$.
Besides, instrument pools are necessary information to specify a set of financial instruments which change over time […]

[…] new data is necessary. The formalized update operation is

\[
\begin{aligned}
BasicData_T &= OldBasicData_T \cup \{x^{new}_{i,t,a}\} \\
BasicData_{T+1} &= BasicData_T \cup \{x_{i,T+1,a}\} \\
Pool_{T+1} &= Pool_T \cup \{pool_{T+1}\}
\end{aligned}
\]
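The three update cases map directly onto ordinary DataFrame operations; the sketch below, using the same toy layout and made-up values as before, is only meant to mirror the set unions above.

```python
import pandas as pd

# Toy frame with the (instrument, datetime) x attribute layout used above.
idx = pd.MultiIndex.from_product(
    [["MSFT", "GOOGL"], [pd.Timestamp("2020-09-22")]],
    names=["instrument", "datetime"])
basic_data = pd.DataFrame({"close": [207.4, 1465.0]}, index=idx)

# OldBasicData_T -> BasicData_T: revise a value at an existing timestamp.
basic_data.loc[("MSFT", pd.Timestamp("2020-09-22")), "close"] = 207.42

# BasicData_T -> BasicData_{T+1}: append the rows of the new timestamp T+1.
new_rows = pd.DataFrame(
    {"close": [205.9, 1459.1]},
    index=pd.MultiIndex.from_product(
        [["MSFT", "GOOGL"], [pd.Timestamp("2020-09-23")]],
        names=["instrument", "datetime"]))
basic_data = pd.concat([basic_data, new_rows]).sort_index()

# Pool_T -> Pool_{T+1}: record the (possibly changed) pool for the new date.
pool = {pd.Timestamp("2020-09-22"): {"MSFT", "GOOGL"},
        pd.Timestamp("2020-09-23"): {"MSFT", "GOOGL", "AMZN"}}
```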
User queries can be formalized as

\[
\begin{aligned}
DataQuery = \{x_{i,t,a} \mid\ & i \in pool_t,\ pool_t \in Pool_{query}, \\
& a \in Attr_{query},\ time_{start} \le t \le time_{end}\}
\end{aligned}
\]

which represents a query for some attributes of the instruments in a specific pool over a specific time range.
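Against the same toy layout, this query is just a boolean filter over the index plus a column selection; the pool dictionary convention follows the sketches above and is purely illustrative.

```python
import pandas as pd

def data_query(basic_data: pd.DataFrame, pool: dict, attrs: list,
               time_start, time_end) -> pd.DataFrame:
    """Illustrative DataQuery: keep x_{i,t,a} where instrument i belongs to
    pool_t, a is a requested attribute, and time_start <= t <= time_end."""
    t0, t1 = pd.Timestamp(time_start), pd.Timestamp(time_end)
    inst = basic_data.index.get_level_values("instrument")
    ts = basic_data.index.get_level_values("datetime")
    mask = [(t0 <= t <= t1) and (i in pool.get(t, set()))
            for i, t in zip(inst, ts)]
    return basic_data.loc[mask, attrs]
```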
Such requirements are quite simple, and many off-the-shelf open-source solutions support such operations. We classify them into three categories and list popular implementations in each category.
• General-purpose database: MySQL [MySQL, 2001], MongoDB [Chodorow, 2013]
• Time-series database: InfluxDB [Naqvi et al., 2017]
• Data file for scientific computing: data organized as numpy [Oliphant, 2006] arrays or pandas [McKinney, 2011] dataframes
The general-purpose database supports data with diverse formats and structures. Besides, it provides lots of sophisticated mechanisms, such as indexing, transactions, and the entity-relationship model. Most of them add heavy dependencies and unnecessary complexity to a specific task rather than solving the key problems in a specific scenario. The time-series database optimizes its data structures and queries for time-series data. But neither kind is designed for quantitative research, where the data are usually kept in a compact array-based format for scientific computation to take advantage of hardware acceleration. It will save a great amount of time if the data keep the compact array-based format from the disk to the end client without format transformation. However, both general-purpose and time-series databases store and transfer the data in a different, general-purpose format, which is inefficient for scientific computation.
Due to the inefficiency of databases, array-based data have gained popularity in the scientific community. Numpy arrays and pandas dataframes are the mainstream implementations in scientific computation, and they are often stored as HDF5 or pickle files on the disk. Data in such formats have light dependencies and are very efficient for scientific computing. However, such data are stored in a single file and are hard to update or query.
After an investigation of the above storage solutions, we find that none fits the quantitative research scenario very well. It is necessary to design a customized solution for quantitative research.
[Figure 2: The description of the flat-file database; the left part is the structure of the files (calendar.txt with the shared timeline, instrument pool files such as sp500.txt listing each instrument's start timestamp, and per-attribute fixed-width binary files such as open.bin and close.bin), and the right part is the content of the files.]

[Figure 3: The disk cache system of Qlib — original binary data, an expression cache for saving expression computation time (e.g., high/close, open/close), and a dataset cache for saving data combination time, together with cache updates.]
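The fixed-width layout sketched in Figure 2 can be read with plain numpy. The file names, the float32 element width, and the convention that an instrument's series starts at the index of its first timestamp on the shared calendar are assumptions for illustration, not the exact on-disk format.

```python
import numpy as np

def load_attribute(bin_path: str, calendar_length: int,
                   start_index: int) -> np.ndarray:
    """Read one instrument's fixed-width binary attribute file (e.g. open.bin)
    and align it to the shared calendar (illustrative of Figure 2's layout)."""
    values = np.fromfile(bin_path, dtype=np.float32)       # compact array on disk
    series = np.full(calendar_length, np.nan, dtype=np.float32)
    end = min(start_index + len(values), calendar_length)
    series[start_index:end] = values[:end - start_index]
    return series  # position i corresponds to the i-th timestamp in calendar.txt
```

Keeping each attribute in its own compact binary file means the data reach scientific-computing clients without format transformation, which is the property the text above argues general-purpose and time-series databases lack.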