
Improving Our Database Service

HangukQuant *1, *2

October 9, 2022


Abstract

This paper builds upon the previous paper [1] to reduce the latency of our database reads and writes.
We also improve the code structure and functionality, and employ tips and tricks to improve
our database service by reducing idle time. We use a single-process, single-threaded concurrency
model, making no assumptions about the CPU resources available to the developer, and achieve
superior results to our previous discussions, which utilized multiprocessing. Using the previous
benchmark (Downloading & Populating Full History for 300 tickers for Unpopulated DB with
Read and Write Capacity), our best result in the previous paper was 3 minutes. We improve
this to 1 minute.

*1: [email protected], hangukquant.substack.com
*2: DISCLAIMER: the contents of this work are not intended as investment, legal, tax or any other advice, and
are for informational purposes only. It is illegal to make unauthorized copies, forward to an unauthorized user or to
post this article electronically without express written consent by HangukQuant.

Contents

1 Introduction
2 Let’s Expand Our Functionalities
3 Let’s Decouple Our Datasource
4 Let’s Decouple Our Timeseries Logic
5 Generic Wrapper
6 Breaking Down the Code
7 Finishing Up with Our Database Service Class
  7.1 Time Aware Datetimes
  7.2 Faster Schema Unrolls
  7.3 Optional Multi-Processing
  7.4 Removing Safety Checks
8 Conclusion
A db_service.py
1 Introduction

In the previous paper [1], we presented code with the following results:

AsyncIO   Sessioning   Batching   Multi-Core   Timing (M/S)
N         N            N          N            14.5M
Y         N            N          N            9.5M
Y         Y            Y          N            9M
Y         Y            Y          Y            3M

Table 1: Results of Our Previous Paper: Downloading & Populating Full History for
300 tickers for Unpopulated DB with Read and Write Capacity

However, this still left much to be desired. Upon code profiling (see Paper:
Flirting with CPUs [2] on code profiling), there were two expensive operations that primarily
slowed us down. Firstly, unrolling the dataframe received and pressing it into the desired schema
of dictionaries was a costly operation, and significantly slowed us down. Unfortunately, this is
an irreducible cost: there is not much we can do about the time taken to create database entries
that fit our schema, and the only way we dealt with this was to use multiple CPUs for the
task. However, this improvement was merely a bandage around the core issue of poor utilization
of compute. Secondly, when submitting large batches, we attempted to get all of the data from
the API at once, and this caused the server to ask us to resubmit our requests after some time.
However, since the current ‘batch’ was not yet fully retrieved, we did no work in the
meantime while waiting to poll data. This inefficiency means that we can get an improvement by
alternating between the two tasks: unrolling mini-batches of dataframes and waiting for
API credits.

This paper explores a clean code structure that does the task with the asyncio event loop and
essentially puts our CPU to work at 100%. We see that even without multiple processes,
we can shorten the run time down to 1M for the same benchmark. We will also improve the
functionality of our code so that we can employ different data polling engines, such as different
data libraries, without changing the higher-level interface of our data service.

As usual, the same caveat applies. You should not expect these papers to be error-free, or to emulate
perfection in any sense of the word. Books are often written over a few years, with consultations
and revisions by domain experts, and then mulled over and re-written. This instead is a product
of weekly musings, with significantly less mulling. There will inevitably be mistakes, misnomers
and shortfalls: please email me at [email protected] if you find any. All mistakes are my own.

2 Let’s Expand Our Functionalities

The power of having our own database, as opposed to sticking to polling data from an API on
demand, is that we can design our own custom requests, so long as our operations support it. We
will kill support for the synchronous functionalities and focus on the asynchronous functionalities
provided, since there is not much reason for us to maintain them. Using coroutines gives us the
advantage of working with lightweight objects: threading takes far fewer resources than
multiprocessing due to sharing of the same memory space, and coroutines carry even less
overhead than threads, since we do not need to deal with context switching and so on.
For demonstrative purposes, we will work with the most common functionality, which is to retrieve
timeseries data.

Recall the asyn_batch_get_ohlcv functionality in our equities service class. It has an interface
something like this:

async def asyn_batch_get_ohlcv(
    self, tickers, exchanges, read_db, insert_db, granularity, period_end,
    period_start=None, duration=None
):

Listing 1: equities.py

We were required to provide the end time of the timeseries, and then either the period start or
a duration in days to get the window of data that we want. Let’s modify it to become more flexible,
so that we can have more general behavior. For instance, specifying start and duration will give
us [start, start + duration], specifying nothing gives us all the data in our database, and so on. The
logic will be as follows:

from datetime import datetime
from dateutil.relativedelta import relativedelta

class InvalidPeriodConfig(Exception):
    pass

def get_mongo_datetime_filter(period_start, period_end):
    if not period_start and not period_end:
        return {}
    if not period_start and period_end:
        return {"datetime": {"$lte": period_end}}
    if period_start and not period_end:
        return {"datetime": {"$gte": period_start}}
    if period_start and period_end:
        return {"datetime": {"$gte": period_start, "$lte": period_end}}

def map_dfreq_to_frequency(dfreq):
    """
    frequency :: mongodb granularity
    granularity :: internal granularity for timeseries data
    dfreq :: frequency naming for timeseries connections - maps 1:1 to granularity
    >> tick, seconds, minutes, hours, days, Months, Years
    """
    assert (dfreq in ["t", "s", "m", "h", "d", "M", "Y"])
    if dfreq == "t" or dfreq == "s":
        return "seconds"
    if dfreq == "m":
        return "minutes"
    if dfreq == "h" or dfreq == "d" or dfreq == "M" or dfreq == "Y":
        return "hours"

def map_granularity_to_relativedelta(granularity, duration):
    if granularity == "s":
        deltatime = relativedelta(seconds=duration)
    if granularity == "m":
        deltatime = relativedelta(minutes=duration)
    if granularity == "h":
        deltatime = relativedelta(hours=duration)
    if granularity == "d":
        deltatime = relativedelta(days=duration)
    if granularity == "M":
        deltatime = relativedelta(months=duration)
    if granularity == "Y":  # years use the capital "Y" in our granularity nomenclature
        deltatime = relativedelta(years=duration)
    assert (deltatime)
    return deltatime

def get_span(granularity, period_start=None, period_end=None, duration=None):
    assert (not duration or duration > 0)
    assert (not (period_start and period_end) or period_end > period_start)
    deltatime = None
    if duration:
        deltatime = map_granularity_to_relativedelta(granularity, duration)
    if not period_start and not period_end and not duration:
        return None, None
    if not period_start and not period_end and duration:
        raise InvalidPeriodConfig("cannot map period specifications to time window")
    if not period_start and period_end and not duration:
        return None, period_end
    if not period_start and period_end and duration:
        return period_end - deltatime, period_end
    if period_start and not period_end and not duration:
        return period_start, None
    if period_start and not period_end and duration:
        return period_start, period_start + deltatime
    if period_start and period_end and not duration:
        return period_start, period_end
    if period_start and period_end and duration:
        raise InvalidPeriodConfig("cannot map period specifications to time window")

Listing 2: utils/datetime_utils.py

A None value represents an unbounded request, which means we just want to read the database
and get all the relevant data. We define an internal nomenclature for granularity, which describes the
granularity of the timeseries under consideration. Here, we support

"t", "s", "m", "h", "d", "M", "Y"

corresponding to tick, second, minute, hour, day, month and year data. This utility function allows
us to ‘decode’ the amount of data that the caller of the equity service class wants. We shall then
map our internal granularities to the nomenclature accepted by an external API, as well as to the
timeseries frequency that our MongoDB API recognises.
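
To make the mapping concrete, here is a minimal usage sketch of get_span (assuming only the module path shown in the listing caption):

from datetime import datetime
import pytz

from data_service.utils.datetime_utils import get_span, InvalidPeriodConfig

end = datetime(2022, 10, 1, tzinfo=pytz.utc)

# end + duration -> the window [end - 30 days, end]
print(get_span("d", period_end=end, duration=30))

# nothing specified -> (None, None), an unbounded read of the database
print(get_span("d"))

# duration alone (or start + end + duration) is ambiguous and raises InvalidPeriodConfig
try:
    get_span("d", duration=30)
except InvalidPeriodConfig as e:
    print(e)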

3 Let’s Decouple Our Datasource

We also want to decouple the actual equity service class from the data provider. Previously, we
made calls to the eodhistoricaldata API directly inside the equity service class. But what if we
want to include multiple different data providers, such as Yahoo Finance, the MetaTrader terminal,
and Interactive Brokers?
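
The idea is that provider-specific clients live outside the service class and are handed to it on construction. A minimal sketch of that wiring follows; the import paths are assumptions based on the listing captions, and the MetaTrader client object is only a placeholder:

from data_service.db.db_service import DbService   # path assumed from Appendix A
from data_service.equities import Equities          # path assumed from Listing 3's caption

# the "meta_client" key matches what asyn_batch_get_ohlcv looks up for the DWX-MT5 engine;
# the client object itself is a placeholder here
meta_client = ...

db_service = DbService()
equities = Equities(
    data_clients={"meta_client": meta_client},
    db_service=db_service,
)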

4 Let’s Decouple Our Timeseries Logic

Does getting an equities data timeseries fundamentally differ from getting tick data or macroeconomic
GDP data? Not really. The general structure is that we want to first get the data from the
timeseries database, find which periods we are missing data for, make those requests using an
API if desired, and return the result to the caller. The function call to get OHLCV should be no
different from one to get a record of rainfall volume.

Our new equities service class and get OHLCV function will look something like the following:

from data_service.wrappers import eod_wrapper
from data_service.wrappers.generic_wrapper import asyn_batch_get_timeseries

class Equities():

    dtype = "equities"

    def __init__(self, data_clients={}, db_service=None):
        self.data_clients = data_clients
        self.db_service = db_service

    async def asyn_get_ohlcv(
        self, ticker, exchange, read_db, insert_db, granularity, engine,
        period_start=None, period_end=None, duration=None
    ):
        # the singular request is just a batch request with a single element
        return (await self.asyn_batch_get_ohlcv(
            tickers=[ticker],
            exchanges=[exchange],
            read_db=read_db,
            insert_db=insert_db,
            granularity=granularity,
            engine=engine,
            period_start=period_start,
            period_end=period_end,
            duration=duration
        ))[0]

    async def asyn_batch_get_ohlcv(
        self, tickers, exchanges, read_db, insert_db, granularity, engine,
        period_start=None, period_end=None, duration=None, chunksize=100
    ):
        """
        gets data from database or relevant datasource
        specify period_start, period_end, duration :: see logic in datetime_utils.py
        if period specifications are unbounded, such as (-, -) | (start, -) | (-, end) then
        data is only returned from the database when read_db is enabled, else return empty results
        """
        assert (engine in ["eodhistoricaldata", "DWX-MT5"])

        # pick the datapoller object and contract type for the requested engine
        if engine == "eodhistoricaldata":
            datapoller = eod_wrapper
            dformat = "spot"
        elif engine == "DWX-MT5":
            datapoller = await (
                self.data_clients["meta_client"].get_metaapi_data_manager()
            )
            dformat = "CFD"

        return await asyn_batch_get_timeseries(
            tickers=tickers,
            exchanges=exchanges,
            read_db=read_db,
            insert_db=insert_db,
            granularity=granularity,
            db_service=self.db_service,
            datapoller=datapoller,
            engine=engine,
            dtype=Equities.dtype,
            dformat=dformat,
            period_start=period_start,
            period_end=period_end,
            duration=duration,
            chunksize=chunksize,
            results=[],
            tasks=[],
            batch_id=0
        )

Listing 3: equities.py

This is a much cleaner structure! For each datasource we want to add to our service class,
all we have to do is specify the engine, provide a datapoller object and specify the contract type,
be it ‘spot’, ‘CFD’, ‘futures’ and so on.

5 Generic Wrapper

So what does the generic wrapper look like? It implements the asyn_batch_get_timeseries logic,
and is only aware that it is getting some timeseries data. Recall that we want to perform the
data schema unrolling and writing in between API requests. The former inevitably takes CPU
compute, while the API requests need to be spaced out to prevent overloading the API server and
earning us a TooManyRequests status code.

We shall present the function - this is the hardest part of the paper, and we will go through
each section in more detail. I promise you this will be a headache, but pain is good. First, the
whole function:

import asyncio
import pandas as pd

from data_service.utils.datetime_utils import get_span
from collections import defaultdict

async def asyn_batch_get_timeseries(
    tickers, exchanges, read_db, insert_db, granularity,
    db_service, datapoller, engine, dtype, dformat,
    period_start=None, period_end=None, duration=None,
    chunksize=100, results=[], tasks=[], batch_id=0
):
    print(f"START BATCH {batch_id}")
    assert (engine in ["eodhistoricaldata", "DWX-MT5"])
    # base case: no tickers left, wait for all pending database writes and return
    if not tickers:
        print("gathering results")
        await asyncio.gather(*tasks)
        return results

    temp_tickers = list(tickers)
    temp_exchanges = list(exchanges)
    tickers = tickers[:chunksize]
    exchanges = exchanges[:chunksize]

    series_metadatas = None

    if engine == "eodhistoricaldata":
        series_metadatas = [
            {
                "ticker": tickers[i],
                "source": f"eodhistoricaldata-{exchanges[i]}"
            }
            for i in range(len(tickers))
        ]
    if engine == "DWX-MT5":
        series_metadatas = [
            {
                "ticker": tickers[i],
                "source": engine
            }
            for i in range(len(tickers))
        ]

    assert (series_metadatas)
    series_identifiers = [
        {**{"type": "ticker_series"}, **series_metadata}
        for series_metadata in series_metadatas
    ]

    period_start, period_end = get_span(
        period_start=period_start,
        period_end=period_end,
        duration=duration,
        granularity=granularity
    )
    # when database reads are disabled, treat all requested data as missing
    result_ranges = [(None, None) for _ in range(len(tickers))]
    result_dfs = [pd.DataFrame() for _ in range(len(tickers))]
    if read_db:
        result_ranges, result_dfs = await db_service.asyn_batch_read_timeseries(
            dtype=dtype,
            dformat=dformat,
            dfreq=granularity,
            period_start=period_start,
            period_end=period_end,
            series_metadatas=series_metadatas,
            series_identifiers=series_identifiers,
            metalogs=batch_id
        )
    if not period_start or not period_end:
        return result_dfs if result_dfs else [pd.DataFrame() for _ in range(len(tickers))]

    # build up the API requests for the data the database does not already hold
    requests = defaultdict(list)

    for i in range(len(tickers)):
        result_start, result_end = result_ranges[i]
        if not result_start and not result_end:
            # no data, request all
            requests[tickers[i]].append(
                {
                    "period_start": period_start,
                    "period_end": period_end,
                    "exchange": exchanges[i]
                }
            )
            continue
        assert (result_start and result_end)

        if period_start < result_start:
            requests[tickers[i]].append(
                {
                    "period_start": period_start,
                    "period_end": result_start,
                    "exchange": exchanges[i]
                }
            )
        if period_end > result_end:
            requests[tickers[i]].append(
                {
                    "period_start": result_end,
                    "period_end": period_end,
                    "exchange": exchanges[i]
                }
            )

    request_tickers = []
    request_exchanges = []
    request_starts, request_ends = [], []
    request_metadatas, request_identifiers = [], []
    for i in range(len(tickers)):
        ticker = tickers[i]
        v = requests[ticker]
        request_tickers.extend([ticker] * len(v))
        request_exchanges.extend([spec["exchange"] for spec in v])
        request_starts.extend([spec["period_start"] for spec in v])
        request_ends.extend([spec["period_end"] for spec in v])
        request_metadatas.extend([series_metadatas[i]] * len(v))
        request_identifiers.extend([series_identifiers[i]] * len(v))

    request_results = await datapoller.asyn_batch_get_ohlcv(
        tickers=request_tickers,
        exchanges=request_exchanges,
        period_starts=request_starts,
        period_ends=request_ends,
        granularity=granularity
    )
    # stitch the requested head, the database result and the requested tail together
    j = 0
    ohlcvs = []
    for i in range(len(tickers)):
        result_start, result_end = result_ranges[i]
        db_df = result_dfs[i]
        if not result_start and not result_end:
            ohlcvs.append(request_results[j])
            j += 1
            continue
        head_df, tail_df = pd.DataFrame(), pd.DataFrame()
        if period_start < result_start:
            head_df = request_results[j]
            j += 1
        if period_end > result_end:
            tail_df = request_results[j]
            j += 1
        concat_dfs = [head_df, db_df, tail_df]
        df = pd \
            .concat(concat_dfs, axis=0) \
            .drop_duplicates("datetime") \
            .reset_index(drop=True)
        ohlcvs.append(df)
    assert (j == len(request_results))

    if insert_db:
        insert_tickers = [
            tickers[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_ohlcvs = [
            ohlcvs[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_series_metadatas = [
            series_metadatas[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_series_identifiers = [
            series_identifiers[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]

        # schedule the costly schema unroll + write, but do not block on it
        task = asyncio.create_task(
            db_service.asyn_batch_insert_timeseries_df(
                dtype=dtype,
                dformat=dformat,
                dfreq=granularity,
                dfs=insert_ohlcvs,
                series_identifiers=insert_series_identifiers,
                series_metadatas=insert_series_metadatas,
                metalogs=batch_id
            )
        )
        tasks.append(task)
        await asyncio.sleep(0)
    results.extend(ohlcvs)

    # recurse on the remaining tickers, carrying the accumulated results and pending tasks
    return await asyn_batch_get_timeseries(
        temp_tickers[chunksize:], temp_exchanges[chunksize:], read_db, insert_db,
        granularity, db_service, datapoller, engine, dtype, dformat, period_start,
        period_end, duration, chunksize, results, tasks, batch_id + 1
    )

Listing 4: wrappers/generic_wrapper.py

6 Breaking Down the Code

The asyn_batch_get_timeseries function processes the batch in chunks of size chunksize. Upon entry, it takes
the first chunksize tickers and exchanges (these variables might be badly named, since we are
handling general timeseries data!) and processes them. Then, if a database write is required (which
involves the computationally costly schema creation), we spawn a task to put on the asyncio event loop.
However, we do not wait for the database write to return; instead we call the next mini-batch with
a recursive call, excluding the first chunksize items of work. The results parameter gets built up down the
call stack, and the database write tasks put on the event loop are collected in the tasks parameter.
These tasks sit on the event loop, and whenever our datapoller is waiting on API requests over
the network, the pending tasks opportunistically grab the CPU to create our data
schema and perform database writes. Eventually, as the work remaining decreases down the call
stack, we reach our last mini-batch, of size less than or equal to chunksize. We then run
out of tickers to process, and this is our base case:
if not tickers:
    print("gathering results")
    await asyncio.gather(*tasks)
    return results

which waits for all the previous database write tasks to complete. The results that have been
built up are then returned back up the call stack to the caller. The code section that follows creates the
supporting schema based on the data source of choice, so that we can see where our populated
database got its results from.

Next, we have

# when database reads are disabled, treat all requested data as missing
result_ranges = [(None, None) for _ in range(len(tickers))]
result_dfs = [pd.DataFrame() for _ in range(len(tickers))]
if read_db:
    result_ranges, result_dfs = await db_service.asyn_batch_read_timeseries(
        dtype=dtype,
        dformat=dformat,
        dfreq=granularity,
        period_start=period_start,
        period_end=period_end,
        series_metadatas=series_metadatas,
        series_identifiers=series_identifiers,
        metalogs=batch_id
    )
if not period_start or not period_end:
    return result_dfs if result_dfs else [pd.DataFrame() for _ in range(len(tickers))]

This essentially reads from the database if reads are enabled. The return values include result_ranges,
where each entry is a pair containing the start and end of the data read. If the data request is unbounded,
we return the retrieved dataframes to the caller without seeking any data from the external
API. This is the behavior we discussed for get_span in the previous section! We can also get (None,
None) as the range, which indicates the wanted data was not found in the database.

Next, depending on which of the wanted data we do not yet have, we build up the request parameters
that we need. This is done by the following code section:

requests = defaultdict(list)

for i in range(len(tickers)):
    result_start, result_end = result_ranges[i]
    if not result_start and not result_end:
        # no data, request all
        requests[tickers[i]].append(
            {
                "period_start": period_start,
                "period_end": period_end,
                "exchange": exchanges[i]
            }
        )
        continue
    assert (result_start and result_end)

    if period_start < result_start:
        requests[tickers[i]].append(
            {
                "period_start": period_start,
                "period_end": result_start,
                "exchange": exchanges[i]
            }
        )
    if period_end > result_end:
        requests[tickers[i]].append(
            {
                "period_start": result_end,
                "period_end": period_end,
                "exchange": exchanges[i]
            }
        )

request_tickers = []
request_exchanges = []
request_starts, request_ends = [], []
request_metadatas, request_identifiers = [], []
for i in range(len(tickers)):
    ticker = tickers[i]
    v = requests[ticker]
    request_tickers.extend([ticker] * len(v))
    request_exchanges.extend([spec["exchange"] for spec in v])
    request_starts.extend([spec["period_start"] for spec in v])
    request_ends.extend([spec["period_end"] for spec in v])
    request_metadatas.extend([series_metadatas[i]] * len(v))
    request_identifiers.extend([series_identifiers[i]] * len(v))

Next, our datapoller uses the built-up requests to ask for the data required. We do not know
anything about the nature of the datapoller, except that it implements the method asyn_batch_get_ohlcv.
This datapoller could be a Yahoo Finance API wrapper, some Quandl wrapper, and so on.

request_results = await datapoller.asyn_batch_get_ohlcv(
    tickers=request_tickers,
    exchanges=request_exchanges,
    period_starts=request_starts,
    period_ends=request_ends,
    granularity=granularity
)

The remainder

j = 0
ohlcvs = []
for i in range(len(tickers)):
    result_start, result_end = result_ranges[i]
    db_df = result_dfs[i]
    if not result_start and not result_end:
        ohlcvs.append(request_results[j])
        j += 1
        continue
    head_df, tail_df = pd.DataFrame(), pd.DataFrame()
    if period_start < result_start:
        head_df = request_results[j]
        j += 1
    if period_end > result_end:
        tail_df = request_results[j]
        j += 1
    concat_dfs = [head_df, db_df, tail_df]
    df = pd.concat(concat_dfs, axis=0).drop_duplicates("datetime").reset_index(drop=True)
    ohlcvs.append(df)

‘stitches’ together the relevant timeseries using the database result, the requested head, and
the requested tail, to get all the available data from period_start to period_end.

Next, if we are required to make inserts, we do

task = asyncio.create_task(
    db_service.asyn_batch_insert_timeseries_df(
        dtype=dtype,
        dformat=dformat,
        dfreq=granularity,
        dfs=insert_ohlcvs,
        series_identifiers=insert_series_identifiers,
        series_metadatas=insert_series_metadatas,
        metalogs=batch_id
    )
)

This takes the coroutine asyn_batch_insert_timeseries_df from our database service object and wraps
it in a task. Awaiting the coroutine directly here would prevent us from moving on to the
next batch, since the caller blocks on the await. Instead, we wrap it in a task and schedule
it on the event loop. However, note that task creation does not trigger the event loop. We need to
defer to the event loop with await asyncio.sleep(0), which triggers an iteration of the event loop,
and the task is now waiting to run. The pending tasks are built up with tasks.append(task) and
results are built up with results.extend(ohlcvs), which are passed on to the next recursive call in

return await asyn_batch_get_timeseries(
    temp_tickers[chunksize:], temp_exchanges[chunksize:], read_db, insert_db,
    granularity, db_service, datapoller, engine, dtype, dformat, period_start,
    period_end, duration, chunksize, results, tasks, batch_id + 1
)

For those who are not familiar with recursion in programming, this might not make any sense.
Take the red pill and catch up on it.
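
To see the create_task plus await asyncio.sleep(0) trick in isolation, here is a minimal, self-contained sketch (not part of the paper's codebase; slow_write simply stands in for the database insert coroutine):

import asyncio

async def slow_write(batch_id):
    # stands in for asyn_batch_insert_timeseries_df: some work, then an I/O wait
    print(f"write {batch_id} started")
    await asyncio.sleep(0.1)
    print(f"write {batch_id} finished")

async def main():
    tasks = []
    for batch_id in range(3):
        # schedule the write but do not block on it
        tasks.append(asyncio.create_task(slow_write(batch_id)))
        # yield once so the event loop can start the pending task
        await asyncio.sleep(0)
        print(f"moving on past batch {batch_id}")
    # base case equivalent: wait for every scheduled write before returning
    await asyncio.gather(*tasks)

asyncio.run(main())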

While that was an uncomfortably long function to comprehend, the nice thing is that it
should be pretty invariant to future changes! We should be able to add new datapollers
with minimal code changes, as long as each datapoller implements the ‘contract’
asyn_batch_get_ohlcv. An even cleaner structure would be to enforce this contract with an
abstract base class for a Datapoller interface (sketched below), which we will leave for another day. Other service
classes, such as the commodities service class and the fx service class, should use the same generic wrapper,
making each service class implementation modularised and unconcerned with how the timeseries
logic is implemented internally.
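
For concreteness, such an interface might look like the following minimal sketch (this class does not exist in the paper's code; it only names the contract the generic wrapper relies on):

from abc import ABC, abstractmethod

class Datapoller(ABC):
    """Illustrative contract for anything the generic wrapper can poll data from."""

    @abstractmethod
    async def asyn_batch_get_ohlcv(
        self, tickers, exchanges, period_starts, period_ends, granularity
    ):
        """Return one dataframe per requested (ticker, exchange, start, end) tuple, in order."""
        ...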

Now, we just need to know how the datapollers work, and review the read-write functionality in
our database service object. There are some improvements we want to make, in terms of code
structure and database integrity. The first is that we want to remove redundant code. Recall that
we implemented two separate functionalities, one for getting data in batch (which is more efficient due
to the use of session-ing network requests), and one for getting a single result. The singular request is just a
special case of the batch request, where the batch is a singleton. We can hence focus all our
attention on the batch functions and let the singleton request call the batch request.

Secondly, we want to resolve the issue of differing timezones in the data received. Unfortunately,
when retrieving our submitted data from the Mongo database, the timezone-awareness
properties are lost. Hence, we want to ensure that our database service only accepts datetime
objects in a timezone-aware state, and immediately does the conversion on retrieval. Passing in a
timezone-naive timeseries dataframe will fail, and we maintain the integrity
of our data in this way.

import os
import json
import pytz
import aiohttp
import requests
import calendar
import numpy as np
import pandas as pd
import urllib
import urllib.parse

from datetime import datetime
from datetime import timedelta

import data_service.wrappers.aiohttp_wrapper as aiohttp_wrapper

async def asyn_get_ohlcv(ticker, exchange, period_start, period_end, granularity):
    return (await asyn_batch_get_ohlcv(
        tickers=[ticker],
        exchanges=[exchange],
        period_starts=[period_start],
        period_ends=[period_end],
        granularity=granularity
    ))[0]

async def asyn_batch_get_ohlcv(tickers, exchanges, period_starts, period_ends, granularity):
    """
    Returns dataframes for tickers, exchanges in order, from start to end, with column datetime in utc
    Returns an empty dataframe if no result between period_start and period_end
    """
    assert (granularity == "d" or granularity == "w")
    urls = []
    results = []
    for ticker, exchange, period_start, period_end in zip(tickers, exchanges, period_starts, period_ends):
        params = {
            "api_token": os.getenv("EOD_KEY"),
            "fmt": "json",
            "from": period_start.strftime("%Y-%m-%d"),
            "to": period_end.strftime("%Y-%m-%d"),
            "period": granularity
        }
        url = f"https://eodhistoricaldata.com/api/eod/{ticker}.{exchange}?"
        urls.append(url + urllib.parse.urlencode(params))
    results = await aiohttp_wrapper.async_aiohttp_get_all(urls)
    dfs = []
    for result in results:
        if result == {}:
            dfs.append(pd.DataFrame())
            continue
        df = pd.DataFrame(result).rename(
            columns={"date": "datetime", "adjusted_close": "adj_close"}
        )
        if len(df) > 0:
            df["datetime"] = pd.to_datetime(df["datetime"]).dt.tz_localize(pytz.utc)
        dfs.append(df)
    return dfs

Listing 5: wrappers/eod_wrapper.py

import aiohttp
import asyncio
import data_service.db_logs as db_logs

async def async_aiohttp_get_all(urls, fmt="json", max_tries=10, sleep_seconds=3):
    async with aiohttp.ClientSession() as session:
        async def fetch(url, tries=0):
            async with session.get(url) as response:
                db_logs.DBLogs().debug("try {} {}".format(url, tries))
                assert (fmt == "json")  # only json format accepted for now

                if response.status == 200:
                    return await response.json()
                elif response.status == 429:
                    if tries < max_tries:
                        print(f"trying {url} - {tries + 1} / {max_tries}, sleep {sleep_seconds}s")
                        await asyncio.sleep(sleep_seconds)
                        return await fetch(url, tries + 1)
                    return {}
                else:
                    print(response.status)
                    print(await response.text())
                    exit()

        return await asyncio.gather(*[fetch(url) for url in urls], return_exceptions=True)

Listing 6: wrappers/aiohttp_wrapper.py

Note that this time, even if we hit the code section

elif response.status == 429:
    if tries < max_tries:
        print(f"trying {url} - {tries + 1} / {max_tries}, sleep {sleep_seconds}s")
        await asyncio.sleep(sleep_seconds)
        return await fetch(url, tries + 1)
    return {}

the sleep is not a big issue, since we are no longer idling: we give up the current coroutine's
execution, but the event loop instead runs the database write tasks which were queued up on the
event loop earlier! The CPU keeps doing useful work and no resources are wasted.

7 Finishing Up with Our Database Service Class

We just need to finish up by writing the code for the timeseries functions in our database service class.
Since the logic is very similar to the previous paper, we just provide the code in Appendix A
and do not elaborate; we will only highlight a few key points. Since it is rather lengthy, we will
provide the code files for generic_wrapper.py and db_service.py for your convenience. Note that
these are not the full code files and only contain the functionalities discussed in this paper!

7.1 Time Aware Datetimes

The cutoff is now made time-aware; see the line

earliest_utc_cutoff = datetime(1970, 1, 31, tzinfo=pytz.utc)

When inserting new data, the datetime column is now compared to this cutoff

dfs = [df.loc[df["datetime"] >= DbService.earliest_utc_cutoff] for df in dfs]

which throws an exception if our datetime column is not timezone-aware. This helps maintain
database integrity.
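
The behavior relied on here is that pandas refuses to compare timezone-naive datetimes with timezone-aware ones. A minimal sketch of that check (illustrative data, not from the paper):

import pandas as pd
import pytz
from datetime import datetime

earliest_utc_cutoff = datetime(1970, 1, 31, tzinfo=pytz.utc)

aware = pd.DataFrame({"datetime": pd.to_datetime(["2022-10-01"]).tz_localize(pytz.utc)})
naive = pd.DataFrame({"datetime": pd.to_datetime(["2022-10-01"])})

print(aware.loc[aware["datetime"] >= earliest_utc_cutoff])  # passes the filter
naive.loc[naive["datetime"] >= earliest_utc_cutoff]         # raises TypeError: naive vs aware comparison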

7.2 Faster Schema Unrolls

@staticmethod
def unroll_df(df, metadata):
records = df.to_dict(orient="records")
for record in records: record.update({"metadata" : metadata})
return records

The previous schema unroll required creating new dictionaries. This is costly! Profiling the
code showed that adding new key-value pairs to an existing dictionary is faster than building
and rehashing a new dictionary for every record.
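
A rough way to convince yourself of this is a micro-benchmark along the following lines (illustrative only; the record shapes and counts are made up, and the absolute numbers will vary by machine):

import timeit

records = [{"open": 1.0, "high": 2.0, "low": 0.5, "close": 1.5} for _ in range(100_000)]
metadata = {"ticker": "AAPL", "source": "eodhistoricaldata-US"}

# old approach: build a brand new dictionary per record
rebuild = lambda: [{**record, "metadata": metadata} for record in records]

# new approach: add one key to the existing dictionaries in place
def update():
    for record in records:
        record.update({"metadata": metadata})

print("rebuild:", timeit.timeit(rebuild, number=10))
print("update :", timeit.timeit(update, number=10))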

7.3 Optional Multi-Processing

@staticmethod
async def pool_unroll_df(dfs, metadatas, use_pool=False):
    if not use_pool:
        return [DbService.unroll_df(df, metadata) for df, metadata
                in zip(dfs, metadatas)]

    with ProcessPoolExecutor() as process_pool:
        loop = asyncio.get_running_loop()
        calls = [
            partial(DbService.unroll_df, df, metadata)
            for df, metadata in zip(dfs, metadatas)
        ]
        results = await asyncio.gather(
            *[loop.run_in_executor(process_pool, call) for call in calls]
        )
        return results

If the dataframes are large enough, it is worth the overhead to do the unrolling in new
processes. We can do this by hooking a process pool up to an executor with the asyncio
module.
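
A sketch of how a caller might choose when to pay that overhead (the row-count threshold here is an illustrative guess, not a figure from the paper):

import pandas as pd

from data_service.db.db_service import DbService  # import path assumed from Appendix A

async def unroll_for_insert(dfs, metadatas):
    # only reach for the process pool when the batch is large enough to amortize its cost
    use_pool = sum(len(df) for df in dfs) > 1_000_000
    return await DbService.pool_unroll_df(dfs, metadatas, use_pool=use_pool)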

7.4 Removing Safety Checks

One of the slowest tasks was performing collection checks.

def _ensure_coll(self, dtype, dformat, dfreq, coll_type, force_check=True):

This is because we made the function synchronous due to the possibility of race conditions
(even in a single-threaded concurrency model! See the previous paper for why this is so). We add
an option to disable the checks, which can be used for projects where the different collection types have
already matured.

The remainder of the logic is fairly similar, with the addition of timezone awareness on data
retrieved from the database.

8 Conclusion

We have managed to work through some examples that required some technical depth and knowledge
of the underlying Python mechanisms. In the previous paper, we applied well known, common
tools to try to make our code performant. These tools could be applied with only a general distinction
between CPU-intensive and I/O tasks. While they helped (and were rather easy to employ), a
deeper understanding of how the different tasks are handled, and where the bottlenecks lie, revealed we were still
idling serious computing power. Using asyncio, we showed how to recover this efficiency. We also
improved our database functionality to be more flexible, adopting more use cases without increasing
complexity. This was achieved with better separation of concerns.

References

[1] HangukQuant. Design and Implementation of a Quant Database. https://hangukquant.substack.com/p/design-and-implementation-of-a-quant.

[2] HangukQuant. Flirting with CPUs - Advanced Backtesting in Python. https://hangukquant.substack.com/p/flirting-with-cpus-advanced-backtesting-c97.

A db_service.py

import os
import sys
import json
import pytz
import motor
import asyncio
import pathlib
import pymongo
import pandas as pd
import motor.motor_asyncio
import data_service.db_logs as db_logs

from functools import partial
from datetime import timedelta
from datetime import datetime
from pymongo import UpdateOne
from concurrent.futures import ProcessPoolExecutor

from data_service.utils.datetime_utils import get_mongo_datetime_filter
from data_service.utils.datetime_utils import map_dfreq_to_frequency

class DbService():

    earliest_utc_cutoff = datetime(1970, 1, 31, tzinfo=pytz.utc)

    def __init__(self, db_config_path=str(pathlib.Path(__file__).parent.resolve()) + "/config.json"):
        with open(db_config_path, "r") as f:
            config = json.load(f)
        os.environ['MONGO_CLUSTER'] = config["mongo_cluster"]
        os.environ['MONGO_DB'] = config["mongo_db"]
        self.mongo_cluster = pymongo.MongoClient(os.getenv("MONGO_CLUSTER"))
        self.mongo_db = self.mongo_cluster[os.getenv("MONGO_DB")]
        self.asyn_mongo_cluster = motor \
            .motor_asyncio \
            .AsyncIOMotorClient(os.getenv("MONGO_CLUSTER"))
        self.asyn_mongo_db = self.asyn_mongo_cluster[os.getenv("MONGO_DB")]

    @staticmethod
    def unroll_df(df, metadata):
        records = df.to_dict(orient="records")
        for record in records: record.update({"metadata": metadata})
        return records

    @staticmethod
    async def pool_unroll_df(dfs, metadatas, use_pool=False):
        if not use_pool:
            return [DbService.unroll_df(df, metadata) for df, metadata
                    in zip(dfs, metadatas)]

        with ProcessPoolExecutor() as process_pool:
            loop = asyncio.get_running_loop()
            calls = [
                partial(DbService.unroll_df, df, metadata)
                for df, metadata in zip(dfs, metadatas)
            ]
            results = await asyncio.gather(
                *[loop.run_in_executor(process_pool, call) for call in calls]
            )
            return results

    @staticmethod
    def match_identifiers_to_docs(identifiers, docs):
        records = []
        for identifier in identifiers:
            matched = []
            for doc in docs:
                if identifier.items() <= doc.items():
                    matched.append(doc)
            records.append(matched)
        return records

    def _get_coll_name(self, dtype, dformat, dfreq):
        return "{}_{}_{}".format(dtype, dformat, dfreq)

    async def _asyn_get_collection(self, dtype, dformat, dfreq):
        return self.asyn_mongo_db[self._get_coll_name(dtype, dformat, dfreq)]

    async def _asyn_get_collection_meta(self, dtype, dformat, dfreq):
        return self.asyn_mongo_db[
            "{}-meta".format(self._get_coll_name(dtype, dformat, dfreq))
        ]

    def _ensure_coll(self, dtype, dformat, dfreq, coll_type, force_check=True):
        assert (coll_type == "timeseries" or coll_type == "regular")
        if not force_check: return True
        names = self.mongo_db.list_collection_names(
            filter={"name": {"$regex": r"^(?!system\.)"}}
        )
        exists = self._get_coll_name(
            dtype=dtype, dformat=dformat, dfreq=dfreq) in names

        if not exists and coll_type == "timeseries":
            frequency = map_dfreq_to_frequency(dfreq=dfreq)
            assert (frequency == "seconds" or
                    frequency == "minutes" or
                    frequency == "hours"
            )

            self.mongo_db.create_collection(
                "{}_{}_{}".format(dtype, dformat, dfreq),
                timeseries={
                    'timeField': 'datetime',
                    'metaField': 'metadata',
                    'granularity': frequency
                },
                check_exists=True
            )
            self.mongo_db.drop_collection("{}_{}_{}-meta".format(dtype, dformat, dfreq))
            self.mongo_db.create_collection(
                "{}_{}_{}-meta".format(dtype, dformat, dfreq)
            )
        if not exists and coll_type == "regular":
            self.mongo_db.create_collection(
                "{}_{}_{}".format(dtype, dformat, dfreq),
                check_exists=True
            )
        return True

    def _check_contiguous_series(self, record_start, record_end, new_start, new_end):
        return new_start <= record_end and record_start <= new_end

    """
    SECTION :: TIMESERIES
    """
    async def asyn_insert_timeseries_df(self, dtype, dformat, dfreq, df, series_metadata, series_identifier, metalogs=""):
        return await self.asyn_batch_insert_timeseries_df(
            dtype=dtype,
            dformat=dformat,
            dfreq=dfreq,
            dfs=[df],
            series_metadatas=[series_metadata],
            series_identifiers=[series_identifier],
            metalogs=[metalogs]
        )

    async def asyn_batch_insert_timeseries_df(
        self, dtype, dformat, dfreq, dfs, series_metadatas, series_identifiers, metalogs=[]
    ):
        print(f"START BATCH INSERTING ID {metalogs}")
        # drop rows before the timezone-aware cutoff; naive datetime columns fail this comparison
        dfs = [df.loc[df["datetime"] >= DbService.earliest_utc_cutoff] for df in dfs]
        assert (all([len(df) > 0 for df in dfs]))
        self._ensure_coll(dtype=dtype, dformat=dformat, dfreq=dfreq, coll_type="timeseries")
        series_starts = [
            datetime.fromtimestamp(
                int(df["datetime"].values[0]) / 1e9, tz=pytz.utc
            )
            for df in dfs
        ]
        series_ends = [
            datetime.fromtimestamp(
                int(df["datetime"].values[-1]) / 1e9, tz=pytz.utc
            )
            for df in dfs
        ]

        docs = await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)) \
            .find({"$or": series_identifiers}).to_list(length=None)

        meta_records = DbService.match_identifiers_to_docs(series_identifiers, docs)
        new_inserts_seriess = []
        new_inserts_series_metas = []
        new_updates_series_metas = []
        update_identifers, new_heads, new_tails, time_starts, time_ends = [], [], [], [], []
        for i in range(len(series_identifiers)):
            matched = meta_records[i]
            if len(matched) == 0:
                doc = {
                    **series_identifiers[i],
                    **{
                        "time_start": series_starts[i],
                        "time_end": series_ends[i],
                        "last_updated": datetime.now(pytz.utc)
                    }
                }
                records = DbService.unroll_df(dfs[i], series_metadatas[i])
                new_inserts_seriess.extend(records)
                new_inserts_series_metas.append(doc)
            elif len(matched) == 1:
                meta_record = matched[0]
                meta_start = pytz.utc.localize(meta_record["time_start"])
                meta_end = pytz.utc.localize(meta_record["time_end"])
                contiguous_series = self._check_contiguous_series(
                    record_start=meta_start,
                    record_end=meta_end,
                    new_start=series_starts[i],
                    new_end=series_ends[i]
                )
                if not contiguous_series:
                    db_logs \
                        .DBLogs() \
                        .error(f"{sys._getframe().f_code.co_name} got discontiguous series::{metalogs[i]}")
                    continue
                # only the head (before existing data) and tail (after existing data) need inserting
                new_head = dfs[i].loc[dfs[i]["datetime"] < meta_start]
                new_tail = dfs[i].loc[dfs[i]["datetime"] > meta_end]
                if len(new_head) + len(new_tail) > 0:
                    new_heads.append(new_head)
                    new_tails.append(new_tail)
                    time_starts.append(min(series_starts[i], meta_start))
                    time_ends.append(max(series_ends[i], meta_end))
                    update_identifers.append(series_identifiers[i])
            else:
                db_logs.DBLogs().error(f"{sys._getframe().f_code.co_name} got meta series corruption, series count gt 1::{metalogs[i]}")
                exit()

        unrolled_heads = await DbService.pool_unroll_df(new_heads, update_identifers)
        unrolled_tails = await DbService.pool_unroll_df(new_tails, update_identifers)

        for head_records in unrolled_heads:
            new_inserts_seriess.extend(head_records)
        for tail_records in unrolled_tails:
            new_inserts_seriess.extend(tail_records)
        for i in range(len(new_heads)):
            new_updates_series_metas.append(UpdateOne(
                update_identifers[i],
                {"$set": {
                    "time_start": time_starts[i],
                    "time_end": time_ends[i],
                    "last_updated": datetime.now(pytz.utc),
                }}
            ))

        if new_inserts_seriess:
            await (await self._asyn_get_collection(dtype, dformat, dfreq)).insert_many(new_inserts_seriess, ordered=False)
        if new_inserts_series_metas:
            await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)).insert_many(new_inserts_series_metas, ordered=False)
        if new_updates_series_metas:
            await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)).bulk_write(new_updates_series_metas, ordered=False)

        print(f"FINISH BATCH INSERTING ID {metalogs}")
        return True

    async def asyn_read_timeseries(
        self, dtype, dformat, dfreq, period_start, period_end, series_metadata, series_identifier, metalogs=""
    ):
        result_ranges, result_dfs = await self.asyn_batch_read_timeseries(
            dtype=dtype,
            dformat=dformat,
            dfreq=dfreq,
            period_start=period_start,
            period_end=period_end,
            series_metadatas=[series_metadata],
            series_identifiers=[series_identifier],
            metalogs=[metalogs]
        )
        return (result_ranges[0], result_dfs[0])

    async def asyn_batch_read_timeseries(
        self, dtype, dformat, dfreq, period_start, period_end, series_metadatas, series_identifiers, metalogs=[]
    ):
        assert (not period_start or period_start > DbService.earliest_utc_cutoff)
        assert (not period_end or period_end > DbService.earliest_utc_cutoff)
        self._ensure_coll(dtype=dtype, dformat=dformat, dfreq=dfreq, coll_type="timeseries")
        docs = await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)) \
            .find({"$or": series_identifiers}).to_list(length=None)
        series_records = DbService.match_identifiers_to_docs(series_identifiers, docs)
        series_filter = get_mongo_datetime_filter(period_start=period_start, period_end=period_end)

        # read the matching documents within the requested window and re-localize to UTC
        async def poll_record(i):
            matched = series_records[i]
            assert (len(matched) <= 1)
            if len(matched) == 0:
                return (None, None), pd.DataFrame()
            if len(matched) == 1:
                records = await (await self._asyn_get_collection(dtype, dformat, dfreq)).find(
                    {
                        **series_filter,
                        **{"metadata.{}".format(k): v for k, v in series_metadatas[i].items()}
                    }
                ).to_list(length=None)
                record_df = pd.DataFrame(records).drop(columns=["metadata", "_id"]) if records else pd.DataFrame()
                record_start, record_end = None, None
                if len(record_df) > 0:
                    record_df["datetime"] = pd.to_datetime(record_df["datetime"]).dt.tz_localize(pytz.utc)
                    record_start, record_end = record_df["datetime"].values[0], record_df["datetime"].values[-1]
                    record_start = datetime.fromtimestamp(int(record_start) / 1e9, tz=pytz.utc)
                    record_end = datetime.fromtimestamp(int(record_end) / 1e9, tz=pytz.utc)
                return (record_start, record_end), record_df

        db_polls = await asyncio.gather(*[poll_record(i) for i in range(len(series_identifiers))], return_exceptions=False)
        result_ranges = [db_poll[0] for db_poll in db_polls]
        result_dfs = [db_poll[1] for db_poll in db_polls]
        return result_ranges, result_dfs

Listing 7: db/db_service.py
