Improving Our Database Service
HangukQuant1, 2
October 9, 2022
Abstract
This paper builds upon the previous paper [1] to reduce latency of our database read writes.
We also improve the code structure and code functionality, and employ tips and tricks to improve
our database service by reducing idle time. We use a single process, single threaded concurrency
model, making no assumptions about the CPU resources available to the developer - and achieve
superior results to our previous discussions which utilized multiprocessing. Using the previous
benchmark (Downloading & Populating Full History for 300 tickers for Unpopulated DB with
Read and Write Capacity), our best result in the previous paper was 3 minutes. We improve
this to 1 minute.
*1: [email protected], hangukquant.substack.com
2
*2: DISCLAIMER: the contents of this work are not intended as investment, legal, tax or any other advice, and
is for informational purposes only. It is illegal to make unauthorized copies, forward to an unauthorized user or to
post this article electronically without express written consent by HangukQuant.
Contents
1 Introduction
5 Generic Wrapper
8 Conclusion
A db_service.py
1 Introduction
In the previous paper [1], we presented code with the following results:
    N    N    N    N      14.5M
    Y    N    N    N       9.5M
    Y    Y    Y    N         9M
    Y    Y    Y    Y         3M

Table 1: Results of Our Previous Paper: Downloading & Populating Full History for
300 tickers for Unpopulated DB with Read and Write Capacity. Each row lists the Y/N
optimization flags from [1] together with the resulting runtime in minutes.
However, this still begged a better solution. Upon code profiling (see Paper: Flirting with
CPUs [2] on code profiling), there were two expensive operations that primarily slowed us down.
Firstly, unrolling the dataframe received and pressing it into the desired schema of dictionaries
was a costly operation that significantly slowed us down. Unfortunately, this is an irreducible
cost - there is not much we can do about the time taken to create database entries that fit our
schema, and the only way we dealt with this was to use multiple CPUs for the task. However,
this improvement was merely a bandage around the core issue of poor utilization of compute.
Secondly, when submitting large batches, we attempted to get all of the data from the API at
once - and this caused the server to ask us to resubmit our requests after some time. However,
since the current ‘batch’ was not yet fully retrieved, we performed no work in the meantime
while waiting to poll data. This inefficiency means that we can get an improvement by
alternating between the two tasks: unrolling mini-batches of dataframes and waiting for
API credits.
This paper explores a clean code structure that performs the task with the asyncio event loop
and essentially puts our CPU to work at close to 100%. We see that even without multiple
processes, we can shorten the run time down to 1M for the same benchmark. We will also improve
the functionality of our code so that we can employ different data polling engines, such as
different data libraries, without changing the higher-level interface of our data service.
As usual, the same caveat applies. You should not expect the papers to be error-free, or emulate
perfection in any sense of the word. Books are often written over a few years, with consultations
and revisions by domain experts, and then mulled over and re-written. This instead is a product
of weekly musings, with significantly less mulling. There will inevitably be mistakes, misnomers
and shortfalls: please email me at [email protected] if you find any. All mistakes are my own.
The power of having our own database, as opposed to sticking to polling data from an API on
demand, is that we can design our own custom requests, so long as our operations support it. We
will drop support for the synchronous functionalities and focus on the asynchronous functionalities
provided, since there is not much reason for us to maintain both. Using coroutines gives us the
advantage of working with lightweight objects - threads take far fewer resources than processes,
due to sharing of the same memory space, and coroutines carry even less overhead than threads,
since the event loop does not need to deal with OS-level context switching. For demonstrative
purposes, we will work with the most common functionality, which is to retrieve timeseries data.
Recall our asyn_batch_get_ohlcv functionality in our equities service class. It has an interface
of something like this:
Listing 1: equities.py
We were required to provide the end time of some timeseries, and then either the period start or
period days to get the window for the data that we want. Let’s modify it to be more flexible,
so that we can have more general behavior. For instance, specifying start and duration will give
us [start, start + duration], specifying nothing gives us all the data in our database, and so on.
The logic will be as follows:
from dateutil.relativedelta import relativedelta

    # (excerpt) tail of the utility that converts the requested duration,
    # under the given granularity, into a relativedelta offset
    deltatime = relativedelta(hours=duration)
    if granularity == "d":
        deltatime = relativedelta(days=duration)
    if granularity == "M":
        deltatime = relativedelta(months=duration)
    if granularity == "y":
        deltatime = relativedelta(years=duration)
    assert(deltatime)
    return deltatime
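To make the intended behavior concrete, here is a minimal sketch of how get_span might resolve its
arguments into a window - an illustration under assumptions, not the author's exact code; the helper
name get_deltatime is hypothetical and stands in for the duration-conversion utility excerpted above.

def get_span(period_start=None, period_end=None, duration=None, granularity="d"):
    # unbounded request: no window constraints at all, read everything in the database
    if not period_start and not period_end and not duration:
        return None, None
    deltatime = get_deltatime(granularity, duration) if duration else None  # hypothetical helper name
    if period_start and duration:
        return period_start, period_start + deltatime   # [start, start + duration]
    if period_end and duration:
        return period_end - deltatime, period_end       # [end - duration, end]
    return period_start, period_end                     # both endpoints given explicitly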
A None value represents an unbounded request, which means we just want to read the database
and get all the relevant data. We define an internal nomenclature for granularity, which defines the
granularity of the timeseries under consideration. Here, we support granularities corresponding to
tick, seconds, minutes, hours, days, months and years. This utility function allows us to ‘decode’
the amount of data that the caller of the equity service class wants. We shall then map our internal
granularities to the nomenclature accepted by an external API, as well as the frequency of
timeseries that our MongoDB API recognises.
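As an illustration of such a mapping (the exact symbols and API strings are assumptions, not the
author's values; note that MongoDB timeseries collections only recognise 'seconds', 'minutes' and
'hours' as granularities):

GRANULARITY_TO_MONGO_FREQ = {
    "s": "seconds", "m": "minutes", "h": "hours",
    "d": "hours", "M": "hours", "y": "hours",   # coarser series fall back to the coarsest Mongo bucket
}
GRANULARITY_TO_API = {"d": "d", "M": "m", "y": "m"}   # illustrative external-API period codes (assumed)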
We also want to decouple the actual equity service class from the data provider. Previously, we
made calls to the eodhistoricaldata API directly inside the equity service class. But what if we
want to include multiple different data providers, such as Yahoo Finance, the MetaTrader terminal,
or Interactive Brokers?
Does getting an equities data timeseries fundamentally differ from getting tick data or macroeconomic
GDP data? Not really - the general structure is that we first get the data from the timeseries
database, find which periods we are missing data for, make those requests using an API if desired,
and return the result to the caller. The function call to get OHLCV should be no different from
one to get a record of rain volume.
Our new equities service class and get OHLCV function will look something like the following:
class Equities():

    def __init__(self, data_clients={}, db_service=None):
        self.data_clients = data_clients
        self.db_service = db_service

        # ... (method signature and the eodhistoricaldata branch omitted in this excerpt) ...
        elif engine == "DWX-MT5":
            datapoller = await self.data_clients["meta_client"].get_metaapi_data_manager()
            dformat = "CFD"

        return await asyn_batch_get_timeseries(
            tickers=tickers,
            exchanges=exchanges,
            read_db=read_db,
            insert_db=insert_db,
            granularity=granularity,
            db_service=self.db_service,
            datapoller=datapoller,
            engine=engine,
            dtype=Equities.dtype,
            dformat=dformat,
            period_start=period_start,
            period_end=period_end,
            duration=duration,
            chunksize=chunksize,
            results=[],
            tasks=[],
            batch_id=0
        )
Listing 3: equities.py
This is a much cleaner structure! For each data source we want to add to our service class,
all we have to do is specify the engine, provide a datapoller object and specify the contract type,
whether it is ‘spot’, ‘CFD’, ‘futures’ and so on.
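For example, supporting a hypothetical Yahoo Finance engine would be a matter of one more branch
inside the same method (the names below are illustrative, not from the paper's code):

        elif engine == "yfinance":
            datapoller = self.data_clients["yahoo_client"]   # any object exposing asyn_batch_get_ohlcv
            dformat = "spot"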
5 Generic Wrapper
So what does the generic wrapper look like? It implements the asyn_batch_get_timeseries logic,
and is only aware that it is getting some timeseries data. Recall that we want to perform the
data schema unrolling and writing in between API requests. The former will inevitably take CPU
compute, while the API requests need to be spaced out to prevent overloading the API server and
earning us a TooManyRequests status code.
We shall present the function first and then go through each section in more detail - this is the
hardest part of the paper. I promise you this will be a headache, but pain is good. First, the
whole function:
import asyncio
import pandas as pd

async def asyn_batch_get_timeseries(
    tickers, exchanges, read_db, insert_db, granularity,
    db_service, datapoller, engine, dtype, dformat,
    period_start=None, period_end=None, duration=None,
    chunksize=100, results=[], tasks=[], batch_id=0
):
    print(f"START BATCH {batch_id}")
    assert(engine in ["eodhistoricaldata", "DWX-MT5"])
    if not tickers:
        print("gathering results")
        await asyncio.gather(*tasks)
        return results

    series_metadatas = None
    if engine == "eodhistoricaldata":
        series_metadatas = [
            {
                "ticker": tickers[i],
                "source": f"eodhistoricaldata-{exchanges[i]}"
            }
            for i in range(len(tickers))
        ]
    if engine == "DWX-MT5":
        series_metadatas = [
            {
                "ticker": tickers[i],
                "source": engine
            }
            for i in range(len(tickers))
        ]

    assert(series_metadatas)
    series_identifiers = [{**{"type": "ticker_series"}, **series_metadata}
                          for series_metadata in series_metadatas]

    # ... (database read, span resolution and request construction - shown in the sections below) ...

    request_tickers = []
    request_exchanges = []
    request_starts, request_ends = [], []
    request_metadatas, request_identifiers = [], []
    for i in range(len(tickers)):
        ticker = tickers[i]
        v = requests[ticker]
        request_tickers.extend([ticker] * len(v))
        request_exchanges.extend([spec["exchange"] for spec in v])
        request_starts.extend([spec["period_start"] for spec in v])
        request_ends.extend([spec["period_end"] for spec in v])
        request_metadatas.extend([series_metadatas[i]] * len(v))
        request_identifiers.extend([series_identifiers[i]] * len(v))

    # ... (API polling and stitching of head/tail dataframes - shown in the sections below) ...
            .drop_duplicates("datetime")
            .reset_index(drop=True)
        ohlcvs.append(df)
    assert(j == len(request_results))

    if insert_db:
        insert_tickers = [
            tickers[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_ohlcvs = [
            ohlcvs[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_series_metadatas = [
            series_metadatas[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]
        insert_series_identifiers = [
            series_identifiers[i] for i in range(len(tickers)) if not ohlcvs[i].empty
        ]

    # ... (database write task scheduling - shown in the sections below) ...
    return await asyn_batch_get_timeseries(
        temp_tickers[chunksize:], temp_exchanges[chunksize:], read_db, insert_db,
        granularity, db_service, datapoller, engine, dtype, dformat, period_start, period_end,
        duration, chunksize, results, tasks, batch_id + 1
    )
The asyn_batch_get_timeseries function processes the batches in chunks of size chunksize. Upon
entry, it takes the first chunksize tickers and exchanges (these variables might be badly named,
since we are handling general timeseries data!) and processes them. Then, if a database write is
required (which involves computationally costly schema creation), we spawn a task to put on the
asyncio event loop. However, we do not wait for the database write to return - instead we call the
next mini-batch with a recursive call, excluding the first chunksize of work. The results parameter
gets built up down the call stack, and the database write tasks put on the event loop are collected
in the tasks parameter. These tasks sit on the event loop, and whenever our datapoller is waiting
on API requests over the network, the pending tasks opportunistically grab the CPU compute to
create our data schema and perform database writes. Eventually, as the work remaining decreases
down the call stack, we reach our last mini-batch of size less than or equal to chunksize. We then
run out of tickers to process, and this is our base case:
if not tickers:
    print("gathering results")
    await asyncio.gather(*tasks)
    return results
which waits for all the previously scheduled database write tasks to complete. The results that have
been built up are then returned back up the call stack to the caller. The code section after the base
case creates the supporting schema based on the data source of choice, so that we can see where our
populated database got its results from.
Next, we have
result_dfs = []
if read_db:
    result_ranges, result_dfs = await db_service.asyn_batch_read_timeseries(
        dtype=dtype,
        dformat=dformat,
        dfreq=granularity,
        period_start=period_start,
        period_end=period_end,
        series_metadatas=series_metadatas,
        series_identifiers=series_identifiers,
        metalogs=batch_id
    )
if not period_start or not period_end:
    return result_dfs if result_dfs else [pd.DataFrame() for _ in range(len(tickers))]
This essentially reads from the database if we enable reads. The return values are result_ranges,
where each entry is a pair containing the start and the end of the data read, and result_dfs, the
dataframes themselves. If the data request is unbounded, we return the retrieved dataframes back
to the caller without seeking any data from the external API. This is the behavior we discussed in
get_span in the previous section! We can also get a (None, None) as the range, which indicates
that the wanted data was not found in the database.
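To make the three possibilities concrete, here is a sketch of how a single entry is interpreted
(illustrative only, not the author's code):

result_start, result_end = result_ranges[i]
if result_start is None and result_end is None:
    pass  # nothing in the DB for this series: the whole [period_start, period_end] window must be requested
elif period_start < result_start or period_end > result_end:
    pass  # partially cached: only the missing head and/or tail needs to be requested
else:
    pass  # fully cached: no external API request is needed for this series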
Next, depending on the data that we want but do not yet have, we build up the request parameters
that we need. This is done by the following code section:
    assert(result_start and result_end)

    # ... (construction of the per-ticker `requests` specs elided in this excerpt) ...

    request_tickers = []
    request_exchanges = []
    request_starts, request_ends = [], []
    request_metadatas, request_identifiers = [], []
    for i in range(len(tickers)):
        ticker = tickers[i]
        v = requests[ticker]
        request_tickers.extend([ticker] * len(v))
        request_exchanges.extend([spec["exchange"] for spec in v])
        request_starts.extend([spec["period_start"] for spec in v])
        request_ends.extend([spec["period_end"] for spec in v])
        request_metadatas.extend([series_metadatas[i]] * len(v))
        request_identifiers.extend([series_identifiers[i]] * len(v))
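The construction of the requests dictionary itself is not shown in the excerpt above; a plausible
sketch, consistent with the stitching logic that follows, might look like this (an assumption on
our part, not the author's code):

requests = {}
for i in range(len(tickers)):
    result_start, result_end = result_ranges[i]
    specs = []
    if result_start is None and result_end is None:
        # nothing cached: request the full window
        specs.append({"exchange": exchanges[i], "period_start": period_start, "period_end": period_end})
    else:
        if period_start < result_start:   # missing head
            specs.append({"exchange": exchanges[i], "period_start": period_start, "period_end": result_start})
        if period_end > result_end:       # missing tail
            specs.append({"exchange": exchanges[i], "period_start": result_end, "period_end": period_end})
    requests[tickers[i]] = specs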
Next, our datapoller uses the built-up requests to ask for the data required. We do not know
anything about the nature of the datapoller, except that it implements the method
asyn_batch_get_ohlcv. This datapoller could be a Yahoo Finance API wrapper, some Quandl
wrapper and so on.
request_results = await datapoller.asyn_batch_get_ohlcv(
    tickers=request_tickers,
    exchanges=request_exchanges,
    period_starts=request_starts,
    period_ends=request_ends,
    granularity=granularity
)
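Any object that honors this contract will do. A minimal stub datapoller, purely illustrative and
not part of the paper's code, could be as simple as:

import pandas as pd

class DummyPoller:
    """Hypothetical datapoller: the wrapper only relies on asyn_batch_get_ohlcv."""
    async def asyn_batch_get_ohlcv(self, tickers, exchanges, period_starts, period_ends, granularity):
        # return one OHLCV dataframe per requested (ticker, window), aligned with the input order;
        # a real poller would hit an external API here
        return [pd.DataFrame(columns=["datetime", "open", "high", "low", "close", "volume"])
                for _ in tickers]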
The remainder
j = 0
ohlcvs = []
for i in range(len(tickers)):
    result_start, result_end = result_ranges[i]
    db_df = result_dfs[i]
    if not result_start and not result_end:
        ohlcvs.append(request_results[j])
        j += 1
        continue
    head_df, tail_df = pd.DataFrame(), pd.DataFrame()
    if period_start < result_start:
        head_df = request_results[j]
        j += 1
    if period_end > result_end:
        tail_df = request_results[j]
        j += 1
    concat_dfs = [head_df, db_df, tail_df]
    df = pd.concat(concat_dfs, axis=0).drop_duplicates("datetime").reset_index(drop=True)
    ohlcvs.append(df)
‘stitches’ together the relevant data timeseries using the database result, the requested head, and
the requested tail to obtain all the available data from period_start to period_end.
# (the opening lines of this snippet are reconstructed from the surrounding description)
task = asyncio.create_task(
    db_service.asyn_batch_insert_timeseries_df(
        dtype=dtype,
        dformat=dformat,
        dfreq=granularity,
        dfs=insert_ohlcvs,
        series_identifiers=insert_series_identifiers,
        series_metadatas=insert_series_metadatas,
        metalogs=batch_id
    )
)
This takes the coroutine asyn_batch_insert_timeseries_df from our database service object and wraps
it in a task. Awaiting the coroutine directly here would prevent us from moving on to the next
batch, since the coroutine blocks on await. Instead, we wrap it in a task and schedule it on the
event loop. However, note that task creation does not by itself trigger the event loop. We need to
defer to the event loop with await asyncio.sleep(0), which runs an iteration of the event loop, after
which the task is waiting to run. The pending tasks are built up with tasks.append(task) and results
are built up with results.extend(ohlcvs), which are passed on to the next recursive call in
return await asyn_batch_get_timeseries(
    temp_tickers[chunksize:], temp_exchanges[chunksize:], read_db, insert_db,
    granularity, db_service, datapoller, engine, dtype, dformat, period_start, period_end,
    duration, chunksize, results, tasks, batch_id + 1
)
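The scheduling pattern itself is generic asyncio and can be demonstrated in isolation; the toy
example below (not from the paper) shows tasks being created, given one event-loop iteration to
start via asyncio.sleep(0), and gathered at the end, just like the wrapper does.

import asyncio

async def write_batch(batch_id):
    await asyncio.sleep(0.1)          # stands in for a database write awaiting I/O
    return f"wrote batch {batch_id}"

async def main():
    tasks = []
    for batch_id in range(3):
        tasks.append(asyncio.create_task(write_batch(batch_id)))  # scheduled, not awaited
        await asyncio.sleep(0)        # yield once so the event loop actually starts the task
        # CPU-bound work on the next mini-batch would happen here
    print(await asyncio.gather(*tasks))

asyncio.run(main())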
For those who are not familiar with recursion in programming, this might not make any sense.
Take the red pill and catch up on it.
While that was an uncomfortably long function to comprehend, the nice thing about it is that it
should be fairly invariant to future changes! We should be able to plug in new datapollers with
minimal code changes, as long as the datapoller implements the ‘contract’ asyn_batch_get_ohlcv.
An even cleaner structure would be to enforce this contract with the use of abstract base classes
for a Datapoller interface - which we will leave for another day. Other service classes, such as the
commodities service class and fx service class, should use the same generic wrapper, keeping the
service class implementations modularised and unconcerned with how the timeseries logic is
implemented internally.
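For the curious, such an interface could be as small as the sketch below (our own illustration,
not part of the paper's code base):

from abc import ABC, abstractmethod

class Datapoller(ABC):
    """Contract that the generic wrapper relies on."""
    @abstractmethod
    async def asyn_batch_get_ohlcv(self, tickers, exchanges, period_starts, period_ends, granularity):
        ...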
Now, we just need to know how datapollers work and review the read-write functionality in
our database service object. There are some improvements we want to make, in terms of code
structure and database integrity. The first is that we want to remove redundant code - recall that
we implemented two separate functionalities, one for getting in batch (which is more efficient due
to the use of session-ing network requests), and one for getting a single result. The singular request
is just a special case of the batch request, where the batch is a singleton. We can hence focus all
our attention on the batch functions and let the singleton request call the batch request.
Secondly, we want to resolve the issue of different timezones in the data received. Unfortunately,
when retrieving our submitted data from the Mongo database, the timezone-awareness properties
are lost. Hence, we want to ensure that our database series only accepts datetime objects in a
timezone-aware state and immediately does the conversion back on retrieval. Passing in a
timezone-naive timeseries dataframe will fail, and we maintain the integrity of our data in this way.
import os
import pytz
import json
import aiohttp
import requests
import calendar
import numpy as np
import pandas as pd
import urllib
import urllib.parse
import aiohttp
import asyncio
import data_service.db_logs as db_logs

async def async_aiohttp_get_all(urls, fmt="json", max_tries=10, sleep_seconds=3):
    async with aiohttp.ClientSession() as session:
        async def fetch(url, tries=0):
            async with session.get(url) as response:
                db_logs.DBLogs().debug("try {} {}".format(url, tries))
                assert(fmt == "json")  # only json format accepted for now
                # ... (response handling and retry logic elided in this excerpt) ...
        return await asyncio.gather(*[fetch(url) for url in urls], return_exceptions=True)
The sleep is not a big issue since we are not idling anymore - we give up the current coroutine's
execution, but the event loop instead runs the database write tasks which were queued up on the
event loop before! The CPU is still doing useful work and no resources are wasted.
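The retry body is not shown in the excerpt, but based on the description a plausible sketch of the
inside of fetch is the following (an assumption about the exact structure, not the author's code):

# on a rejected response, back off without blocking the loop, then resubmit
if response.status != 200 and tries < max_tries:
    await asyncio.sleep(sleep_seconds)          # yields to the event loop: queued DB writes run here
    return await fetch(url, tries=tries + 1)
return await response.json()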
We just need to finish up by writing the code for the timeseries functions in our database service
class. Since the logic is very similar to the previous paper, we provide the code in Appendix A
without elaborating, and only highlight a few key points. Since it is rather lengthy, we will
provide the code files for generic_wrapper.py and db_service.py for your convenience. Note that
these are not the full code files and only contain the functionalities discussed in this paper!
When inserting new data, the datetime column is now compared to a timezone-aware cutoff,
which throws an exception if our datetime is not timezone-aware. This helps with database
integrity.
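The exact cutoff used is not shown in this excerpt; the snippet below illustrates the mechanism with
an assumed cutoff value - comparing a timezone-aware cutoff against a naive datetime column raises
a TypeError in pandas, which is what rejects naive inserts.

import pandas as pd
import pytz
from datetime import datetime

cutoff = datetime(1970, 1, 1, tzinfo=pytz.utc)   # assumed value, for illustration only

naive = pd.DataFrame({"datetime": pd.to_datetime(["2022-01-03"])})
aware = pd.DataFrame({"datetime": pd.to_datetime(["2022-01-03"]).tz_localize(pytz.utc)})

try:
    (naive["datetime"] > cutoff).all()
except TypeError:
    print("rejected: naive datetimes cannot be compared against the aware cutoff")

assert (aware["datetime"] > cutoff).all()   # timezone-aware data passes the check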
@staticmethod
def unroll_df(df, metadata):
records = df.to_dict(orient="records")
for record in records: record.update({"metadata" : metadata})
return records
The previous schema unroll required creating new dictionaries. This is costly! Profiling the
code showed that adding new key-value pairs to an existing dictionary with update is faster than
building and rehashing a new dictionary.
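A quick, illustrative micro-benchmark of the two approaches (numbers will vary by machine; the
record and metadata below are made up):

import timeit

record = {"open": 1.0, "high": 2.0, "low": 0.5, "close": 1.5, "volume": 100}
metadata = {"ticker": "AAPL", "source": "eodhistoricaldata-US"}

copy_and_rehash = timeit.timeit(lambda: {**record, "metadata": metadata}, number=1_000_000)
update_in_place = timeit.timeit(lambda: record.update({"metadata": metadata}), number=1_000_000)
print(copy_and_rehash, update_in_place)   # updating in place inserts one key; copying rehashes them all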
@staticmethod
async def pool_unroll_df(dfs, metadatas, use_pool=False):
if not use_pool:
return [DbService.unroll_df(df, metadata) for df, metadata
in zip(dfs, metadatas)]
If the dataframes are large enough, it is worth the overhead to do the unrolling using new
processes. We can do this using a process pool, and hooking it to an executor with the asyncio
module.
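A sketch of what the use_pool branch might look like, using asyncio's executor hook with a process
pool (the structure is assumed, since that branch is not shown in the paper's excerpt):

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def pool_unroll_df(dfs, metadatas, use_pool=False):
    if not use_pool:
        return [DbService.unroll_df(df, metadata) for df, metadata in zip(dfs, metadatas)]
    # spawning a pool has a fixed cost, so this only pays off for large dataframes
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        futures = [loop.run_in_executor(pool, DbService.unroll_df, df, metadata)
                   for df, metadata in zip(dfs, metadatas)]
        return await asyncio.gather(*futures)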
This is because we made the function synchronous, due to the possibility of race conditions
(even in a single-threaded concurrency model! - see the previous paper for why this is so). We add
an option to disable the checks, which can be used for projects where the different collection types
have already matured.
The remainder of the logic is fairly similar, with the addition of timezone awareness on data
retrieved from the database.
8 Conclusion
We have managed to work through some examples that required some technical depth and knowledge
of the underlying Python mechanisms. In the previous paper, we applied well-known, common tools
to try to make our code performant. These tools could be applied with only a general distinction
between CPU-intensive and I/O tasks. While they helped (and were rather easy to employ), a
deeper understanding of how the different tasks are handled and where the bottlenecks lie revealed
that we were still idling serious computing power. Using asyncio, we showed how to recover this
efficiency. We also improved our database functionality to be more flexible and to accommodate
more use cases without increasing complexity. This was achieved with better standards in
separation of concerns.
References
A db_service.py
import os
import sys
import json
import pytz
import motor
import asyncio
import pathlib
import pymongo
import pandas as pd
import motor.motor_asyncio
import data_service.db_logs as db_logs
from datetime import datetime  # needed for datetime.now / datetime.fromtimestamp below
class DbService():

    def __init__(self, db_config_path=str(pathlib.Path(__file__).parent.resolve()) + "/config.json"):
        with open(db_config_path, "r") as f:
            config = json.load(f)
        os.environ['MONGO_CLUSTER'] = config["mongo_cluster"]
        os.environ['MONGO_DB'] = config["mongo_db"]
        self.mongo_cluster = pymongo.MongoClient(os.getenv("MONGO_CLUSTER"))
        self.mongo_db = self.mongo_cluster[os.getenv("MONGO_DB")]
        self.asyn_mongo_cluster = motor.motor_asyncio.AsyncIOMotorClient(os.getenv("MONGO_CLUSTER"))
        self.asyn_mongo_db = self.asyn_mongo_cluster[os.getenv("MONGO_DB")]
    @staticmethod
    def unroll_df(df, metadata):
        records = df.to_dict(orient="records")
        for record in records: record.update({"metadata": metadata})
        return records

    @staticmethod
    async def pool_unroll_df(dfs, metadatas, use_pool=False):
        if not use_pool:
            return [DbService.unroll_df(df, metadata) for df, metadata
                    in zip(dfs, metadatas)]
    @staticmethod
    def match_identifiers_to_docs(identifiers, docs):
        records = []
        for identifier in identifiers:
            matched = []
            for doc in docs:
                if identifier.items() <= doc.items():
                    matched.append(doc)
            records.append(matched)
        return records
                "{}_{}_{}".format(dtype, dformat, dfreq),
                timeseries={
                    'timeField': 'datetime',
                    'metaField': 'metadata',
                    'granularity': frequency
                },
                check_exists=True
            )
            self.mongo_db.drop_collection("{}_{}_{}-meta".format(dtype, dformat, dfreq))
            self.mongo_db.create_collection(
                "{}_{}_{}-meta".format(dtype, dformat, dfreq)
            )
        if not exists and coll_type == "regular":
            self.mongo_db.create_collection(
                "{}_{}_{}".format(dtype, dformat, dfreq),
                check_exists=True
            )
        return True
122 """
123 SECTION :: TIMESERIES
124 """
125 async def a s y n _ i n s e r t _ t i m e s e r i e s_ d f ( self , dtype , dformat , dfreq , df ,
series_metadata , series_identifier , metalogs = " " ) :
126 return await self . a s y n _ b a t c h _ i n s e r t _ t i m e s e r i e s _ d f (
127 dtype = dtype ,
128 dformat = dformat ,
129 dfreq = dfreq ,
130 dfs =[ df ] ,
131 series_metadatas =[ series_metadata ] ,
132 series_identifiers =[ series_identifier ] ,
133 metalogs =[ metalogs ]
29
134 )
135
            if len(matched) == 0:
                doc = {
                    **series_identifiers[i],
                    **{
                        "time_start": series_starts[i],
                        "time_end": series_ends[i],
                        "last_updated": datetime.now(pytz.utc)
                    }
                }
                records = DbService.unroll_df(dfs[i], series_metadatas[i])
                new_inserts_seriess.extend(records)
                new_inserts_series_metas.append(doc)
            elif len(matched) == 1:
                meta_record = matched[0]
                meta_start = pytz.utc.localize(meta_record["time_start"])
                meta_end = pytz.utc.localize(meta_record["time_end"])
                contiguous_series = self._check_contiguous_series(
                    record_start=meta_start,
                    record_end=meta_end,
                    new_start=series_starts[i],
                    new_end=series_ends[i]
                )
                if not contiguous_series:
                    db_logs.DBLogs().error(f"{sys._getframe().f_code.co_name} got discontiguous series::{metalogs[i]}")
                    continue
                new_head = dfs[i].loc[dfs[i]["datetime"] < meta_start]
                new_tail = dfs[i].loc[dfs[i]["datetime"] > meta_end]
                if len(new_head) + len(new_tail) > 0:
                    new_heads.append(new_head)
                    new_tails.append(new_tail)
                    time_starts.append(min(series_starts[i], meta_start))
                    time_ends.append(max(series_ends[i], meta_end))
                    update_identifers.append(series_identifiers[i])
            else:
                db_logs.DBLogs().error(f"{sys._getframe().f_code.co_name} got meta series corruption, series count gt 1::{metalogs[i]}")
                exit()
        if new_inserts_seriess:
            await (await self._asyn_get_collection(dtype, dformat, dfreq)).insert_many(
                new_inserts_seriess, ordered=False)
        if new_inserts_series_metas:
            await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)).insert_many(
                new_inserts_series_metas, ordered=False)
        if new_updates_series_metas:
            await (await self._asyn_get_collection_meta(dtype, dformat, dfreq)).bulk_write(
                new_updates_series_metas, ordered=False)
    async def asyn_read_timeseries(
        self, dtype, dformat, dfreq, period_start, period_end, series_metadata, series_identifier, metalogs=""
    ):
        result_ranges, result_dfs = await self.asyn_batch_read_timeseries(
            dtype=dtype,
            dformat=dformat,
            dfreq=dfreq,
            period_start=period_start,
            period_end=period_end,
            series_metadatas=[series_metadata],
            series_identifiers=[series_identifier],
            metalogs=[metalogs]
        )
        return (result_ranges[0], result_dfs[0])
            # (the head of the first statement is reconstructed; it continues from the previous page of the excerpt)
            records = await (await self._asyn_get_collection(dtype, dformat, dfreq)).find(
                {
                    **series_filter,
                    **{"metadata.{}".format(k): v for k, v in series_metadatas[i].items()}
                }
            ).to_list(length=None)
            record_df = pd.DataFrame(records).drop(columns=["metadata", "_id"]) if records else pd.DataFrame()
            record_start, record_end = None, None
            if len(record_df) > 0:
                record_df["datetime"] = pd.to_datetime(record_df["datetime"]).dt.tz_localize(pytz.utc)
                record_start, record_end = record_df["datetime"].values[0], record_df["datetime"].values[-1]
                record_start = datetime.fromtimestamp(int(record_start) / 1e9, tz=pytz.utc)
                record_end = datetime.fromtimestamp(int(record_end) / 1e9, tz=pytz.utc)
            return (record_start, record_end), record_df

        db_polls = await asyncio.gather(*[poll_record(i) for i in range(len(series_identifiers))],
                                        return_exceptions=False)
        result_ranges = [db_poll[0] for db_poll in db_polls]
        result_dfs = [db_poll[1] for db_poll in db_polls]
        return result_ranges, result_dfs