Alternative interchange formats #58

Open
goodboy opened this issue Feb 17, 2019 · 5 comments
Labels
enhancement, help wanted

Comments

@goodboy
Owner

goodboy commented Feb 17, 2019

The list I've been meaning to look through/support:

Maybe more?

We'll need to abstract the channel API so that it can accept different stream types.
This work will require coordination with the alternative transport work in #19.
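One way to picture the abstraction described above is a channel that delegates serialization to a pluggable codec. The following is only a minimal sketch: `Codec`, `JSONCodec`, `PickleCodec` and `Channel` are hypothetical names invented for illustration, not this project's actual API, and an in-memory buffer stands in for a real transport.

```python
import json
import pickle
from typing import Any, Protocol


class Codec(Protocol):
    """Hypothetical interface: turn one message into bytes and back."""

    def encode(self, msg: Any) -> bytes: ...

    def decode(self, data: bytes) -> Any: ...


class JSONCodec:
    def encode(self, msg: Any) -> bytes:
        return json.dumps(msg).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


class PickleCodec:
    def encode(self, msg: Any) -> bytes:
        return pickle.dumps(msg)

    def decode(self, data: bytes) -> Any:
        return pickle.loads(data)


class Channel:
    """Hypothetical channel: frames messages, leaves encoding to the codec."""

    def __init__(self, codec: Codec) -> None:
        self._codec = codec
        self._buffer = bytearray()  # in-memory stand-in for a transport

    def send(self, msg: Any) -> None:
        payload = self._codec.encode(msg)
        # 4-byte big-endian length prefix so frames can be split apart again.
        self._buffer += len(payload).to_bytes(4, "big") + payload

    def recv(self) -> Any:
        size = int.from_bytes(self._buffer[:4], "big")
        payload = bytes(self._buffer[4:4 + size])
        del self._buffer[:4 + size]
        return self._codec.decode(payload)


chan = Channel(JSONCodec())
chan.send({"cmd": "ping", "args": [1, 2]})
print(chan.recv())
```

Under this assumption, swapping the interchange format means passing a different codec object; the framing and transport code stay untouched.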

@goodboy added the enhancement and help wanted labels Feb 17, 2019
@goodboy
Owner Author

goodboy commented Mar 1, 2020

Here is an extremely good write-up on the shortcomings of pandas, by its original author, with links to many other great resources.

Apache Arrow very much seems to be a solution to many of the earlier memory-constraint and inter-process ailments of big data with pandas. I haven't dug much into recent developments, but this article seems like a good entry point.

Anyone wanting to take a look at the IPC section in pyarrow might be able to get something cool going quickly!
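As a starting point for the pyarrow IPC idea, here is a small sketch of round-tripping a table through the Arrow IPC streaming format. It assumes the third-party `pyarrow` package is installed (the import is guarded so the snippet still loads without it); `table_to_ipc_bytes` and `ipc_bytes_to_table` are hypothetical helper names, and nothing here is wired into this project's channels.

```python
try:
    import pyarrow as pa  # third-party dependency: pip install pyarrow
except ImportError:  # keep the sketch importable when pyarrow is missing
    pa = None


def table_to_ipc_bytes(table):
    """Serialize a pyarrow.Table into Arrow IPC stream bytes."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def ipc_bytes_to_table(data):
    """Rebuild a pyarrow.Table from Arrow IPC stream bytes."""
    return pa.ipc.open_stream(data).read_all()


if pa is not None:
    ticks = pa.table({"price": [1.5, 2.5], "size": [10, 20]})
    wire = table_to_ipc_bytes(ticks)  # bytes that could be sent to a peer
    print(ipc_bytes_to_table(wire).equals(ticks))
```

The bytes produced this way carry the schema along with the data, so the receiving side does not need to know the column layout in advance.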

@salotz
Collaborator

salotz commented May 29, 2020

You might be interested in this as well: https://github.com/real-logic/aeron

and the binary encoding it uses: https://github.com/real-logic/simple-binary-encoding

It's designed for extremely low latency trading systems. There is a C++ implementation, but no Python interface atm. Not sure exactly what sauce they are using that is better than, say, Cap'n Proto.

All of them are probably useful in different situations, which complicates things.

Blosc, AFAIK, is just a compression library. It's still useful and could be applied transparently (that would require some intelligence about when data is actually moving over I/O), but perhaps it should be a user-level thing. My suspicion is that Arrow already accounts for compression, although I don't know for sure.
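To make the "transparent compression" idea concrete, here is a rough sketch of a framing layer that only compresses payloads when it is likely to pay off. It uses the standard library's `zlib` purely as a stand-in for Blosc, and the one-byte flag, the `pack`/`unpack` names and the 1024-byte threshold are all assumptions made for illustration.

```python
import zlib

RAW = b"\x00"         # flag byte: payload follows uncompressed
COMPRESSED = b"\x01"  # flag byte: payload follows zlib-compressed


def pack(payload: bytes, min_size: int = 1024) -> bytes:
    """Prefix the payload with a flag, compressing only when it helps."""
    if len(payload) >= min_size:
        squeezed = zlib.compress(payload)
        if len(squeezed) < len(payload):
            return COMPRESSED + squeezed
    # Small or incompressible payloads are sent as-is.
    return RAW + payload


def unpack(frame: bytes) -> bytes:
    """Invert pack() by inspecting the leading flag byte."""
    flag, body = frame[:1], frame[1:]
    if flag == COMPRESSED:
        return zlib.decompress(body)
    if flag == RAW:
        return body
    raise ValueError(f"unknown frame flag: {flag!r}")


big = b"tick," * 1000
frame = pack(big)
print(len(big), len(frame), unpack(frame) == big)
```

A real version would also want to skip compression when the peer is reached over shared memory rather than a socket, which is the "intelligence about I/O" part this sketch leaves out.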

@salotz
Collaborator

salotz commented May 29, 2020

For the sake of interestingness, although it's likely of no use to us: https://kaitai.io/

@goodboy
Owner Author

goodboy commented Jun 5, 2020

Also, #8 mentions msgpack-numpy.

While not a new interchange format, it is a system worth comparing against when considering alternatives.

@goodboy
Owner Author

goodboy commented Apr 1, 2021

Interesting historical format: SBE (Simple Binary Encoding), which is (was?) used in financial systems.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

Sounds like it would need to be compared with capnproto - haven't dug into any libs yet.
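To illustrate why "mostly fixed size fields" can be encoded so cheaply, here is a toy fixed-layout message built with Python's `struct` module. This is not the SBE wire format and uses no SBE library; the field set, the `QUOTE` layout and the function names are assumptions chosen only to show the fixed-offset idea.

```python
import struct

# Hypothetical quote message, little-endian, no padding:
#   instrument_id: uint32 | price_ticks: int64 | quantity: uint32 | side: uint8
QUOTE = struct.Struct("<IqIB")
QUANTITY_OFFSET = 4 + 8  # bytes taken by instrument_id and price_ticks


def encode_quote(instrument_id: int, price_ticks: int, quantity: int, side: int) -> bytes:
    """Pack all fields into a fixed 17-byte record."""
    return QUOTE.pack(instrument_id, price_ticks, quantity, side)


def decode_quote(data: bytes) -> tuple:
    """Unpack every field of the record."""
    return QUOTE.unpack(data)


def read_quantity(data: bytes) -> int:
    """Read a single field at its known offset without decoding the rest."""
    return struct.unpack_from("<I", data, QUANTITY_OFFSET)[0]


msg = encode_quote(42, 1012500, 300, 1)
print(QUOTE.size, decode_quote(msg), read_quantity(msg))
```

Because every field sits at a known offset, a reader can pull out one value without parsing the rest of the message. Variable-length strings and blobs break that property, which fits the restriction mentioned in the quote above.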
