proposal: io: add Seq for efficient, zero-copy I/O #73154
Comments
Drive-by observation: one implication of using an iter.Seq2-style iterator is that someone could write

```go
for data := range io.SeqFromReader(r, 4096) {
	// ...
}
```

and not realize that there's an error that's being ignored. I don't know whether this is a problem, but something to consider.
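For comparison, a sketch of the loop with the error actually handled (assuming the two-value form proposed here; process is a stand-in for real work):

```go
for data, err := range io.SeqFromReader(r, 4096) {
	if err != nil {
		return err // silently dropped if the single-variable form above is used
	}
	process(data)
}
```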
Yes, that's an issue with two-value iterators in general.
Just a note on the PoC. It should yield with the slice capacity explicitly set (e.g. buf[:n:n]), so that a consumer can't append to the yielded slice and scribble on the rest of the internal buffer.
This is potentially nice for certain things, but I think that the documentation should note that if you need to read in a pull manner then you should use a regular io.Reader instead. Here are some results from benchmarking the two approaches:
Here's the code that was used to test: https://gist.github.com/DeedleFake/b931c0393e4385dd39118ea8088a5729
Absolutely! That was why I included that final benchmark. This is no panacea, just another tool in the toolbox. I suspect that many of the APIs that currently take a Writer as argument and return a Writer could be straightforwardly and advantageously rewritten as functions from Seq to Seq.
I've just realised that this function should probably be part of the API too:
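(The code block for this comment is missing from this extract; given the next sentence, it was plausibly a Seq-to-Writer copy helper along these lines. The name and signature are guesses, with Seq standing for the proposed byte-slice iterator:)

```go
// Hypothetical reconstruction; the actual function wasn't preserved.
// CopySeq writes each chunk of seq to dst, returning the total number of
// bytes written and the first error from either side.
func CopySeq(dst io.Writer, seq Seq) (int64, error) {
	var n int64
	for data, err := range seq {
		if err != nil {
			return n, err
		}
		nw, err := dst.Write(data)
		n += int64(nw)
		if err != nil {
			return n, err
		}
	}
	return n, nil
}
```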
That would make it easy to efficiently bridge between Seq-based and Writer-based code.
Ah, I somehow missed that that was a benchmark for a pull iterator. Sorry about the duplication.
nvm, I think I misunderstood the proposed semantics.
I don't really agree on this point. For one, [...]. That same problem applies to composing the [...]. As for efficiency, could we use the same underlying coroutine switches to make io.Pipe comparably efficient?
Yes, those are fair points. What you're talking about there is a pull-based API, where the caller can decide how much they want and when to acquire it. This Seq API is, like all iterators, fundamentally a push-based API, but phrased in terms of iterators, which makes it somewhat more convenient to pull the values that are being pushed. That's why we have iter.Pull and the proposed ReaderFromSeq bridge.
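For instance, a sketch of pulling from the push-style Seq with iter.Pull2 (assuming Seq is, or is an alias for, iter.Seq2[[]byte, error], and using the PoC's ioseq names; process is a stand-in):

```go
func consume(r io.Reader) error {
	next, stop := iter.Pull2(ioseq.SeqFromReader(r, 4096))
	defer stop()
	for {
		data, err, ok := next()
		if !ok {
			return nil // sequence finished
		}
		if err != nil {
			return err
		}
		process(data) // data is only valid until the next call to next
	}
}
```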
Yup, there are certain fundamentals here that are unavoidable. Some abstraction combinations require a buffer, some don't. Somewhat related, here's an interesting transform I discovered earlier:
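(The code block itself didn't survive extraction; a plausible reconstruction, based on the WriterFuncToSeq used later in the thread and assuming Seq is the proposed two-value byte iterator:)

```go
package ioseq

import (
	"errors"
	"io"
	"iter"
)

// Seq is assumed here to be the proposed two-value byte iterator.
type Seq = iter.Seq2[[]byte, error]

// writerFunc adapts a plain function to io.Writer.
type writerFunc func([]byte) (int, error)

func (f writerFunc) Write(p []byte) (int, error) { return f(p) }

var errConsumerStopped = errors.New("ioseq: consumer stopped iterating")

// WriterFuncToSeq turns a writer-to-writer function such as gzip.NewWriter
// into a seq-to-seq function: no pipe, no goroutine, no extra copy.
func WriterFuncToSeq(f func(io.Writer) io.WriteCloser) func(Seq) Seq {
	return func(in Seq) Seq {
		return func(yield func([]byte, error) bool) {
			stopped := false
			// The Writer handed to f simply forwards each Write to yield.
			w := f(writerFunc(func(data []byte) (int, error) {
				if !yield(data, nil) {
					stopped = true
					return 0, errConsumerStopped
				}
				return len(data), nil
			}))
			for data, err := range in {
				if err != nil {
					if !stopped {
						yield(nil, err)
					}
					return
				}
				if _, err := w.Write(data); err != nil {
					if !stopped {
						yield(nil, err)
					}
					return
				}
			}
			if err := w.Close(); err != nil && !stopped {
				yield(nil, err)
			}
		}
	}
}
```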
That is, we can rewrite any writer-to-writer function as a seq-to-seq function. This requires no buffer, data copying, or coroutines. (Aside: I'm looking for a good name for this function.) That means my previous assertion that "many of the APIs that currently take a Writer as argument and return a Writer could be straightforwardly and advantageously rewritten as seq-to-seq functions" holds up.

Essentially what's going on under the hood there is that we're just passing a Writer implementation that invokes a callback function when Write is called, nothing special. But the syntax sugar and the Pull support make it something a bit more, I think.
I'm pretty sure that's impossible.
I want to claim a small amount of partial credit for nerd-sniping Roger into this weird idea. A common example of this is if you want to have gzipped content in the form of an io.Reader, say as an HTTP request body.

The normal way of dealing with this mismatch might look something like:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	pr, pw := io.Pipe()
	go func() {
		zw := gzip.NewWriter(pw)
		if _, err := io.Copy(zw, os.Stdin); err != nil {
			pw.CloseWithError(err)
			return
		}
		if err := zw.Close(); err != nil {
			pw.CloseWithError(err)
			return
		}
		pw.Close()
	}()
	resp, err := http.Post("http://example.com", "application/gzip", pr)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status)
}
```
You can instead compose a bunch of things from this proposal into:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"os"

	"github.com/rogpeppe/ioseq" // the PoC package linked below
)

func main() {
	in := ioseq.SeqFromReader(os.Stdin, 1024*32) // Same buffer size as io.Copy
	cb := ioseq.WriterFuncToSeq(func(w io.Writer) io.WriteCloser {
		return gzip.NewWriter(w)
	})
	body := ioseq.ReaderFromSeq(cb(in))
	resp, err := http.Post("http://example.com", "application/gzip", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status)
}
```

Even with something CPU-intensive like gzip, I see ~10% speedup using roger's zany package over the io.Pipe version.
If we make WriterFuncToSeq generic on the return type, then it becomes a bit more ergonomic. A bunch of functions in the stdlib (including gzip.NewWriter) satisfy func(io.Writer) W for a concrete W that implements io.WriteCloser. So the boilerplate above can be a bit shorter:
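(The code for this comment was lost in extraction; a sketch of what the generic form and the shortened call site plausibly looked like — the exact shape is an assumption:)

```go
// Generic on the concrete writer type, so gzip.NewWriter (which returns
// *gzip.Writer rather than io.WriteCloser) can be passed directly.
func WriterFuncToSeq[W io.WriteCloser](f func(io.Writer) W) func(Seq) Seq

// Call site from the earlier example, without the wrapper closure:
//	cb := ioseq.WriterFuncToSeq(gzip.NewWriter)
//	body := ioseq.ReaderFromSeq(cb(ioseq.SeqFromReader(os.Stdin, 1024*32)))
```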
You could even encapsulate the above into a function, I guess:
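(The code block is missing here, but a later comment quotes it verbatim, so it can be restored:)

```go
// PipeThrough returns a reader that pipes the content from r through f.
func PipeThrough[W io.WriteCloser](r io.Reader, f func(io.Writer) W) io.Reader {
	return ReaderFromSeq(WriterFuncToSeq(f)(SeqFromReader(r, 1024*32)))
}
```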
It should be better now that I've realised that it makes sense to implement io.WriterTo on the reader side.
FYI: Here is the implementation for a "concurrent" Push iterator, which is the opposite of iter.Pull.
That's interesting - I hadn't seen that. Thanks. However, do you see that as directly relevant here? AFAICS that primitive is targeted at more general coroutine-oriented APIs, and this doesn't seem like it would benefit, but I'm probably missing something.
I closely follow the Go proposal process and try to read and understand all proposals and design documents. I also like to read the Go code of different projects I use. Normally this is no problem: I can easily understand everything immediately and have no problem wrapping my head around actual code. And I love this about Go. It's probably the main reason I use it.

But this proposal is different, at least for me. I feel like this is the first time in a long time (months? years?) that I actually had to take some code and stare at it for what felt like minutes to fully understand what it does and why we would need it. When trying out some of the code I actually ended up inlining some of it, as that made it far easier for me to understand what was happening, even if the result was "much longer". Still, even then I never found the result easy to understand. The example from rogpeppe here is a good one:

```go
// PipeThrough returns a reader that pipes the content from r through f.
func PipeThrough[W io.WriteCloser](r io.Reader, f func(io.Writer) W) io.Reader {
	return ReaderFromSeq(WriterFuncToSeq(f)(SeqFromReader(r, 1024*32)))
}
```

I understand what this does and that it can be quite useful, but it still took me a while to understand the implementation. Part of this is probably because I find the function names (ReaderFromSeq, WriterFuncToSeq and SeqFromReader) hard to understand, especially when combining the results of all three of them.

And that is my complaint here and the reason I gave a thumbs down: I feel like this is too much complexity for what (in my opinion) little benefit it provides. I would really dislike seeing this not only be added to the stdlib, but potentially even seeing the complexity find its way across the stdlib. Given that one can convert from the interfaces to sequences, maybe this could be implemented outside the stdlib? Or at least in a different package, with other stdlib packages continuing to use the interfaces as much as possible.

I also think it is important to mention that io.Reader and io.Writer are probably two of the most used and most important interfaces in Go. Changing how we work with them, even without looking at the exact proposal, already feels like it could easily lead to a bigger change (disruption?) than, for example, the introduction of the context package.
In an alternative shape, Seq could be an interface in the style of bufio.Scanner:

```go
type Seq interface {
	Bytes() []byte
	Text() string
	Err() error
	Advance() func(func() bool) // iter.Seq0
	// No Close() error method because Advance auto-closes when iteration finishes.
}
```

Usage would look like:

```go
for range seq.Advance() {
	if seq.Err() != nil {
		// do something, break, return
	}
	process(seq.Bytes())
}
```
@earthboundkid that seems like a significant increase in complexity - particularly given that it doesn't really protect from reusing or keeping each yielded slice beyond its iteration anyway.
I think we don't really know how to use fallible sequences in Go yet. For my file walking iterator package, I ended up with what I think is an interesting API. I don't know if that's the right approach to managing fallible iterators in Go, but it's one I'm experimenting with. I think you could probably do something similar with the Seq proposed here.
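(The description of that API is garbled in this extract; one plausible shape for a "check the error after the loop" design is sketched below, with hypothetical names, not the actual package:)

```go
package walk

import (
	"io/fs"
	"iter"
	"path/filepath"
)

// Walker yields file paths; any error is recorded and reported by Err
// after iteration finishes.
type Walker struct {
	root string
	err  error
}

func New(root string) *Walker { return &Walker{root: root} }

func (w *Walker) All() iter.Seq[string] {
	return func(yield func(string) bool) {
		w.err = filepath.WalkDir(w.root, func(path string, d fs.DirEntry, err error) error {
			if err != nil {
				return err
			}
			if !yield(path) {
				return fs.SkipAll // consumer broke out of the loop; not an error
			}
			return nil
		})
	}
}

// Err reports the first error encountered; valid once the loop completes.
func (w *Walker) Err() error { return w.err }
```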
I'm concerned that there are currently various proposals all independently inventing patterns for fallible sequences as a side-effect of proposing something else, and that the first one that gets accepted would effectively make a decision for the whole standard library as to what the idiomatic pattern is. The current support for infallible sequences was proposed and discussed separately from any single specific application of it, and I think it would be best to follow a similar approach for fallible sequences so that the proposal can consider the many different potential applications of that pattern at once and hopefully choose a compromise that is broadly applicable. I don't mean to say that the new capabilities in this proposal are not worthwhile -- it does seem like a nice improvement -- but accepting it seems to imply also accepting its two-value iterator as the idiomatic pattern for fallible sequences. (I will be explicit that I'm still somewhat unconvinced that we should be trying to apply range-over-func to fallible sequences at all, vs. a more explicit pattern using normal function/method calls, but for me that specific preference is overridden by the desire for there to be a single idiomatic design pattern -- even if it does involve range-over-func.)
[Edited to add more benchmarks]
[Edited to add reservations about iterator use]
Proposal Details
This proposal adds two new functions and one new type to the io package. These enable more efficient and ergonomic I/O pipelines by leveraging Go's iterator functionality.
Background
When io.Reader was introduced into Go, its API was designed following the time-honored API of the Unix read system call: the caller provides a buffer and the callee copies data into that buffer.

However, its API is notoriously hard to use. Witness the long doc comment. I'm sure most of us have struggled to write io.Reader implementations and been caught out by gotchas when using it.
It's also somewhat inefficient. Although it amortizes byte-by-byte streaming cost, it's in general not possible to turn an io.Writer (convenient for producing data) into an io.Reader without copying data, because once the data has been copied there is no clear way of signalling back to the producer that it's done with. So io.Pipe is as good as we can get, which involves goroutines and copying.
However, now that iterators are a thing, there is a potential alternative. I propose that we add a way to bridge from an iterator to io.Reader and vice versa.

Proposal
I propose adding one type and two functions to the io package:

- Seq defines a standard type for these sequences.
- SeqFromReader returns a Seq of byte slices, yielding chunks of data read from the Reader using an internal buffer.
- ReaderFromSeq wraps a Seq as a ReadCloser, implementing Reader semantics.

The semantics of Seq slices are similar to those of Reader.Read buffers: callers must not retain or mutate slices outside the current iteration. The sequence terminates at the first non-nil error, including EOF.
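Pieced together from the prose above and the usage in the comments, the proposed surface is roughly the following sketch (the exact definition of Seq and the buffer-size parameter are assumptions):

```go
package io

// Seq is a sequence of byte-slice chunks; iteration stops at the first
// non-nil error (including EOF).
type Seq = iter.Seq2[[]byte, error]

// SeqFromReader returns a Seq yielding chunks read from r into an
// internal buffer of the given size.
func SeqFromReader(r Reader, bufSize int) Seq

// ReaderFromSeq wraps seq as a ReadCloser.
func ReaderFromSeq(seq Seq) ReadCloser
```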
Discussion
In general, no allocation or copying needs to take place when using this API. Byte slices can be passed by reference directly between producer and consumer, and the strong ownership conventions of an iterator make this a reasonable approach. The coroutine-based iter.Pull function makes ReaderFromSeq considerably more efficient than io.Pipe while providing many of the same advantages. The API is also arguably easier to deal with than Read.
The fact that we can write a bridge between Seq and Reader means that this new abstraction could fit nicely into the existing Go ecosystem.
It might also be useful to add another abstraction to make it easier to use Writer-oriented APIs with a generator. Perhaps something like this:
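(The code block that followed is missing; presumably it sketched the WriterFuncToSeq adaptor discussed in the comments above, roughly:)

```go
// Assumed shape, matching the usage in the comments above.
func WriterFuncToSeq(f func(w Writer) WriteCloser) func(Seq) Seq
```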
Performance
I tried a few benchmarks to get an idea of the performance:
The Pipe benchmarks measure the performance when using SeqReader as a substitute for io.Pipe, with various workloads between the source and the sink (a base64 encoder and a no-op pass-through). This is to demonstrate how Seq can be used to improve the performance of some existing tasks.
The Reader benchmarks measure performance when using a Seq vs an io.Reader: the no-op does nothing at all on the producer or consumer side; FillIndex just fills the buffer and runs bytes.Index on it (a fairly minimal workload). This is to demonstrate that Seq is a very low overhead primitive for producing readable data streams. Writing this benchmark, it was immediately obvious that writing the io.Reader benchmark code was harder: I had to create an auxiliary struct type with fields to keep track of iterations, rather than just write a simple for loop. This is the classic advantage of storing data in control flow. So if we pay for this abstraction with a nanosecond of overhead, that seems well worth the cost.
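To illustrate the "data in control flow" point, here is a sketch (not the benchmark code itself) of the same trivial generator written both ways:

```go
// As a Seq: iteration state is just a loop variable.
func zeros(n, bufSize int) Seq {
	return func(yield func([]byte, error) bool) {
		buf := make([]byte, bufSize)
		for i := 0; i < n; i++ {
			if !yield(buf, nil) {
				return
			}
		}
	}
}

// As an io.Reader: the same state has to live in an auxiliary struct.
type zeroReader struct {
	remaining int // chunks left to produce
	bufSize   int
}

func (r *zeroReader) Read(p []byte) (int, error) {
	if r.remaining == 0 {
		return 0, io.EOF
	}
	r.remaining--
	n := min(len(p), r.bufSize)
	clear(p[:n])
	return n, nil
}
```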
However, it's not all roses. Seq is very convenient when we want to write push-oriented code, but if callers are usually going to convert it to a reader with SeqReader and it's not too hard to write the code in a pull-based style, we should do so:
Note that the overhead here is per-iteration, not per-byte, so as the buffer size grows and the per-iteration work grows proportionately, the relative overhead will shrink.
PoC code (including the benchmark code) is available at https://github.com/rogpeppe/ioseq.
Reservations about using iterators in this way
My biggest reservation about this proposal is the way that it uses iterators. In general, iterators work best when the values produced are independent of the loop. This enables us to use functions such as slices.Collect and maps.Insert, which take values received from the iterator and store them elsewhere. This proposal uses iterators differently: the values produced are emphatically intended not to escape the loop in this way.

That said, iterators of the general form iter.Seq2[T, error] aren't that useful with collectors of this type anyway, because of the final error value.

I'm very much in two minds on this. On the one hand, the above properties are definitely nice to have. On the other hand, there are many pre-Go-iterator-support iterator-like APIs which it seems to me would be nicely phrased as for-range loops. But those iterator-like APIs often return values which are only valid for the extent of one iteration. bufio.Scanner.Bytes is a good example of this. In fact it feels like a great example, because that's essentially exactly what Seq is doing. You can even use Scanner to mimic what SeqFromReader does.
does.Another example of this dilemma is #64341.
Modulo explicitly "escaping" functions like slices.Collect, iterators do seem like a good fit here. The scope of a given iterator variable is well-defined, much like the argument to a function (which it is, of course, under the hood). And in general we don't seem to have a problem with saying that it's not OK to store the argument to a function beyond the span of that function's invocation.

So ISTM that we need to decide if it's OK to "bless" this kind of pattern. If it's not OK, then this proposal should be declined.