proposal: unicode/utf8: add LastPartialRuneLen #73149
Comments
Here's a different suggestion that seems more generally useful: a function that returns the longest string consisting only of complete rune encodings up to a provided length. The text package could provide another that returns only complete character encodings (it's a much harder problem). The doc and signature:
So
Yup, I think I like that more. It's definitely a better name, and when I tried out a couple of examples, they turned out to look nicer even when they already had lots of index-oriented logic. Here's the logic for splitting a slice:
Here's some logic for truncating a slice passed to a reader and moving the partial bytes to the front of the buffer.
Is the n parameter necessary? Can the caller just pass s[:n] if they want an n < len(s)? Also, the wording here at TruncateToRune:
suggests that it passes over the entire string, whereas the original proposal was a local operation at the end of the string. If s is a long buffer containing an incomplete rune encoding (that would decode to U+FFFD) in the middle, then TruncateToRune sounds like it must stop in the middle.
I was thinking that too. Then it can never panic, which is a nice property to have.
My reading of that was that there are no incomplete rune encodings in the middle of the string, just erroneous encodings,
Yes, you could drop the integer count but I think it's a cleaner design if you don't, as it reinforces the idea that its purpose is to make it fit. Otherwise it seems just odd to me. But I'm not adamant. One thing that should be explicit is that the return value is a slice of the original. There is no conversion to error runes, for instance. There probably needs to be some incredibly fussy documentation about trailing invalid encodings.
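To make the properties discussed above concrete, here is a minimal sketch of what a TruncateToRune-style helper might look like. It is not the signature or doc comment posted in the thread (those are not reproduced here): the lowercase name truncateToRune, the (s string, n int) parameters, and the clamping of n are all assumptions made for illustration. It shows a purely local operation at the end of s[:n] that returns a slice of the original and leaves invalid bytes untouched.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateToRune is a hypothetical helper, not part of unicode/utf8.
// It returns a prefix of s that is at most n bytes long and does not end
// with an incomplete rune encoding. Only the last few bytes of s[:n] are
// inspected; invalid bytes elsewhere are returned unchanged.
func truncateToRune(s string, n int) string {
	// Assumption: clamp n instead of panicking on out-of-range values.
	if n < 0 {
		n = 0
	}
	if n > len(s) {
		n = len(s)
	}
	s = s[:n]
	// An incomplete encoding is at most utf8.UTFMax-1 bytes long, so look
	// back that far for the byte that starts the final encoding.
	for i := 1; i < utf8.UTFMax && i <= len(s); i++ {
		start := len(s) - i
		if utf8.RuneStart(s[start]) {
			// FullRuneInString reports true for complete and for invalid
			// encodings, so only a genuinely incomplete tail is dropped.
			if !utf8.FullRuneInString(s[start:]) {
				return s[:start]
			}
			break
		}
	}
	return s
}

func main() {
	fmt.Printf("%q\n", truncateToRune("héllo", 2))   // "h": the 2-byte é does not fit
	fmt.Printf("%q\n", truncateToRune("héllo", 3))   // "hé"
	fmt.Printf("%q\n", truncateToRune("a\xffb", 2)) // "a\xff": invalid byte kept as-is
}
```

Because the check is local, an incomplete encoding in the middle of the string (for example "a\xe2b") is passed through unchanged; whether that is the desired behavior is exactly the documentation question raised above.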
Proposal Details
There are some situations where it's desirable to split a string at some known size but avoid splitting UTF-8 sequences. Examples include truncating a string to a limited length when logging, and doing character-oriented manipulation within an io.Reader.
Currently the easiest way to do that is by traversing from the start of the string, decoding runes until the desired split point is reached, but this is unnecessarily inefficient.
Getting this right is surprisingly tricky.
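For reference, the traverse-from-the-start approach described above can be sketched roughly as follows. The function name truncateFromStart is hypothetical and chosen only for this illustration; it uses nothing beyond utf8.DecodeRuneInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateFromStart returns the longest prefix of s that is at most n bytes
// long and does not cut a rune encoding in half. It decodes every rune from
// the beginning, so the cost grows with n even though only the bytes near
// the split point matter.
func truncateFromStart(s string, n int) string {
	if n >= len(s) {
		return s
	}
	i := 0
	for i < n {
		// Invalid bytes decode as width-1 error runes, so i always advances.
		_, size := utf8.DecodeRuneInString(s[i:])
		if i+size > n {
			break // the next rune would cross the split point
		}
		i += size
	}
	return s[:i]
}

func main() {
	fmt.Printf("%q\n", truncateFromStart("héllo", 2)) // "h"
	fmt.Printf("%q\n", truncateFromStart("héllo", 3)) // "hé"
}
```

The loop is simple to get right, but it touches every byte before the split point, which is the inefficiency the proposal aims to avoid.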
I propose we add two new functions to unicode/utf8:
Here's a possible implementation, very lightly tested as yet:
It might be nice to add it as a generic function that works on []byte and string, but that would need a generic version of DecodeRune too.