proposal: unicode/utf8: add LastPartialRuneLen #73149

Open
rogpeppe opened this issue Apr 3, 2025 · 6 comments
Labels
LibraryProposal Issues describing a requested change to the Go standard library or x/ libraries, but not to a tool Proposal

@rogpeppe
Contributor

rogpeppe commented Apr 3, 2025

Proposal Details

There are some situations where it's desirable to split a string at some known size but avoid splitting UTF-8 sequences. Examples include truncating a string to a limited length when logging, and doing character-oriented manipulation within an io.Reader.

Currently the easiest way to do that is by traversing from the start of the string, decoding runes until the desired split point is reached, but this is unnecessarily inefficient.

Getting this right is surprisingly tricky.
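
For comparison, the forward traversal described above might look like the following. This is only a sketch: `truncateForward` is a hypothetical name, not part of `unicode/utf8`, and it visits every rune before the split point, which is the cost the proposal wants to avoid.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateForward is a hypothetical helper. It returns the longest prefix
// of s that is at most n bytes long and ends on a rune boundary, found by
// decoding every rune from the start of the string.
func truncateForward(s string, n int) string {
	end := 0
	for end < len(s) {
		_, size := utf8.DecodeRuneInString(s[end:])
		if end+size > n {
			break // the next rune would cross the limit
		}
		end += size
	}
	return s[:end]
}

func main() {
	// "é" occupies two bytes, so a 2-byte limit keeps only "h".
	fmt.Printf("%q\n", truncateForward("héllo", 2))
}
```

The work is proportional to n rather than to the few bytes near the split point.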

I propose we add two new functions to unicode/utf8:

// LastPartialRuneLen returns the number of bytes at the
// end of p that might be the start of a valid UTF-8 byte
// sequence.
func LastPartialRuneLen(p []byte) int

// LastPartialRuneLenInString is like LastPartialRuneLen
// but for strings.
func LastPartialRuneLenInString(s string) int

Here's a possible implementation, very lightly tested as yet:

func LastPartialRune[T ~[]byte | ~string](p T) int {
	end := len(p)
	if end == 0 {
		return 0
	}
	start := end - 1
	lim := max(0, end-utf8.UTFMax)
	for ; start >= lim; start-- {
		b := p[start]
		if b < utf8.RuneSelf {
			// ASCII byte: nothing partial at the end.
			return 0
		}
		if b&0b1100_0000 == 0b1000_0000 {
			// Continuation byte: keep scanning backwards.
			continue
		}
		// Lead byte. The string conversion stands in for a generic
		// DecodeRune, which unicode/utf8 does not provide.
		if r, size := utf8.DecodeRuneInString(string(p[start:])); r != utf8.RuneError || size > 1 {
			return 0
		}
		return end - start
	}
	// Every byte examined was a continuation byte, so no lead
	// byte lies within UTFMax bytes of the end and the tail
	// cannot be a partial rune.
	return 0
}

It might be nice to add it as a generic function that works on []byte and string, but that would need a generic version of DecodeRune too.
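
As an alternative sketch (not the implementation from the comment above), the same check can be written for `[]byte` only, using the existing `utf8.RuneStart` and `utf8.FullRune` functions. The name `lastPartialRuneLen` is hypothetical. Because `FullRune` reports true for invalid sequences as well as complete ones, this version is assumed to treat a trailing invalid lead byte such as 0xFF as "not partial".

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lastPartialRuneLen is a hypothetical, non-generic variant of the
// proposed function. It returns the number of bytes at the end of p
// that form an incomplete, but so far valid, UTF-8 sequence.
func lastPartialRuneLen(p []byte) int {
	end := len(p)
	// A partial sequence is at most UTFMax-1 bytes long.
	lim := max(0, end-(utf8.UTFMax-1))
	for start := end - 1; start >= lim; start-- {
		if !utf8.RuneStart(p[start]) {
			continue // continuation byte: keep scanning backwards
		}
		// FullRune is true for complete and for invalid sequences,
		// so false means a truncated but valid prefix.
		if utf8.FullRune(p[start:]) {
			return 0
		}
		return end - start
	}
	return 0
}

func main() {
	// "\xe2\x82" is the first two bytes of the three-byte encoding of "€".
	fmt.Println(lastPartialRuneLen([]byte("ab\xe2\x82"))) // prints 2
	fmt.Println(lastPartialRuneLen([]byte("ab€")))        // prints 0
}
```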

@gopherbot gopherbot added this to the Proposal milestone Apr 3, 2025
@robpike
Contributor

robpike commented Apr 3, 2025

Here's a different suggestion that seems more generally useful: a function that returns the longest string consisting only of complete rune encodings up to a provided length. The text package could provide another that returns only complete character encodings (it's a much harder problem).

The doc and signature:

// TruncateToRune returns the longest prefix of s, up to a maximum length of n,
// terminating at the boundary after the last complete UTF-8 encoding that fits.
func TruncateToRune(s string, n int) string
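
One way this signature could be implemented is sketched below. The lower-case name `truncateToRune` is hypothetical, and clamping n to the range [0, len(s)] is an assumption made here so that the function cannot panic; the comment above does not specify what happens for out-of-range n.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateToRune is a hypothetical sketch of the suggested function.
// It returns a prefix of s of at most n bytes that does not end in
// the middle of a UTF-8 sequence. The result is a slice of s.
func truncateToRune(s string, n int) string {
	n = min(max(n, 0), len(s)) // assumption: clamp instead of panicking
	s = s[:n]
	// Look back at most UTFMax-1 bytes for a lead byte whose
	// sequence was cut off by the truncation.
	for i := len(s) - 1; i >= 0 && i >= len(s)-(utf8.UTFMax-1); i-- {
		if utf8.RuneStart(s[i]) {
			if !utf8.FullRuneInString(s[i:]) {
				return s[:i]
			}
			break
		}
	}
	return s
}

func main() {
	fmt.Printf("%q\n", truncateToRune("héllo", 2)) // prints "h"
	fmt.Printf("%q\n", truncateToRune("héllo", 3)) // prints "hé"
}
```

Only the last few bytes before the limit are inspected, so the cost does not depend on n.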

@gabyhelp gabyhelp added the LibraryProposal Issues describing a requested change to the Go standard library or x/ libraries, but not to a tool label Apr 3, 2025
@rogpeppe
Contributor Author

rogpeppe commented Apr 3, 2025

So TruncateToRune is this, right?

func TruncateToRune(s string, n int) string {
	s = s[:n] // panics if n > len(s)
	return s[:len(s)-LastPartialRuneLenInString(s)]
}

Yup, I think I like that more. It's definitely a better name, and when I tried out a couple of examples, they turned out to look nicer even when they already had lots of index-oriented logic.

Here's the logic for splitting a slice:

	p0 := TruncateToRune(buf, split)
	p1 := buf[len(p0):]

Here's some logic for truncating a slice passed to a reader and moving the partial bytes to the front of the buffer.

	buf1 := TruncateToRune(buf, len(buf))
	if len(buf1) > 0 && !yield(buf1, nil) {
		return
	}
	copy(buf, buf[len(buf1):])
	buf = buf[:len(buf)-len(buf1)]
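
A fuller, self-contained version of that reader pattern could look like the sketch below. All names (`partialLen`, `readRuneAligned`, `chunks`) are hypothetical, the trailing-bytes check uses the existing `utf8.FullRune` in place of the proposed function, and the choice to flush leftover partial bytes at EOF is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// partialLen reports how many trailing bytes of p form an incomplete
// UTF-8 sequence (hypothetical stand-in for the proposed function).
func partialLen(p []byte) int {
	for i := len(p) - 1; i >= 0 && i >= len(p)-(utf8.UTFMax-1); i-- {
		if utf8.RuneStart(p[i]) {
			if utf8.FullRune(p[i:]) {
				return 0
			}
			return len(p) - i
		}
	}
	return 0
}

// readRuneAligned reads r through a buffer of size bytes and calls yield
// with chunks that do not end in the middle of a UTF-8 sequence.
func readRuneAligned(r io.Reader, size int, yield func([]byte) bool) error {
	size = max(size, utf8.UTFMax) // leave room to complete any partial rune
	buf := make([]byte, size)
	n := 0 // bytes currently held at the front of buf
	for {
		m, err := r.Read(buf[n:])
		n += m
		keep := partialLen(buf[:n])
		if err != nil {
			keep = 0 // assumption: flush leftover bytes on EOF or error
		}
		if n-keep > 0 && !yield(buf[:n-keep]) {
			return nil
		}
		// Move the partial bytes to the front of the buffer.
		n = copy(buf, buf[n-keep:n])
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// chunks collects the yielded chunks of s as strings, for demonstration.
func chunks(s string, size int) []string {
	var out []string
	_ = readRuneAligned(strings.NewReader(s), size, func(b []byte) bool {
		out = append(out, string(b)) // copy: b aliases the read buffer
		return true
	})
	return out
}

func main() {
	// Each "€" is three bytes, so a 4-byte buffer would otherwise split one.
	fmt.Printf("%q\n", chunks("€€", 4))
}
```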

@adonovan
Member

adonovan commented Apr 3, 2025

Is the n parameter necessary? Can the caller just pass s[:n] if they want an n < len(s)?

Also, the wording here at TruncateToRune:

> a function that returns the longest string consisting only of complete rune encodings up to a provided length.

suggests that it passes over the entire string, whereas the original proposal was a local operation at the end of the string. If s is a long buffer containing an incomplete rune encoding (that would decode to U+FFFD) in the middle, then TruncateToRune sounds like it must stop in the middle.

@rogpeppe
Contributor Author

rogpeppe commented Apr 3, 2025

> Is the n parameter necessary? Can the caller just pass s[:n] if they want an n < len(s)?

I was thinking that too. Then it can never panic, which is a nice property to have.

> containing an incomplete rune encoding (that would decode to U+FFFD)

My reading of that was that there are no incomplete rune encodings in the middle of the string, just erroneous encodings, because we generate something for those byte sequences.

@robpike
Contributor

robpike commented Apr 3, 2025

Yes, you could drop the integer count but I think it's a cleaner design if you don't, as it reinforces the idea that its purpose is to make it fit. Otherwise it seems just odd to me. But I'm not adamant.

One thing that should be explicit is that the return value is a slice of the original. There is no conversion to error runes, for instance. There probably needs to be some incredibly fussy documentation about trailing invalid encodings.
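
To make the distinction between incomplete and erroneous encodings concrete, the sketch below classifies a byte sequence with the existing `utf8.FullRuneInString` and `utf8.DecodeRuneInString` functions. The `classify` helper and its labels are hypothetical; they only illustrate the cases that such documentation would need to cover.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical helper that labels the sequence at the
// start of s as "empty", "partial", "invalid", or "complete".
func classify(s string) string {
	if s == "" {
		return "empty"
	}
	if !utf8.FullRuneInString(s) {
		// A valid prefix that more bytes could still complete.
		return "partial"
	}
	if r, size := utf8.DecodeRuneInString(s); r == utf8.RuneError && size == 1 {
		// No additional bytes can make this a valid encoding.
		return "invalid"
	}
	return "complete"
}

func main() {
	for _, s := range []string{
		"\xe2\x82",     // first two bytes of "€"
		"\xff",         // never valid in UTF-8
		"\xe0\x80",     // lead byte followed by an out-of-range second byte
		"\xef\xbf\xbd", // U+FFFD itself, correctly encoded
	} {
		fmt.Printf("%q: %s\n", s, classify(s))
	}
}
```

Under these definitions, only the "partial" case would be trimmed by a function that removes an incomplete trailing sequence; "invalid" bytes would stay in the returned slice unchanged.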
