proposal: unicode/utf8: add LastPartialRuneLen #73149

rogpeppe · 2025-04-03T12:08:00Z

Proposal Details

There are some situations where it's desirable to split a string at some known size but avoid splitting UTF-8 sequences. Examples include truncating a string to a limited length when logging, and doing character-oriented manipulation within an io.Reader.

Currently the easiest way to do that is to traverse the string from the start, decoding runes until the desired split point is reached, but that takes time proportional to the split offset when only the last few bytes before it need to be examined.

Getting the backward-looking version right is surprisingly tricky.
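
For comparison, the forward traversal described above might look like the following sketch. The name truncateForward is hypothetical and not part of unicode/utf8. It decodes every rune before the split point, so its cost grows with n.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateForward is a hypothetical helper (not a standard library
// function). It returns the longest prefix of s that is at most n
// bytes long and does not end inside a rune of s, found by decoding
// from the start of the string.
func truncateForward(s string, n int) string {
	if n >= len(s) {
		return s
	}
	end := 0
	for end < n {
		// Invalid bytes decode as a width-1 error rune, so size >= 1
		// and the loop always makes progress.
		_, size := utf8.DecodeRuneInString(s[end:])
		if end+size > n {
			break // the next rune would straddle the split point
		}
		end += size
	}
	return s[:end]
}

func main() {
	fmt.Printf("%q\n", truncateForward("h\u00e9llo", 2)) // "h": the 2-byte é does not fit
	fmt.Printf("%q\n", truncateForward("h\u00e9llo", 3)) // "hé"
}
```

Because this version sees the bytes after the split point, it never has to guess whether a trailing sequence is incomplete. A backward-looking version sees only the bytes before the split, which is where the difficulty lies.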

I propose we add two new functions to unicode/utf8:

// LastPartialRuneLen returns the number of bytes at the
// end of p that might be the start of a valid UTF-8 byte
// sequence.
func LastPartialRuneLen(p []byte) int

// LastPartialRuneLenInString is like LastPartialRuneLen
// but for strings.
func LastPartialRuneLenInString(s string) int

Here's a possible implementation, very lightly tested as yet:

func LastPartialRune[T ~[]byte|~string](p T) int {
	end := len(p)
	if end == 0 {
		return 0
	}
	start := end - 1
	lim := max(0, end-utf8.UTFMax)
	for ; start >= lim; start-- {
		r := p[start]
		if r < utf8.RuneSelf {
			return 0
		}
		if r&0b1100_0000 == 0b1000_0000 {
			// continuation byte.
			continue
		}
		if r, size := utf8.DecodeRune(p[start:]); r != utf8.RuneError || size > 1 {
			return 0
		}
		return end - start
	}
	// Every byte examined was a continuation byte, so there is
	// no lead byte within the last utf8.UTFMax bytes and the
	// tail cannot be the start of a valid sequence.
	return 0
}

It might be nice to add it as a generic function that works on []byte and string, but that would need a generic version of DecodeRune too.
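
As an alternative sketch (not the implementation proposed above), a non-generic string variant can be built on the existing utf8.RuneStart and utf8.FullRuneInString functions. The lowercase name lastPartialRuneLenInString is a stand-in for the proposed API. One assumption to note: under this sketch, a byte that can never begin a valid sequence (such as 0xFF) counts as erroneous rather than partial and yields 0.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lastPartialRuneLenInString is a hypothetical stand-in for the
// proposed function. It returns the number of bytes at the end of s
// that form an incomplete prefix of a UTF-8 encoding.
func lastPartialRuneLenInString(s string) int {
	end := len(s)
	// An incomplete encoding is at most utf8.UTFMax-1 bytes long.
	lim := max(0, end-(utf8.UTFMax-1))
	for start := end - 1; start >= lim; start-- {
		if !utf8.RuneStart(s[start]) {
			continue // continuation byte: keep looking for a lead byte
		}
		// FullRuneInString reports true for complete and for invalid
		// encodings, and false only for an incomplete prefix.
		if utf8.FullRuneInString(s[start:]) {
			return 0
		}
		return end - start
	}
	return 0
}

func main() {
	for _, s := range []string{
		"abc",          // 0: ends with ASCII
		"a\u20ac",      // 0: ends with a complete 3-byte rune
		"a\xe2\x82",    // 2: first two bytes of a 3-byte rune
		"\xf0\x9f\x98", // 3: first three bytes of a 4-byte rune
		"a\xff",        // 0: 0xFF is invalid, not incomplete
	} {
		fmt.Printf("%q -> %d\n", s, lastPartialRuneLenInString(s))
	}
}
```

Only the last three bytes are ever inspected, so the cost does not depend on the length of s.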

robpike · 2025-04-03T12:22:35Z

Here's a different suggestion that seems more generally useful: a function that returns the longest string consisting only of complete rune encodings up to a provided length. The text package could provide another that returns only complete character encodings (it's a much harder problem).

The doc and signature:

// TruncateToRune returns the longest prefix of s, up to a maximum length of n,
// terminating at the boundary after the last complete UTF-8 encoding that fits.
func TruncateToRune(s string, n int) string

rogpeppe · 2025-04-03T12:59:53Z

So TruncateToRune is this, right?

func TruncateToRune(s string, n int) string {
	s = s[:n]
	return s[:len(s)-LastPartialRuneLenInString(s)]
}

Yup, I think I like that more. It's definitely a better name, and when I tried out a couple of examples, they turned out to look nicer even when they already had lots of index-oriented logic.

Here's the logic for splitting a slice:

	p0 := TruncateToRune(buf, split)
	p1 := buf[len(p0):]

Here's some logic for truncating a slice passed to a reader and moving the partial bytes to the front of the buffer.

	buf1 := TruncateToRune(buf, len(buf))
	if len(buf1) > 0 && !yield(buf1, nil) {
		return
	}
	copy(buf, buf[len(buf1):])
	buf = buf[:len(buf)-len(buf1)]

adonovan · 2025-04-03T14:50:57Z

Is the n parameter necessary? Can the caller just pass s[:n] if they want an n < len(s)?

Also, the wording here at TruncateToRune:

> a function that returns the longest string consisting only of complete rune encodings up to a provided length.

suggests that it passes over the entire string, whereas the original proposal was a local operation at the end of the string. If s is a long buffer containing an incomplete rune encoding (that would decode to U+FFFD) in the middle, then TruncateToRune sounds like it must stop in the middle.

rogpeppe · 2025-04-03T15:04:29Z

> Is the n parameter necessary? Can the caller just pass s[:n] if they want an n < len(s)?

I was thinking that too. Then it can never panic, which is a nice property to have.

> containing an incomplete rune encoding (that would decode to U+FFFD)

My reading of that was that there are no incomplete rune encodings in the middle of the string, only erroneous ones, because decoding always produces something (U+FFFD) for those byte sequences.
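
The distinction between incomplete and erroneous encodings can be checked with functions that already exist in unicode/utf8. In this sketch, classify is a hypothetical helper that looks only at the first encoding in its argument.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical helper that describes the first
// encoding in s using only existing unicode/utf8 functions.
func classify(s string) string {
	if s == "" {
		return "empty"
	}
	// FullRuneInString is false only when s is a proper prefix of
	// what could still become a valid encoding.
	if !utf8.FullRuneInString(s) {
		return "incomplete"
	}
	// A width-1 RuneError means the bytes are invalid; a literal
	// U+FFFD in the input decodes with width 3 instead.
	if r, size := utf8.DecodeRuneInString(s); r == utf8.RuneError && size == 1 {
		return "erroneous"
	}
	return "complete"
}

func main() {
	fmt.Println(classify("\u20ac"))     // complete
	fmt.Println(classify("\xe2\x82"))   // incomplete: more bytes could finish it
	fmt.Println(classify("\xe2\x82x"))  // erroneous: followed by a non-continuation byte
	fmt.Println(classify("\xffabc"))    // erroneous: 0xFF never appears in UTF-8
}
```

Once any further byte follows a truncated sequence, the decoder treats it as an error of width 1, so only the very end of a string can hold an encoding that is still incomplete.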

robpike · 2025-04-03T23:27:06Z

Yes, you could drop the integer count but I think it's a cleaner design if you don't, as it reinforces the idea that its purpose is to make it fit. Otherwise it seems just odd to me. But I'm not adamant.

One thing that should be explicit is that the return value is a slice of the original. There is no conversion to error runes, for instance. There probably needs to be some incredibly fussy documentation about trailing invalid encodings.
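
To make the trailing cases concrete, here is a sketch of the one-argument variant discussed above, under stated assumptions: the name truncateToRune is hypothetical, and the choice to keep trailing invalid bytes (rather than drop them) is one possible reading, not something the thread settles. The result is always a slice expression on s, so no bytes are rewritten.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateToRune is a hypothetical one-argument variant. It returns s
// without any incomplete UTF-8 encoding at its end. Invalid bytes are
// left in place; nothing is converted to U+FFFD.
func truncateToRune(s string) string {
	// An incomplete encoding is at most utf8.UTFMax-1 bytes long.
	lim := max(0, len(s)-(utf8.UTFMax-1))
	for i := len(s) - 1; i >= lim; i-- {
		if utf8.RuneStart(s[i]) {
			if !utf8.FullRuneInString(s[i:]) {
				return s[:i] // drop the incomplete tail
			}
			break // complete or invalid: keep everything
		}
	}
	return s
}

func main() {
	for _, s := range []string{
		"ab\xe2\x82",       // incomplete tail: dropped
		"ab\xff",           // invalid byte: kept
		"ab\x80",           // stray continuation byte: kept
		"\x80\x80\x80\x80", // no lead byte at all: kept
	} {
		fmt.Printf("%q -> %q\n", s, truncateToRune(s))
	}
}
```

Since the function never indexes beyond len(s), it cannot panic on any input, including the empty string.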

rogpeppe added the Proposal label Apr 3, 2025

gopherbot added this to the Proposal milestone Apr 3, 2025

gabyhelp added the LibraryProposal label Apr 3, 2025
