tf.strings.unicode_split
Splits each string in input into a sequence of Unicode code points.
tf.strings.unicode_split(
    input,
    input_encoding,
    errors='replace',
    replacement_char=65533,
    name=None
)
result[i1...iN, j] is the substring of input[i1...iN] that encodes its jth character, when decoded using input_encoding.
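The relationship between input and result can be modelled in plain Python. The following is only a sketch using the standard library, not TensorFlow's implementation. The helper name split_chars is hypothetical, and the sketch assumes valid input in an encoding such as UTF-8 whose encoder adds no byte-order mark.

```python
from typing import List


def split_chars(data: bytes, encoding: str = "utf-8") -> List[bytes]:
    """Return, for each character of `data`, the bytes that encode it.

    Hypothetical helper that mirrors the documented result[j] semantics
    for a single string; it is not part of TensorFlow.
    """
    text = data.decode(encoding)                  # bytes -> characters
    return [ch.encode(encoding) for ch in text]   # each character -> its bytes


encoded = "G\u00f6\u00f6dnight".encode("utf-8")
pieces = split_chars(encoded)
print(pieces)
# [b'G', b'\xc3\xb6', b'\xc3\xb6', b'd', b'n', b'i', b'g', b'h', b't']

# For valid input, concatenating the pieces reproduces the original bytes.
print(b"".join(pieces) == encoded)  # True
```

Each element is a byte substring rather than an integer code point, which is the difference between splitting and decoding.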
Args
input: An N-dimensional, potentially ragged string tensor with shape [D1...DN]. N must be statically known.
input_encoding: String name for the Unicode encoding that should be used to decode each string.
errors: Specifies the response when an input string can't be converted using the indicated encoding. One of:
    'strict': Raise an exception for any illegal substrings.
    'replace': Replace illegal substrings with replacement_char.
    'ignore': Skip illegal substrings.
replacement_char: The replacement code point to be used in place of invalid substrings in input when errors='replace'.
name: A name for the operation (optional).
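Python's own bytes.decode accepts error handlers with the same three names, which makes it a convenient way to see what each errors mode means. This is an analogy using the standard library only. The op's exact behavior may differ, for example in the exception type raised under 'strict' or in how a run of several invalid bytes is grouped.

```python
bad = b"A\xffB"  # 0xff can never appear in valid UTF-8

# 'replace': the invalid byte becomes the replacement character.
replaced = bad.decode("utf-8", errors="replace")
print(replaced == "A\ufffdB")      # True
print(ord("\ufffd"))               # 65533, the default replacement_char

# 'ignore': the invalid byte is dropped.
ignored = bad.decode("utf-8", errors="ignore")
print(ignored)                     # AB

# 'strict': decoding fails instead of producing output.
try:
    bad.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)               # True
```

The default replacement_char of 65533 is the code point U+FFFD, the Unicode replacement character.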
Returns
An (N+1)-dimensional string tensor with shape [D1...DN, (num_chars)]. The returned tensor is a tf.Tensor if input is a scalar, or a tf.RaggedTensor otherwise.
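The last dimension is ragged because the number of characters varies from string to string, and it is independent of the byte length. Below is a plain-Python sketch of that shape for a batch of two UTF-8 strings, using nested lists as a stand-in for tf.RaggedTensor. The variable names are illustrative and nothing here calls TensorFlow.

```python
batch = [s.encode("utf-8") for s in ("G\u00f6\u00f6dnight", "\U0001f60a")]

# One row per input string, one entry per character.
rows = [[ch.encode("utf-8") for ch in s.decode("utf-8")] for s in batch]

num_chars = [len(row) for row in rows]   # length of the ragged dimension per row
num_bytes = [len(s) for s in batch]      # byte length of each input string

print(num_chars)  # [9, 1]
print(num_bytes)  # [11, 4]
```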
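Example:
import tensorflow as tf

# UTF-8 encode two strings: one with two-byte characters, one a four-byte emoji.
input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
tf.strings.unicode_split(input, 'UTF-8').to_list()
# Result:
# [[b'G', b'\xc3\xb6', b'\xc3\xb6', b'd', b'n', b'i', b'g', b'h', b't'],
#  [b'\xf0\x9f\x98\x8a']]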
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates. Some content is licensed under the numpy license.
Last updated 2024-04-26 UTC.