tf.strings.unicode_transcode
Stay organized with collections
Save and categorize content based on your preferences.
Transcode the input text from a source encoding to a destination encoding.
tf.strings.unicode_transcode(
input: Annotated[Any, _atypes.String],
input_encoding: str,
output_encoding: str,
errors: str = 'replace',
replacement_char: int = 65533,
replace_control_characters: bool = False,
name=None
) -> Annotated[Any, _atypes.String]
Used in the notebooks
The input is a string tensor of any shape. The output is a string tensor of
the same shape containing the transcoded strings. Output strings are always
valid unicode. If the input contains invalid encoding positions, the
errors
attribute sets the policy for how to deal with them. If the default
error-handling policy is used, invalid formatting will be substituted in the
output by the replacement_char
. If the errors policy is to ignore
, any
invalid encoding positions in the input are skipped and not included in the
output. If it set to strict
then any invalid formatting will result in an
InvalidArgument error.
This operation can be used with output_encoding = input_encoding
to enforce
correct formatting for inputs even if they are already in the desired encoding.
If the input is prefixed by a Byte Order Mark needed to determine encoding
(e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that
BOM will be consumed and not emitted into the output. If the input encoding
is marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is
interpreted as a non-breaking-space and is preserved in the output (including
always for UTF-8).
The end result is that if the input is marked as an explicit endianness the
transcoding is faithful to all codepoints in the source. If it is not marked
with an explicit endianness, the BOM is not considered part of the string itself
but as metadata, and so is not preserved in the output.
Examples:
tf.strings.unicode_transcode(["Hello", "TensorFlow", "2.x"], "UTF-8", "UTF-16-BE")
<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'\x00H\x00e\x00l\x00l\x00o',
b'\x00T\x00e\x00n\x00s\x00o\x00r\x00F\x00l\x00o\x00w',
b'\x002\x00.\x00x'], dtype=object)>
tf.strings.unicode_transcode(["A", "B", "C"], "US ASCII", "UTF-8").numpy()
array([b'A', b'B', b'C'], dtype=object)
Args |
input
|
A Tensor of type string .
The text to be processed. Can have any shape.
|
input_encoding
|
A string .
Text encoding of the input strings. This is any of the encodings supported
by ICU ucnv algorithmic converters. Examples: "UTF-16", "US ASCII", "UTF-8" .
|
output_encoding
|
A string from: "UTF-8", "UTF-16-BE", "UTF-32-BE" .
The unicode encoding to use in the output. Must be one of
"UTF-8", "UTF-16-BE", "UTF-32-BE" . Multi-byte encodings will be big-endian.
|
errors
|
An optional string from: "strict", "replace", "ignore" . Defaults to "replace" .
Error handling policy when there is invalid formatting found in the input.
The value of 'strict' will cause the operation to produce a InvalidArgument
error on any invalid input formatting. A value of 'replace' (the default) will
cause the operation to replace any invalid formatting in the input with the
replacement_char codepoint. A value of 'ignore' will cause the operation to
skip any invalid formatting in the input and produce no corresponding output
character.
|
replacement_char
|
An optional int . Defaults to 65533 .
The replacement character codepoint to be used in place of any invalid
formatting in the input when errors='replace' . Any valid unicode codepoint may
be used. The default value is the default unicode replacement character is
0xFFFD or U+65533.)
Note that for UTF-8, passing a replacement character expressible in 1 byte, such
as ' ', will preserve string alignment to the source since invalid bytes will be
replaced with a 1-byte replacement. For UTF-16-BE and UTF-16-LE, any 1 or 2 byte
replacement character will preserve byte alignment to the source.
|
replace_control_characters
|
An optional bool . Defaults to False .
Whether to replace the C0 control characters (00-1F) with the
replacement_char . Default is false.
|
name
|
A name for the operation (optional).
|
Returns |
A Tensor of type string .
|
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates. Some content is licensed under the numpy license.
Last updated 2024-04-26 UTC.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2024-04-26 UTC."],[],[],null,["# tf.strings.unicode_transcode\n\n\u003cbr /\u003e\n\nTranscode the input text from a source encoding to a destination encoding.\n\n#### View aliases\n\n\n**Compat aliases for migration**\n\nSee\n[Migration guide](https://www.tensorflow.org/guide/migrate) for\nmore details.\n\n[`tf.compat.v1.strings.unicode_transcode`](https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode)\n\n\u003cbr /\u003e\n\n tf.strings.unicode_transcode(\n input: Annotated[Any, _atypes.String],\n input_encoding: str,\n output_encoding: str,\n errors: str = 'replace',\n replacement_char: int = 65533,\n replace_control_characters: bool = False,\n name=None\n ) -\u003e Annotated[Any, _atypes.String]\n\n### Used in the notebooks\n\n| Used in the guide |\n|--------------------------------------------------------------------|\n| - [Unicode strings](https://www.tensorflow.org/text/guide/unicode) |\n\nThe input is a string tensor of any shape. The output is a string tensor of\nthe same shape containing the transcoded strings. Output strings are always\nvalid unicode. If the input contains invalid encoding positions, the\n`errors` attribute sets the policy for how to deal with them. If the default\nerror-handling policy is used, invalid formatting will be substituted in the\noutput by the `replacement_char`. If the errors policy is to `ignore`, any\ninvalid encoding positions in the input are skipped and not included in the\noutput. If it set to `strict` then any invalid formatting will result in an\nInvalidArgument error.\n\nThis operation can be used with `output_encoding = input_encoding` to enforce\ncorrect formatting for inputs even if they are already in the desired encoding.\n\nIf the input is prefixed by a Byte Order Mark needed to determine encoding\n(e.g. if the encoding is UTF-16 and the BOM indicates big-endian), then that\nBOM will be consumed and not emitted into the output. If the input encoding\nis marked with an explicit endianness (e.g. UTF-16-BE), then the BOM is\ninterpreted as a non-breaking-space and is preserved in the output (including\nalways for UTF-8).\n\nThe end result is that if the input is marked as an explicit endianness the\ntranscoding is faithful to all codepoints in the source. If it is not marked\nwith an explicit endianness, the BOM is not considered part of the string itself\nbut as metadata, and so is not preserved in the output.\n\n#### Examples:\n\n tf.strings.unicode_transcode([\"Hello\", \"TensorFlow\", \"2.x\"], \"UTF-8\", \"UTF-16-BE\")\n \u003ctf.Tensor: shape=(3,), dtype=string, numpy=\n array([b'\\x00H\\x00e\\x00l\\x00l\\x00o',\n b'\\x00T\\x00e\\x00n\\x00s\\x00o\\x00r\\x00F\\x00l\\x00o\\x00w',\n b'\\x002\\x00.\\x00x'], dtype=object)\u003e\n tf.strings.unicode_transcode([\"A\", \"B\", \"C\"], \"US ASCII\", \"UTF-8\").numpy()\n array([b'A', b'B', b'C'], dtype=object)\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Args ---- ||\n|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `input` | A `Tensor` of type `string`. The text to be processed. Can have any shape. |\n| `input_encoding` | A `string`. Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: `\"UTF-16\", \"US ASCII\", \"UTF-8\"`. |\n| `output_encoding` | A `string` from: `\"UTF-8\", \"UTF-16-BE\", \"UTF-32-BE\"`. The unicode encoding to use in the output. Must be one of `\"UTF-8\", \"UTF-16-BE\", \"UTF-32-BE\"`. Multi-byte encodings will be big-endian. |\n| `errors` | An optional `string` from: `\"strict\", \"replace\", \"ignore\"`. Defaults to `\"replace\"`. Error handling policy when there is invalid formatting found in the input. The value of 'strict' will cause the operation to produce a InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the `replacement_char` codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character. |\n| `replacement_char` | An optional `int`. Defaults to `65533`. The replacement character codepoint to be used in place of any invalid formatting in the input when `errors='replace'`. Any valid unicode codepoint may be used. The default value is the default unicode replacement character is 0xFFFD or U+65533.) \u003cbr /\u003e Note that for UTF-8, passing a replacement character expressible in 1 byte, such as ' ', will preserve string alignment to the source since invalid bytes will be replaced with a 1-byte replacement. For UTF-16-BE and UTF-16-LE, any 1 or 2 byte replacement character will preserve byte alignment to the source. |\n| `replace_control_characters` | An optional `bool`. Defaults to `False`. Whether to replace the C0 control characters (00-1F) with the `replacement_char`. Default is false. |\n| `name` | A name for the operation (optional). |\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n\u003cbr /\u003e\n\n| Returns ------- ||\n|---|---|\n| A `Tensor` of type `string`. ||\n\n\u003cbr /\u003e"]]