tf.raw_ops.UnicodeDecode
Decodes each string in input into a sequence of Unicode code points.
    tf.raw_ops.UnicodeDecode(
        input,
        input_encoding,
        errors='replace',
        replacement_char=65533,
        replace_control_characters=False,
        Tsplits=tf.dtypes.int64,
        name=None
    )
The character codepoints for all strings are returned using a single vector char_values, with strings expanded to characters in row-major order.

The row_splits tensor indicates where the codepoints for each input string begin and end within the char_values tensor. In particular, the values for the i-th string (in row-major order) are stored in the slice [row_splits[i]:row_splits[i+1]]. Thus:

- char_values[row_splits[i]+j] is the Unicode codepoint for the j-th character in the i-th string (in row-major order).
- row_splits[i+1] - row_splits[i] is the number of characters in the i-th string (in row-major order).
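The layout described above can be sketched in plain Python. This is only an illustration of the row_splits / char_values indexing, not a call to the op: the helper names decode_to_flat and chars_of are hypothetical, and ord() stands in for the decoding step.

```python
def decode_to_flat(strings):
    """Build (row_splits, char_values) for a list of Python str values.

    Hypothetical helper: mirrors the flattened layout described above,
    using ord() instead of the TensorFlow op.
    """
    row_splits = [0]
    char_values = []
    for s in strings:
        char_values.extend(ord(ch) for ch in s)
        # Each string ends where the flat vector currently ends.
        row_splits.append(len(char_values))
    return row_splits, char_values


def chars_of(i, row_splits, char_values):
    """Return the codepoints of the i-th string via the slice rule."""
    return char_values[row_splits[i]:row_splits[i + 1]]


row_splits, char_values = decode_to_flat(["h\u00e9llo", "", "hi"])
print(row_splits)   # [0, 5, 5, 7]
print(char_values)  # [104, 233, 108, 108, 111, 104, 105]
print(chars_of(2, row_splits, char_values))  # [104, 105]
```

Note that an empty string contributes no codepoints, so its two neighbouring row_splits entries are equal.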
Args

| Argument | Description |
|---|---|
| input | A Tensor of type string. The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values. |
| input_encoding | A string. Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: "UTF-16", "US ASCII", "UTF-8". |
| errors | An optional string from: "strict", "replace", "ignore". Defaults to "replace". Error handling policy when invalid formatting is found in the input. A value of 'strict' causes the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) causes the operation to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' causes the operation to skip any invalid formatting in the input and produce no corresponding output character. |
| replacement_char | An optional int. Defaults to 65533. The replacement character codepoint to be used in place of any invalid formatting in the input when errors='replace'. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (0xFFFD, decimal 65533). |
| replace_control_characters | An optional bool. Defaults to False. Whether to replace the C0 control characters (00-1F) with the replacement_char. |
| Tsplits | An optional tf.DType from: tf.int32, tf.int64. Defaults to tf.int64. The type of the returned row_splits tensor. |
| name | A name for the operation (optional). |
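As a rough illustration of the errors and replacement_char arguments, the sketch below decodes a deliberately malformed UTF-8 string. It assumes TensorFlow 2.x with eager execution enabled; the input bytes and variable names are made up for the example.

```python
import tensorflow as tf

# 0xFF can never appear in well-formed UTF-8, so the middle byte is invalid.
bad = tf.constant([b"a\xffb"])

# errors="replace" is the default; invalid input becomes replacement_char
# (65533, i.e. U+FFFD, unless overridden).
_, replaced = tf.raw_ops.UnicodeDecode(input=bad, input_encoding="UTF-8")

# Custom replacement_char: 63 is the codepoint of "?".
_, question = tf.raw_ops.UnicodeDecode(
    input=bad, input_encoding="UTF-8", errors="replace", replacement_char=63)

# errors="strict" raises instead of substituting.
try:
    tf.raw_ops.UnicodeDecode(
        input=bad, input_encoding="UTF-8", errors="strict")
    strict_failed = False
except tf.errors.InvalidArgumentError:
    strict_failed = True

print(replaced.numpy().tolist())
print(question.numpy().tolist())
print(strict_failed)
```

Raw ops accept keyword arguments only, which is why every argument above is passed by name.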
Returns

A tuple of Tensor objects (row_splits, char_values).

| Output | Description |
|---|---|
| row_splits | A Tensor of type Tsplits. |
| char_values | A Tensor of type int32. |
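A minimal sketch of consuming the two outputs, assuming TensorFlow 2.x with eager execution. The 2x2 input is invented for the example, and rebuilding a ragged tensor with tf.RaggedTensor.from_row_splits is one possible way to regroup the flat values, not something this op does itself.

```python
import tensorflow as tf

# A 2x2 input: the op flattens it in row-major order, so there are
# four strings and therefore five row_splits entries.
strings = tf.constant([["a", "bc"], ["", "def"]])

row_splits, char_values = tf.raw_ops.UnicodeDecode(
    input=strings, input_encoding="UTF-8", Tsplits=tf.int32)

print(row_splits.numpy().tolist())   # [0, 1, 3, 3, 6]
print(char_values.numpy().tolist())  # [97, 98, 99, 100, 101, 102]
print(row_splits.dtype, char_values.dtype)

# Regroup the flat codepoints into one row per (flattened) input string.
ragged = tf.RaggedTensor.from_row_splits(
    values=char_values, row_splits=row_splits)
print(ragged.to_list())  # [[97], [98, 99], [], [100, 101, 102]]
```

The original 2x2 shape is not encoded in the outputs; a caller that needs it has to keep track of the input shape separately.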
Last updated 2024-04-26 UTC.