tf.strings.ngrams
Create a tensor of n-grams based on data.
tf.strings.ngrams(
    data,
    ngram_width,
    separator=' ',
    pad_values=None,
    padding_width=None,
    preserve_short_sequences=False,
    name=None
)
Creates a tensor of n-grams based on data. The n-grams are created by joining windows of ngram_width adjacent strings from the inner axis of data using separator.
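The windowing described above can be modeled in plain Python. This is only a sketch of the documented behavior for a single 1-D sequence, not TensorFlow's implementation; the helper name sliding_ngrams is hypothetical, and it returns Python strings rather than a tensor of byte strings.

```python
def sliding_ngrams(tokens, ngram_width, separator=" "):
    """Hypothetical pure-Python model of the windowing step.

    Joins each window of `ngram_width` adjacent strings with `separator`.
    """
    # A sequence shorter than ngram_width gives an empty range, so no n-grams.
    return [
        separator.join(tokens[i:i + ngram_width])
        for i in range(len(tokens) - ngram_width + 1)
    ]


print(sliding_ngrams(["A", "B", "C", "D"], 2))    # ['A B', 'B C', 'C D']
print(sliding_ngrams(["TF", "and", "keras"], 1))  # ['TF', 'and', 'keras']
```

For a multi-dimensional or ragged input, the same windowing would apply independently to each innermost sequence.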
The input data can be padded on both the start and end of the sequence, if desired, using the pad_values argument. If set, pad_values should be either a tuple of strings or a single string. The 0th element of the tuple is used to pad the left side of the sequence and the 1st element is used to pad the right side; a single string is used for both sides. The padding_width argument controls how many padding values are added to each side; it defaults to ngram_width-1.
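A rough pure-Python model of the padding step, assuming a single 1-D sequence. The helper name pad_tokens and the "<s>" / "</s>" pad strings are illustrative choices, not part of the API, and the code is a sketch of the described behavior rather than the op's implementation.

```python
def pad_tokens(tokens, ngram_width, pad_values, padding_width=None):
    """Hypothetical model: pad both sides of `tokens` before windowing."""
    # A single string pads both sides; a tuple is (left, right).
    if isinstance(pad_values, str):
        left_pad, right_pad = pad_values, pad_values
    else:
        left_pad, right_pad = pad_values
    # Documented default: ngram_width - 1 pad values per side.
    if padding_width is None:
        padding_width = ngram_width - 1
    # The argument notes on this page say 1-grams are never padded.
    if ngram_width == 1:
        padding_width = 0
    return [left_pad] * padding_width + list(tokens) + [right_pad] * padding_width


padded = pad_tokens(["A", "B", "C"], 2, ("<s>", "</s>"))
print(padded)  # ['<s>', 'A', 'B', 'C', '</s>']

# Windowing the padded sequence with width 2 gives the padded bigrams.
bigrams = [" ".join(padded[i:i + 2]) for i in range(len(padded) - 1)]
print(bigrams)  # ['<s> A', 'A B', 'B C', 'C </s>']
```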
If this op is configured to not have padding, or if it is configured to add padding with padding_width set to less than ngram_width-1, it is possible that a sequence, or a sequence plus padding, is shorter than the ngram width. In that case, no ngrams will be generated for that sequence. This can be prevented by setting preserve_short_sequences to True, which will cause the op to always generate at least one ngram per non-empty sequence.
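The short-sequence rule can be sketched as follows for the unpadded case only. The helper name ngrams_preserving_short is hypothetical, and this is a model of the behavior described above, not the op itself.

```python
def ngrams_preserving_short(tokens, ngram_width, separator=" ",
                            preserve_short_sequences=False):
    """Hypothetical model of the short-sequence rule (no padding)."""
    grams = [
        separator.join(tokens[i:i + ngram_width])
        for i in range(len(tokens) - ngram_width + 1)
    ]
    # Fall back to one n-gram spanning the whole sequence, but only for
    # non-empty sequences that produced nothing.
    if not grams and preserve_short_sequences and tokens:
        grams = [separator.join(tokens)]
    return grams


print(ngrams_preserving_short(["A", "B"], 3))  # []
print(ngrams_preserving_short(["A", "B"], 3, preserve_short_sequences=True))  # ['A B']
print(ngrams_preserving_short([], 3, preserve_short_sequences=True))  # []
```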
Examples:

>>> tf.strings.ngrams(["A", "B", "C", "D"], 2).numpy()
array([b'A B', b'B C', b'C D'], dtype=object)
>>> tf.strings.ngrams(["TF", "and", "keras"], 1).numpy()
array([b'TF', b'and', b'keras'], dtype=object)
Args:

| Argument | Description |
|---|---|
| data | A Tensor or RaggedTensor containing the source data for the ngrams. |
| ngram_width | The width(s) of the ngrams to create. If this is a list or tuple, the op will return ngrams of all specified arities in list order. Values must be non-Tensor integers greater than 0. |
| separator | The separator string used between ngram elements. Must be a string constant, not a Tensor. |
| pad_values | A tuple of (left_pad_value, right_pad_value), a single string, or None. If None, no padding will be added; if a single string, then that string will be used for both left and right padding. Values must be Python strings. |
| padding_width | If set, padding_width pad values will be added to both sides of each sequence. Defaults to ngram_width-1. Must be greater than 0. (Note that 1-grams are never padded, regardless of this value.) |
| preserve_short_sequences | If true, then ensure that at least one ngram is generated for each non-empty input sequence. In particular, if an input sequence plus its padding is shorter than min(ngram_width), then generate a single ngram containing the entire sequence. If false, then no ngrams are generated for these short input sequences. |
| name | The op name. |
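To illustrate the "list order" wording for ngram_width, here is a small pure-Python model for one unpadded 1-D sequence. The helper name multi_width_ngrams is hypothetical; it sketches the documented ordering and is not the op's implementation.

```python
def multi_width_ngrams(tokens, ngram_widths, separator=" "):
    """Hypothetical model: n-grams for several widths, in list order."""
    # Accept a single int as well as a list or tuple of ints.
    if isinstance(ngram_widths, int):
        ngram_widths = [ngram_widths]
    result = []
    for width in ngram_widths:
        if width < 1:
            raise ValueError("ngram widths must be greater than 0")
        # All n-grams of one width are emitted before the next width starts.
        result.extend(
            separator.join(tokens[i:i + width])
            for i in range(len(tokens) - width + 1)
        )
    return result


print(multi_width_ngrams(["A", "B", "C"], [2, 3]))  # ['A B', 'B C', 'A B C']
print(multi_width_ngrams(["A", "B", "C"], [3, 2]))  # ['A B C', 'A B', 'B C']
```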
Returns:

A RaggedTensor of ngrams. If data.shape=[D1...DN, S], then output.shape=[D1...DN, NUM_NGRAMS], where NUM_NGRAMS=S-ngram_width+1+2*padding_width.
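The NUM_NGRAMS formula can be checked with a few lines of arithmetic. The function name count_ngrams is hypothetical; clamping at zero reflects the statement elsewhere on this page that too-short sequences produce no ngrams, and padding_width here is assumed to mean the padding actually applied to each side (0 when no padding is configured).

```python
def count_ngrams(seq_len, ngram_width, padding_width=0):
    """Hypothetical helper: NUM_NGRAMS = S - ngram_width + 1 + 2 * padding_width."""
    # Clamp at zero: a sequence that is too short yields no n-grams.
    return max(0, seq_len - ngram_width + 1 + 2 * padding_width)


print(count_ngrams(4, 2))                   # 3, e.g. 'A B', 'B C', 'C D'
print(count_ngrams(3, 2, padding_width=1))  # 4
print(count_ngrams(2, 3))                   # 0
```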
Raises:

| Exception | Condition |
|---|---|
| TypeError | If pad_values is set to an invalid type. |
| ValueError | If pad_values, padding_width, or ngram_width is set to an invalid value. |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates. Some content is licensed under the numpy license.
Last updated 2024-04-26 UTC.