tft.experimental.idf
Maps the terms in x to their inverse document frequency in the same order.
    tft.experimental.idf(
        x: tf.SparseTensor,
        vocab_size: int,
        smooth: bool = True,
        add_baseline: bool = True,
        name: Optional[str] = None
    ) -> tf.SparseTensor
The inverse document frequency of a term, by default, is calculated as
1 + log((corpus size + 1) / (count of documents containing term + 1)).
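As a quick check of this formula in plain Python: with the default smoothing, a corpus of 2 documents and a term that appears in only 1 of them (like "yum" in the example below) gives the weight 1.4054651.

```python
import math

# Smoothed IDF as defined above: a corpus of 2 documents and a term
# that appears in exactly 1 of them.
corpus_size = 2
doc_freq = 1
idf = 1 + math.log((corpus_size + 1) / (doc_freq + 1))
print(idf)  # 1.4054651081081644
```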
Example usage:
    def preprocessing_fn(inputs):
      integerized = tft.compute_and_apply_vocabulary(inputs['x'])
      vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
      idf_weights = tft.experimental.idf(integerized, vocab_size)
      return {
          'idf': idf_weights,
          'integerized': integerized,
      }

    raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
                dict(x=["yum", "yum", "pie"])]
    feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
    raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
      transformed_dataset, transform_fn = (
          (raw_data, raw_data_metadata)
          | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    transformed_data, transformed_metadata = transformed_dataset
    # 1 + log(3/2) = 1.4054651
    transformed_data
    [{'idf': array([1.4054651, 1.4054651, 1., 1., 1.], dtype=float32),
      'integerized': array([3, 2, 0, 0, 0])},
     {'idf': array([1.4054651, 1.4054651, 1.], dtype=float32),
      'integerized': array([1, 1, 0])}]
    example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
    in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                              [1, 0], [1, 1], [1, 2]],
                     values=[1, 2, 0, 0, 0, 3, 3, 0])
    out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                               [1, 0], [1, 1], [1, 2]],
                      values=[1 + log(3/2), 1 + log(3/2), 1, 1, 1,
                              1 + log(3/2), 1 + log(3/2), 1])
Args

x: A 2D SparseTensor of int64 values, most likely the result of calling
  compute_and_apply_vocabulary on a tokenized string.
vocab_size: An int, the size of the vocabulary used to map the strings to
  int64s, including any OOV buckets.
smooth: A bool indicating whether the inverse document frequency should be
  smoothed. If True (the default), the idf is calculated as
  1 + log((corpus size + 1) / (document frequency of term + 1)). Otherwise,
  the idf is 1 + log((corpus size) / (document frequency of term)), which
  can result in a division by zero for terms that appear in no document.
add_baseline: A bool indicating whether a constant baseline of 1.0 should be
  added to the inverse document frequency. If True (the default), the idf is
  calculated as 1 + log(...); otherwise it is log(...) without the constant
  1 baseline. Keeping the baseline reduces the discrepancy in idf between
  commonly seen terms and rare terms.
name: (Optional) A name for this operation.
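The interaction of smooth and add_baseline can be sketched with a small scalar helper. This is a hypothetical reference formula mirroring the documented behavior, not the library's actual implementation, which computes the weights element-wise over a SparseTensor:

```python
import math

def idf_weight(corpus_size, doc_freq, smooth=True, add_baseline=True):
    # Hypothetical scalar version of the documented formulas.
    if smooth:
        ratio = (corpus_size + 1) / (doc_freq + 1)
    else:
        # Unsmoothed: divides by zero when a term appears in no document.
        ratio = corpus_size / doc_freq
    return (1.0 if add_baseline else 0.0) + math.log(ratio)

# Numbers from the example above: 2 documents, "yum" appears in 1 of them,
# "pie" appears in both.
print(idf_weight(2, 1))                      # 1 + log(3/2) = 1.4054651...
print(idf_weight(2, 1, add_baseline=False))  # log(3/2) = 0.4054651...
print(idf_weight(2, 2))                      # 1 + log(3/3) = 1.0
```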
Returns

A SparseTensor with indices [index_in_batch, index_in_local_sequence] and
inverse document frequency values, with the same shape as the input x.
Raises

ValueError: If x does not have 2 dimensions.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-01 UTC.