# tft.tfidf

[View source on GitHub](https://github.com/tensorflow/transform/blob/v1.16.0/tensorflow_transform/mappers.py#L677-L774)

Maps the terms in x to their term frequency * inverse document frequency.

    tft.tfidf(
        x: tf.SparseTensor,
        vocab_size: int,
        smooth: bool = True,
        name: Optional[str] = None
    ) -> Tuple[tf.SparseTensor, tf.SparseTensor]

The term frequency of a term in a document is calculated as
(count of term in document) / (document size).

The inverse document frequency of a term is, by default, calculated as
1 + log((corpus size + 1) / (count of documents containing term + 1)).

#### Example usage:

    def preprocessing_fn(inputs):
      integerized = tft.compute_and_apply_vocabulary(inputs['x'])
      vocab_size = tft.get_num_buckets_for_transformed_feature(integerized)
      vocab_index, tfidf_weight = tft.tfidf(integerized, vocab_size)
      return {
          'index': vocab_index,
          'tf_idf': tfidf_weight,
          'integerized': integerized,
      }
    raw_data = [dict(x=["I", "like", "pie", "pie", "pie"]),
                dict(x=["yum", "yum", "pie"])]
    feature_spec = dict(x=tf.io.VarLenFeature(tf.string))
    raw_data_metadata = tft.DatasetMetadata.from_feature_spec(feature_spec)
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
      transformed_dataset, transform_fn = (
          (raw_data, raw_data_metadata)
          | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    transformed_data, transformed_metadata = transformed_dataset
    transformed_data
    [{'index': array([0, 2, 3]), 'integerized': array([3, 2, 0, 0, 0]),
      'tf_idf': array([0.6, 0.28109303, 0.28109303], dtype=float32)},
     {'index': array([0, 1]), 'integerized': array([1, 1, 0]),
      'tf_idf': array([0.33333334, 0.9369768 ], dtype=float32)}]

    example strings: [["I", "like", "pie", "pie", "pie"], ["yum", "yum", "pie"]]
    in: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                              [1, 0], [1, 1], [1, 2]],
                     values=[1, 2, 0, 0, 0, 3, 3, 0])
    out: SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                      values=[1, 2, 0, 3, 0])
         SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]],
                      values=[(1/5)*(log(3/2)+1), (1/5)*(log(3/2)+1), (3/5),
                              (2/3)*(log(3/2)+1), (1/3)])

**Note:** the first doc's duplicate "pie" strings have been combined into one output, as have the second doc's duplicate "yum" strings.

#### Args

`x`: A 2D `SparseTensor` of int64 values (most likely the result of calling `compute_and_apply_vocabulary` on a tokenized string).

`vocab_size`: An int: the size of the vocabulary used to turn the strings into int64s, including any OOV buckets.

`smooth`: A bool indicating whether the inverse document frequency should be smoothed. If True (the default), the idf is calculated as 1 + log((corpus size + 1) / (document frequency of term + 1)). Otherwise, the idf is 1 + log((corpus size) / (document frequency of term)), which could result in a division-by-zero error.

`name`: (Optional) A name for this operation.
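The two idf variants selected by `smooth` can be sketched in plain Python. This is an illustrative helper (`inverse_document_frequency` is a hypothetical name, not part of the tf.Transform API) that mirrors the formulas stated above:

```python
import math

def inverse_document_frequency(corpus_size, doc_freq, smooth=True):
    """Illustrative idf helper mirroring the formulas above (not tf.Transform API)."""
    if smooth:
        # Smoothed: well defined even when doc_freq is 0.
        return 1.0 + math.log((corpus_size + 1) / (doc_freq + 1))
    # Unsmoothed: raises ZeroDivisionError when doc_freq is 0.
    return 1.0 + math.log(corpus_size / doc_freq)

# With a corpus of 2 documents, a term appearing in both gets a smoothed
# idf of 1 + log(3/3) == 1.0, so its weight is just its term frequency.
```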
#### Returns

Two `SparseTensor`s with indices [index_in_batch, index_in_bag_of_words]. The first has values vocab_index, taken from input `x`; the second has values tfidf_weight.
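The in/out `SparseTensor` example above can be reproduced with plain Python lists standing in for sparse rows. This sketch applies only the formulas stated in this doc (smooth idf), not the tf.Transform implementation:

```python
import math
from collections import Counter

# Each row holds one document's int64 vocab ids (the "in" values above).
rows = [[1, 2, 0, 0, 0], [3, 3, 0]]
n_docs = len(rows)
# Document frequency: number of documents containing each vocab id.
df = Counter(v for row in rows for v in set(row))

indices, vocab_index, tfidf_weight = [], [], []
for i, row in enumerate(rows):
    # Duplicate ids within a row collapse to a single bag-of-words entry.
    for j, (v, count) in enumerate(Counter(row).items()):
        indices.append([i, j])
        vocab_index.append(v)
        tf = count / len(row)                              # term frequency
        idf = 1.0 + math.log((n_docs + 1) / (df[v] + 1))   # smooth=True
        tfidf_weight.append(tf * idf)

# vocab_index -> [1, 2, 0, 3, 0], matching the first output SparseTensor;
# tfidf_weight matches the second, e.g. 3/5 = 0.6 for "pie" (id 0) in doc 0.
```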
#### Raises

`ValueError` if `x` does not have 2 dimensions.

Last updated 2024-11-01 UTC.