Class AIAccessor (2.5.0)

AIAccessor(df)

API documentation for AIAccessor class.

Methods

filter

filter(
    instruction: str,
    model,
    ground_with_google_search: bool = False,
    attach_logprobs: bool = False,
)

Filters the DataFrame with the semantics of the user instruction.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> bpd.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

>>> df = bpd.DataFrame({"country": ["USA", "Germany"], "city": ["Seattle", "Berlin"]})
>>> df.ai.filter("{city} is the capital of {country}", model)
   country    city
1  Germany  Berlin
<BLANKLINE>
[1 rows x 2 columns]
Parameters
Name Description
instruction str

An instruction on how to filter the data. This value must contain column references by name, which should be wrapped in a pair of braces. For example, if you have a column "food", you can refer to this column in the instructions like: "The {food} is healthy."

model bigframes.ml.llm.GeminiTextGenerator

A GeminiTextGenerator provided by the Bigframes ML package.

ground_with_google_search bool, default False

Enables Grounding with Google Search for the GeminiTextGenerator model. When set to True, the model incorporates relevant information from Google Search results into its responses, which improves their accuracy and factuality. Note: Using this feature may impact billing costs. Refer to the pricing page for details: https://cloud.google.com/vertex-ai/generative-ai/pricing#google_models

attach_logprobs bool, default False

Controls whether to attach an additional "logprob" column for each result. Logprobs are floating-point values that reflect the LLM's confidence in its responses. Higher values indicate more confidence. The value ranges from negative infinity to 0.

Exceptions
Type Description
NotImplementedError when the AI operator experiment is off.
ValueError when the instruction refers to a non-existing column, or when no columns are referred to.
Returns
Type Description
bigframes.pandas.DataFrame DataFrame filtered by the instruction.
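
The column-reference rule and the ValueError conditions above can be illustrated with a small standalone sketch. It does not call bigframes; `referenced_columns` and `check_instruction` are hypothetical helper names, and the parsing via `string.Formatter` is an assumption about how braces could be read, not the library's implementation.

```python
from string import Formatter


def referenced_columns(instruction: str) -> list[str]:
    """Return the names wrapped in braces, in order of first appearance."""
    seen: list[str] = []
    for _, field_name, _, _ in Formatter().parse(instruction):
        # field_name is None for plain text segments without a brace pair.
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


def check_instruction(instruction: str, columns: list[str]) -> list[str]:
    """Hypothetical pre-check mirroring the documented ValueError cases."""
    refs = referenced_columns(instruction)
    if not refs:
        raise ValueError("the instruction must reference at least one column")
    missing = [name for name in refs if name not in columns]
    if missing:
        raise ValueError(f"the instruction refers to unknown columns: {missing}")
    return refs


refs = check_instruction("{city} is the capital of {country}", ["country", "city"])
print(refs)
```

Running a check like this before calling filter is one way to catch a misspelled column name without sending any rows to the model.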

join

join(
    other,
    instruction: str,
    model,
    ground_with_google_search: bool = False,
    attach_logprobs=False,
)

Joins two dataframes by applying the instruction to each pair of rows from the left and right tables.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> bpd.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

>>> cities = bpd.DataFrame({'city': ['Seattle', 'Ottawa', 'Berlin', 'Shanghai', 'New Delhi']})
>>> continents = bpd.DataFrame({'continent': ['North America', 'Africa', 'Asia']})

>>> cities.ai.join(continents, "{city} is in {continent}", model)
        city      continent
0    Seattle  North America
1     Ottawa  North America
2   Shanghai           Asia
3  New Delhi           Asia
<BLANKLINE>
[4 rows x 2 columns]
Parameters
Name Description
other bigframes.pandas.DataFrame

The other dataframe.

instruction str

An instruction on how left and right rows can be joined. This value must contain column references by name, which should be wrapped in a pair of braces. For example: "The {city} belongs to the {country}". For column names that are shared between the two dataframes, you need to add a "left." or "right." prefix to tell them apart. This is especially important for self joins. For example: "The {left.employee_name} reports to {right.employee_name}". For unique column names, the prefix is optional.

model bigframes.ml.llm.GeminiTextGenerator

A GeminiTextGenerator provided by the Bigframes ML package.

ground_with_google_search bool, default False

Enables Grounding with Google Search for the GeminiTextGenerator model. When set to True, the model incorporates relevant information from Google Search results into its responses, which improves their accuracy and factuality. Note: Using this feature may impact billing costs. Refer to the pricing page for details: https://cloud.google.com/vertex-ai/generative-ai/pricing#google_models

attach_logprobs bool, default False

Controls whether to attach an additional "logprob" column for each result. Logprobs are floating-point values that reflect the LLM's confidence in its responses. Higher values indicate more confidence. The value ranges from negative infinity to 0.

Exceptions
Type Description
ValueError if the amount of data that will be sent for LLM processing is larger than max_rows.
Returns
Type Description
bigframes.pandas.DataFrame The joined dataframe.
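
Because the instruction is applied to each pair of rows, the number of prompts grows with the product of the two row counts. The standalone sketch below only illustrates that growth; `pair_prompts` is a hypothetical helper, the rows are plain dictionaries rather than bigframes objects, and it does not handle the "left." and "right." prefixes.

```python
from itertools import product


def pair_prompts(left_rows, right_rows, instruction):
    """Build one prompt per (left row, right row) pair.

    Assumes column names are unique across both sides, so the two row
    dictionaries can simply be merged before formatting.
    """
    prompts = []
    for left_row, right_row in product(left_rows, right_rows):
        merged = {**left_row, **right_row}
        prompts.append(instruction.format(**merged))
    return prompts


left_rows = [{"city": "Seattle"}, {"city": "Berlin"}]
right_rows = [
    {"continent": "North America"},
    {"continent": "Africa"},
    {"continent": "Asia"},
]

prompts = pair_prompts(left_rows, right_rows, "{city} is in {continent}")
print(len(prompts))  # 2 left rows x 3 right rows
print(prompts[0])
```

With 2 left rows and 3 right rows this yields 6 prompts, which is why a size limit on the data sent for LLM processing matters for larger tables.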

map

map(
    instruction: str,
    model,
    output_schema: typing.Optional[typing.Dict[str, str]] = None,
    ground_with_google_search: bool = False,
    attach_logprobs=False,
)

Maps the DataFrame with the semantics of the user instruction.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> bpd.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

>>> df = bpd.DataFrame({"ingredient_1": ["Burger Bun", "Soy Bean"], "ingredient_2": ["Beef Patty", "Bittern"]})
>>> df.ai.map("What is the food made from {ingredient_1} and {ingredient_2}? One word only.", model=model, output_schema={"food": "string"})
  ingredient_1 ingredient_2      food
0   Burger Bun   Beef Patty  Burger
<BLANKLINE>
1     Soy Bean      Bittern    Tofu
<BLANKLINE>
<BLANKLINE>
[2 rows x 3 columns]
Parameters
Name Description
instruction str

An instruction on how to map the data. This value must contain column references by name, which should be wrapped in a pair of braces. For example, if you have a column "food", you can refer to this column in the instructions like: "Get the ingredients of {food}."

model bigframes.ml.llm.GeminiTextGenerator

A GeminiTextGenerator provided by the Bigframes ML package.

output_schema Dict[str, str] or None, default None

The schema used to generate structured output as a bigframes DataFrame. The schema is a dictionary of string key-value pairs that maps each output column name to its type, for example {"food": "string"}.

ground_with_google_search bool, default False

Enables Grounding with Google Search for the GeminiTextGenerator model. When set to True, the model incorporates relevant information from Google Search results into its responses, which improves their accuracy and factuality. Note: Using this feature may impact billing costs. Refer to the pricing page for details: https://cloud.google.com/vertex-ai/generative-ai/pricing#google_models

attach_logprobs bool, default False

Controls whether to attach an additional "logprob" column for each result. Logprobs are floating-point values that reflect the LLM's confidence in its responses. Higher values indicate more confidence. The value ranges from negative infinity to 0.

Exceptions
Type Description
NotImplementedError when the AI operator experiment is off.
ValueError when the instruction refers to a non-existing column, or when no columns are referred to.
Returns
Type Description
bigframes.pandas.DataFrame DataFrame with attached mapping results.
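
The "logprob" values described for attach_logprobs lie between negative infinity and 0. Assuming they are natural-log probabilities (an assumption, since the text above does not state the base), they can be turned into probabilities between 0 and 1 for easier reading. `logprob_to_probability` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math


def logprob_to_probability(logprob: float) -> float:
    """Convert a natural-log probability in (-inf, 0] to a value in [0, 1]."""
    if logprob > 0:
        raise ValueError("a log probability cannot be positive")
    return math.exp(logprob)


# Illustrative values only; these are not outputs from a real model.
sample_logprobs = [0.0, -0.105, -2.303]
probs = [logprob_to_probability(lp) for lp in sample_logprobs]
for lp, p in zip(sample_logprobs, probs):
    print(f"logprob={lp:>7.3f}  probability={p:.3f}")
```

A logprob of 0 corresponds to full confidence, and more negative values correspond to lower confidence.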

search

search(
    search_column: str,
    query: str,
    top_k: int,
    model,
    score_column: typing.Optional[str] = None,
)

Performs AI semantic search on the DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> import bigframes
>>> bigframes.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.TextEmbeddingGenerator(model_name="text-embedding-005")

>>> df = bpd.DataFrame({"creatures": ["salmon", "sea urchin", "frog", "chimpanzee"]})
>>> df.ai.search("creatures", "monkey", top_k=1, model=model, score_column='distance')
    creatures  distance
3  chimpanzee  0.635844
<BLANKLINE>
[1 rows x 2 columns]
Parameters
Name Description
search_column str

The name of the column to search.

query str

The search query.

top_k int

The number of nearest neighbors to return.

model TextEmbeddingGenerator

A TextEmbeddingGenerator provided by the Bigframes ML package.

score_column Optional[str], default None

The name of the additional column containing the similarity scores. If None, this column won't be attached to the result.

Exceptions
Type Description
ValueError when the search_column is not found in the DataFrame.
TypeError when the provided model is not TextEmbeddingGenerator.
Returns
Type Description
DataFrame the DataFrame with the search result.
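
The idea behind a semantic search with top_k nearest neighbors can be sketched without bigframes. In the sketch below the embedding vectors are invented by hand, `toy_search` is a hypothetical helper, and cosine distance is an assumed metric; the text above does not say which metric the method uses.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def toy_search(embeddings, query_vector, top_k):
    """Rank every item by distance to the query and keep the top_k closest."""
    scored = [
        (name, cosine_distance(vector, query_vector))
        for name, vector in embeddings.items()
    ]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


# Hand-made 3-dimensional vectors standing in for real text embeddings.
toy_embeddings = {
    "salmon": [0.9, 0.1, 0.0],
    "sea urchin": [0.8, 0.0, 0.2],
    "frog": [0.4, 0.6, 0.1],
    "chimpanzee": [0.0, 0.2, 0.9],
}
query_vector = [0.1, 0.1, 0.95]  # stand-in for the embedding of "monkey"

result = toy_search(toy_embeddings, query_vector, top_k=1)
print(result[0][0])
```

In this toy data the vector for "chimpanzee" points in nearly the same direction as the query, so it is the single nearest neighbor.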

sim_join

sim_join(
    other,
    left_on: str,
    right_on: str,
    model,
    top_k: int = 3,
    score_column: typing.Optional[str] = None,
    max_rows: int = 1000,
)

Joins two dataframes based on the similarity of the specified columns.

This method uses BigQuery's VECTOR_SEARCH function to match rows on the left side with the rows that have nearest embedding vectors on the right. In the worst case scenario, the complexity is around O(M * N * log K). Therefore, this is a potentially expensive operation.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> bpd.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.TextEmbeddingGenerator(model_name="text-embedding-005")

>>> df1 = bpd.DataFrame({'animal': ['monkey', 'spider']})
>>> df2 = bpd.DataFrame({'animal': ['scorpion', 'baboon']})

>>> df1.ai.sim_join(df2, left_on='animal', right_on='animal', model=model, top_k=1)
   animal  animal_1
0  monkey    baboon
1  spider  scorpion
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
other DataFrame

The other data frame to join with.

left_on str

The name of the column on the left side for the join.

right_on str

The name of the column on the right side for the join.

top_k int, default 3

The number of nearest neighbors to return.

model TextEmbeddingGenerator

A TextEmbeddingGenerator provided by the Bigframes ML package.

score_column Optional[str], default None

The name of the additional column containing the similarity scores. If None, this column won't be attached to the result.

Exceptions
Type Description
ValueError when the amount of data to be processed exceeds the specified max_rows.
Returns
Type Description
DataFrame the data frame with the join result.
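
The O(M * N * log K) worst-case cost mentioned above can be made concrete with a brute-force stand-in: for each of the M left rows, scan all N right rows and keep the K nearest in a heap. This is only a sketch in plain Python, not BigQuery's VECTOR_SEARCH; the vectors are hand-made, `toy_sim_join` is a hypothetical helper, and squared Euclidean distance is an assumed metric.

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_sim_join(left, right, top_k):
    """For each left item, pair it with its top_k nearest right items.

    heapq.nsmallest keeps a heap of size top_k while scanning the right
    side, so each left row costs about N * log K comparisons.
    """
    pairs = []
    for left_name, left_vector in left.items():
        nearest = heapq.nsmallest(
            top_k,
            right.items(),
            key=lambda item: squared_distance(left_vector, item[1]),
        )
        pairs.extend((left_name, right_name) for right_name, _ in nearest)
    return pairs


# Hand-made 2-dimensional vectors standing in for real text embeddings.
left = {"monkey": [0.9, 0.1], "spider": [0.1, 0.9]}
right = {"scorpion": [0.2, 0.8], "baboon": [0.8, 0.2]}

pairs = toy_sim_join(left, right, top_k=1)
print(pairs)
```

Doubling either side doubles the number of distance computations, which is the reason a max_rows limit guards this operation.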

top_k

top_k(
    instruction: str, model, k: int = 10, ground_with_google_search: bool = False
)

Ranks each tuple and returns the k best according to the instruction.

This method employs a quick select algorithm to efficiently compare the pivot with all other items. By leveraging an LLM (Large Language Model), it then identifies the top 'k' best answers from these comparisons.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> bpd.options.experiments.ai_operators = True
>>> bpd.options.compute.ai_ops_confirmation_threshold = 25

>>> import bigframes.ml.llm as llm
>>> model = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")

>>> df = bpd.DataFrame(
... {
...     "Animals": ["Dog", "Bird", "Cat", "Horse"],
...     "Sounds": ["Woof", "Chirp", "Meow", "Neigh"],
... })
>>> df.ai.top_k("{Animals} are more popular as pets", model=model, k=2)
  Animals Sounds
0     Dog   Woof
2     Cat   Meow
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
instruction str

An instruction on how to rank the data. This value must contain column references by name enclosed in braces. For example, to reference a column named "Animals", use "{Animals}" in the instruction, like: "{Animals} are more popular as pets"

model bigframes.ml.llm.GeminiTextGenerator

A GeminiTextGenerator provided by the Bigframes ML package.

k int, default 10

The number of rows to return.

ground_with_google_search bool, default False

Enables Grounding with Google Search for the GeminiTextGenerator model. When set to True, the model incorporates relevant information from Google Search results into its responses, which improves their accuracy and factuality. Note: Using this feature may impact billing costs. Refer to the pricing page for details: https://cloud.google.com/vertex-ai/generative-ai/pricing#google_models

Exceptions
Type Description
NotImplementedError when the AI operator experiment is off.
ValueError when the instruction refers to a non-existing column, or when no columns are referred to.
Returns
Type Description
bigframes.dataframe.DataFrame A new DataFrame with the top k rows.
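
The quick select idea described above, comparing a pivot with all other items and narrowing down to the best k, can be sketched in plain Python. This is not the library's implementation: `top_k_quickselect` is a hypothetical helper, and the `is_better` callback with made-up popularity scores stands in for the LLM comparison.

```python
import random


def top_k_quickselect(items, k, is_better, rng=None):
    """Return the k best items (in no particular order).

    is_better(a, b) must return True when a should rank above b. Each round
    compares one pivot with every remaining candidate, then keeps only the
    side of the partition that can still contain top-k items.
    """
    rng = rng or random.Random(0)
    candidates = list(items)
    selected = []
    while candidates and len(selected) < k:
        pivot = candidates.pop(rng.randrange(len(candidates)))
        better = [x for x in candidates if is_better(x, pivot)]
        worse = [x for x in candidates if not is_better(x, pivot)]
        needed = k - len(selected)
        if len(better) >= needed:
            # Enough items beat the pivot, so the pivot cannot be in the top k.
            candidates = better
        else:
            # Everything better than the pivot, and the pivot itself, qualify.
            selected.extend(better)
            selected.append(pivot)
            candidates = worse
    return selected


# Made-up scores standing in for the model's pairwise judgments.
popularity = {"Dog": 4, "Bird": 2, "Cat": 3, "Horse": 1}
best = top_k_quickselect(
    popularity, k=2, is_better=lambda a, b: popularity[a] > popularity[b]
)
print(sorted(best))
```

Each round needs one comparison per remaining candidate, so with an LLM as the comparator the number of model calls depends on how evenly the pivots split the data.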