API documentation for pandas package.
Classes
NamedAgg
NamedAgg(column, aggfunc)
Packages Functions
concat
Concatenate BigQuery DataFrames objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
| Parameters | |
|---|---|
| Name | Description | 
| axis | The axis to concatenate along. | 
| join | How to handle indexes on other axis (or axes). | 
| ignore_index | If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. | 
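The interaction of `axis`, `join`, and `ignore_index` can be sketched locally. This sketch uses plain pandas as a stand-in, on the assumption that `bigframes.pandas.concat` follows the same semantics for these three parameters; the frames and column names are made up for illustration.

```python
import pandas as pd  # plain pandas, used as a local stand-in (assumption)

left = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})
right = pd.DataFrame({"name": ["c"], "score": [3], "extra": [True]})

# axis=0 stacks rows. join="inner" keeps only the columns both inputs share,
# so "extra" is dropped. ignore_index=True discards the original row labels
# and relabels the result 0, ..., n - 1.
stacked = pd.concat([left, right], axis=0, join="inner", ignore_index=True)

print(list(stacked.columns))
print(list(stacked.index))
```

With `join="outer"` the `extra` column would be kept and filled with missing values for rows that came from `left`.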
cut
cut(
    x: bigframes.series.Series, bins: int, *, labels: typing.Optional[bool] = None
) -> bigframes.series.Series
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, cut could convert ages to groups of
age ranges. Supports binning into an equal number of bins, or a
pre-specified array of bins.
labels=False implies you just want the bins back.
Examples:
import bigframes.pandas as pd
pd.options.display.progress_bar = None
s = pd.Series([0, 1, 1, 2])
pd.cut(s, bins=4, labels=False)
0    0
1    1
2    1
3    3
dtype: Int64
| Parameters | |
|---|---|
| Name | Description | 
| x | The input Series to be binned. Must be 1-dimensional. | 
| bins | The criteria to bin by. int : Defines the number of equal-width bins in the range of  | 
| labels | Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). If True, raises an error. When  | 
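To make the integer codes in the example above less opaque, here is a minimal pure-Python sketch of equal-width binning. It assumes right-closed intervals with the minimum value placed in the first bin, which is the pandas default; `equal_width_bin_codes` is a hypothetical helper, not part of the library.

```python
import math
from typing import List


def equal_width_bin_codes(values: List[float], bins: int) -> List[int]:
    """Assign each value an integer bin code in [0, bins - 1] (illustrative only)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("this sketch does not handle a zero-width range")
    width = (hi - lo) / bins
    codes = []
    for v in values:
        if v == lo:
            # The smallest value belongs to the first bin.
            codes.append(0)
        else:
            # Right-closed intervals: a value on an edge falls in the lower bin.
            codes.append(min(bins - 1, math.ceil((v - lo) / width) - 1))
    return codes


print(equal_width_bin_codes([0, 1, 1, 2], bins=4))  # [0, 1, 1, 3]
```

With four bins over the range 0 to 2, each bin is 0.5 wide, so the value 1 sits on the upper edge of the second bin (code 1) and the value 2 on the upper edge of the last bin (code 3).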
merge
merge(
    left: bigframes.dataframe.DataFrame,
    right: bigframes.dataframe.DataFrame,
    how: typing.Literal["inner", "left", "outer", "right"] = "inner",
    on: typing.Optional[str] = None,
    *,
    left_on: typing.Optional[str] = None,
    right_on: typing.Optional[str] = None,
    sort: bool = False,
    suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame
Merge DataFrame objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
| Parameters | |
|---|---|
| Name | Description | 
| on | Columns to join on. It must be found in both DataFrames. Either on or left_on + right_on must be passed in. | 
| left_on | Columns to join on in the left DataFrame. Either on or left_on + right_on must be passed in. | 
| right_on | Columns to join on in the right DataFrame. Either on or left_on + right_on must be passed in. | 
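The difference between `on` and the `left_on` + `right_on` pair, and the role of `suffixes`, can be illustrated with a small local sketch. It uses plain pandas as a stand-in, assuming the BigQuery DataFrames `merge` behaves the same way for these parameters; the data is invented.

```python
import pandas as pd  # plain pandas, used as a local stand-in (assumption)

orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10, 20, 30]})
customers = pd.DataFrame({"id": [1, 2, 4], "amount": [5, 6, 7]})

# The key columns have different names, so left_on/right_on is used instead of on.
# Both frames have a non-key column called "amount"; suffixes disambiguates them.
joined = pd.merge(
    orders,
    customers,
    how="inner",
    left_on="customer_id",
    right_on="id",
    suffixes=("_x", "_y"),
)

print(list(joined.columns))
print(len(joined))
```

An inner join keeps only keys present on both sides (1 and 2 here); `how="left"` would also keep order 3 with missing values in the right-hand columns.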
read_csv
read_csv(
    filepath_or_buffer: str | IO["bytes"],
    *,
    sep: Optional[str] = ",",
    header: Optional[int] = 0,
    names: Optional[
        Union[MutableSequence[Any], numpy.ndarray[Any, Any], Tuple[Any, ...], range]
    ] = None,
    index_col: Optional[
        Union[int, str, Sequence[Union[str, int]], Literal[False]]
    ] = None,
    usecols: Optional[
        Union[
            MutableSequence[str],
            Tuple[str, ...],
            Sequence[int],
            pandas.Series,
            pandas.Index,
            numpy.ndarray[Any, Any],
            Callable[[Any], bool],
        ]
    ] = None,
    dtype: Optional[Dict] = None,
    engine: Optional[
        Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: Optional[str] = None,
    **kwargs
) -> bigframes.dataframe.DataFrame
Loads a DataFrame from a comma-separated values (CSV) file, either local or in Cloud Storage.
The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]
| Parameters | |
|---|---|
| Name | Description | 
| filepath_or_buffer | A local or Google Cloud Storage ( | 
| sep | the separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support  | 
| header | row number to use as the column names. -  | 
| names | a list of column names to use. If the file contains a header row and you want to pass this parameter, then  | 
| index_col | column(s) to use as the row labels of the DataFrame, either given as string name or column index.  | 
| usecols | List of column names to use. The BigQuery engine only supports a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in  | 
| dtype | Data type for data or columns. Only to be used with default engine. | 
| engine | Type of engine to use. If  | 
| encoding | The character encoding of the data. The default encoding is  | 
read_gbq
read_gbq(
    query_or_table: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame
Loads a DataFrame from BigQuery.
BigQuery tables are an unordered, unindexed data source. By default, the DataFrame will have an arbitrary index and ordering.
Set the index_col argument to one or more columns to choose an
index. The resulting DataFrame is sorted by the index columns. For the
best performance, ensure the index columns don't contain duplicate
values.
If your query has no natural ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
If the input is a table ID:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]
Preserve ordering in a query input.
>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]
| Parameters | |
|---|---|
| Name | Description | 
| query_or_table | A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of  | 
| index_col | Name of result column(s) to use for index in results DataFrame. | 
| col_order | List of BigQuery column names in the desired order for results DataFrame. | 
| max_results | If set, limit the maximum number of rows to fetch from the query results. | 
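The ROW_NUMBER() pattern from the example above can be wrapped in a small helper. This is a sketch only: `with_row_index`, `load_ordered`, and the table name are hypothetical, and the inner query is assumed to have no trailing semicolon. The call to `read_gbq` needs Google Cloud credentials, so it is defined here but not executed.

```python
def with_row_index(sql: str, order_by: str, index_name: str = "rowindex") -> str:
    """Wrap a query so it returns an explicit, ordered index column (hypothetical helper)."""
    return (
        f"SELECT ROW_NUMBER() OVER (ORDER BY {order_by}) AS {index_name}, t.*\n"
        f"FROM (\n{sql}\n) AS t"
    )


def load_ordered(sql: str, order_by: str):
    """Load the wrapped query with the generated column as index.

    Requires Google Cloud credentials and the bigframes package; not run here.
    """
    import bigframes.pandas as bpd

    return bpd.read_gbq(with_row_index(sql, order_by), index_col="rowindex")


# Hypothetical table name, for illustration only.
query = "SELECT name, post_abbr FROM `my-project.my_dataset.us_states`"
print(with_row_index(query, order_by="name"))
```

Because the generated `rowindex` values are unique, this also satisfies the recommendation that index columns not contain duplicates.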
read_gbq_function
read_gbq_function(function_name: str)
Loads a BigQuery function from BigQuery.
Then it can be applied to a DataFrame or Series.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> function_name = "bqutil.fn.cw_lower_case_ascii_only"
>>> func = bpd.read_gbq_function(function_name=function_name)
>>> func.bigframes_remote_function
'bqutil.fn.cw_lower_case_ascii_only'
| Parameter | |
|---|---|
| Name | Description | 
| function_name | the function's name in BigQuery in the format  | 
read_gbq_model
read_gbq_model(model_name: str)
Loads a BigQuery ML model from BigQuery.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read an existing BigQuery ML model.
>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)
| Parameter | |
|---|---|
| Name | Description | 
| model_name | the model's name in BigQuery in the format  | 
read_gbq_query
read_gbq_query(
    query: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame
Turn a SQL query into a DataFrame.
Note: Because the results are written to a temporary table, ordering by
ORDER BY is not preserved. A unique index_col is recommended. Use
row_number() over () if there is no natural unique index or you
want to preserve ordering.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Simple query input:
>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')
>>> df.head(2)
  pitcherFirstName pitcherLastName  pitchSpeed
0                                            0
1                                            0
<BLANKLINE>
[2 rows x 3 columns]
Preserve ordering in a query input.
>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]
See also: Session.read_gbq.
read_gbq_table
read_gbq_table(
    query: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> bigframes.dataframe.DataFrame
Turn a BigQuery table into a DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).
>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]
See also: Session.read_gbq.
read_json
read_json(
    path_or_buf: str | IO["bytes"],
    *,
    orient: Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: Optional[Dict] = None,
    encoding: Optional[str] = None,
    lines: bool = False,
    engine: Literal["ujson", "pyarrow", "bigquery"] = "ujson",
    **kwargs
) -> bigframes.dataframe.DataFrame
Convert a JSON string to DataFrame object.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
   id   name
0   1  Alice
1   2    Bob
<BLANKLINE>
[2 rows x 2 columns]
| Parameters | |
|---|---|
| Name | Description | 
| path_or_buf | A local or Google Cloud Storage ( | 
| orient | If  | 
| dtype | If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all  | 
| encoding | The encoding to use to decode py3 bytes. | 
| lines | Read the file as a json object per line. If using  | 
| engine | Type of engine to use. If  | 
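The combination `lines=True, orient="records"` used in the example above corresponds to newline-delimited JSON, with one object per line. A minimal local sketch, using plain pandas as a stand-in and an in-memory string instead of a Cloud Storage path (both are assumptions made so the snippet runs without credentials):

```python
import io

import pandas as pd  # plain pandas, used as a local stand-in (assumption)

# Two records, one JSON object per line (JSON Lines).
json_lines = '{"id": 1, "name": "Alice"}\n{"id": 2, "name": "Bob"}\n'

df = pd.read_json(io.StringIO(json_lines), orient="records", lines=True)
print(df)
```

Each line becomes one row, and the object keys become the column names.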
read_pandas
read_pandas(
    pandas_dataframe: pandas.core.frame.DataFrame,
) -> bigframes.dataframe.DataFrame
Loads DataFrame from a pandas DataFrame.
The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Examples:
>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]
| Parameter | |
|---|---|
| Name | Description | 
| pandas_dataframe | a pandas DataFrame object to be loaded. | 
read_parquet
read_parquet(path: str | IO["bytes"]) -> bigframes.dataframe.DataFrame
Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]
| Parameter | |
|---|---|
| Name | Description | 
| path | Local or Cloud Storage path to Parquet file. | 
read_pickle
read_pickle(
    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,
)
Load pickled BigFrames object (or any object) from file.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]
| Parameters | |
|---|---|
| Name | Description | 
| filepath_or_buffer | String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS. | 
| compression | For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}. | 
| storage_options | Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here. | 
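How `compression='infer'` picks a codec from the file extension can be checked with a local round trip. The sketch below uses plain pandas and a temporary file as stand-ins (assumptions); it writes a `.gz` path, reads it back, and inspects the first two bytes for the gzip magic number.

```python
import os
import tempfile

import pandas as pd  # plain pandas, used as a local stand-in (assumption)

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl.gz")
    # The default compression="infer" selects gzip from the ".gz" extension.
    df.to_pickle(path)
    restored = pd.read_pickle(path, compression="infer")
    with open(path, "rb") as f:
        magic = f.read(2)

print(restored.equals(df))
print(magic == b"\x1f\x8b")  # gzip files start with bytes 1f 8b
```

Passing `compression=None` for the same file would skip decompression and fail to unpickle, which is why the extension and the setting need to agree.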
remote_function
remote_function(
    input_types: typing.List[type],
    output_type: type,
    dataset: typing.Optional[str] = None,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,
    name: typing.Optional[str] = None,
    packages: typing.Optional[typing.Sequence[str]] = None,
)
Decorator to turn a user-defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
- Have the below APIs enabled for your project:
  - BigQuery Connection API
  - Cloud Functions API
  - Cloud Run API
  - Cloud Build API
  - Artifact Registry API
  - Cloud Resource Manager API

  This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

  Or from the gcloud CLI:
  $ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com
- Have the following IAM roles enabled for you:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  - Cloud Functions Developer (roles/cloudfunctions.developer)
  - Service Account User (roles/iam.serviceAccountUser) on the service account [email protected]
  - Storage Object Viewer (roles/storage.objectViewer)
  - Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the BigQuery connection being used is not pre-created and is created dynamically with user credentials.)
- Either the user has the setIamPolicy privilege on the project, or a BigQuery connection is pre-created with the necessary IAM role set:
  - To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
  - To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
  - Alternatively, IAM can also be set up via the gcloud CLI:
    $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker".
| Parameters | |
|---|---|
| Name | Description | 
| input_types | List of input data types in the user defined function. | 
| output_type | Data type of the output in the user defined function. | 
| dataset | Dataset in which to create a BigQuery remote function. It should be in  | 
| bigquery_connection | Name of the BigQuery connection. You should either have the connection already created in the  | 
| reuse | Reuse the remote function if already exists.  | 
| name | Explicit name of the persisted BigQuery remote function. Use it with caution, because two users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. | 
| packages | Explicit name of the external package dependencies. Each dependency is added to the  |
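A minimal sketch of how the decorator might be applied, based only on the signature shown above. `get_bucket` and `deploy_get_bucket` are hypothetical names; the deployment step needs the APIs and IAM roles listed earlier, so it is wrapped in a function that is not called here, and only the plain Python logic is exercised.

```python
def get_bucket(num: float) -> str:
    """Plain Python logic to deploy; kept self-contained so it can run remotely."""
    if not num:
        return "NA"
    boundary = 4000
    return "at_or_above_4000" if num >= boundary else "below_4000"


def deploy_get_bucket():
    """Turn get_bucket into a BigQuery remote function.

    Requires Google Cloud credentials and the bigframes package; not run here.
    """
    import bigframes.pandas as bpd

    # input_types=[float], output_type=str. reuse=False asks for a fresh
    # deployment instead of reusing an existing remote function.
    return bpd.remote_function([float], str, reuse=False)(get_bucket)


# The undecorated function can be tested locally before deployment.
print(get_bucket(3475.0))
print(get_bucket(4650.0))
```

Testing the plain function first is a cheap way to catch logic errors before paying for a Cloud Functions deployment.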