Package pandas (1.25.0)

API documentation for pandas package.

Classes

ArrowDtype

An ExtensionDtype for PyArrow data types.

BooleanDtype

Extension dtype for boolean data.

.. method:: None :noindex:

.. rubric:: Examples

pd.BooleanDtype() BooleanDtype

DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Float64Dtype

An ExtensionDtype for float64 data.

This dtype uses pd.NA as missing value indicator.

For Float32Dtype:

ser = pd.Series([2.25, pd.NA], dtype=pd.Float32Dtype()) ser.dtype Float32Dtype()

For Float64Dtype:

ser = pd.Series([2.25, pd.NA], dtype=pd.Float64Dtype()) ser.dtype Float64Dtype()

Index

Immutable sequence used for indexing and alignment.

The basic object storing axis labels for all objects.

Int64Dtype

An ExtensionDtype for int64 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

For Int8Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.Int8Dtype()) ser.dtype Int8Dtype()

For Int16Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.Int16Dtype()) ser.dtype Int16Dtype()

For Int32Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.Int32Dtype()) ser.dtype Int32Dtype()

For Int64Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.Int64Dtype()) ser.dtype Int64Dtype()

For UInt8Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.UInt8Dtype()) ser.dtype UInt8Dtype()

For UInt16Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.UInt16Dtype()) ser.dtype UInt16Dtype()

For UInt32Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.UInt32Dtype()) ser.dtype UInt32Dtype()

For UInt64Dtype:

ser = pd.Series([2, pd.NA], dtype=pd.UInt64Dtype()) ser.dtype UInt64Dtype()

MultiIndex

A multi-level, or hierarchical, index object for pandas objects.

NamedAgg

NamedAgg(column, aggfunc)

Series

N-dimensional analogue of DataFrame. Store multi-dimensional in a size-mutable, labeled data structure

StringDtype

Extension dtype for string data.

option_context

Context manager to temporarily set thread-local options in the with statement context.

You need to invoke as option_context(pat, val, [(pat, val), ...]).

import bigframes

with bigframes.option_context('display.max_rows', 10, 'display.max_columns', 5): ... pass

Packages Functions

close_session

close_session()

Start a fresh session the next time a function requires a session.

Closes the current session if it was already started, deleting any temporary tables that were created.

concat

Concatenate BigQuery DataFrames objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Examples:

>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None

Combine two Series.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: string

Clear the existing index and reset it in the result by setting the ignore_index option to True.

>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: string

Combine two DataFrame objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
<BLANKLINE>
[2 rows x 2 columns]
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
<BLANKLINE>
[2 rows x 2 columns]
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
<BLANKLINE>
[4 rows x 2 columns]

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog
<BLANKLINE>
[2 rows x 3 columns]
>>> pd.concat([df1, df3])
  letter  number animal
0      a       1   <NA>
1      b       2   <NA>
0      c       3    cat
1      d       4    dog
<BLANKLINE>
[4 rows x 3 columns]

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
<BLANKLINE>
[4 rows x 2 columns]
Parameters
Name Description
objs

Objects to concatenate. Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

axis

The axis to concatenate along.

join

How to handle indexes on other axis (or axes).

ignore_index

If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

get_global_session

get_global_session()

Gets the global session.

Creates the global session if it does not exist.

merge

merge(
    left: bigframes.dataframe.DataFrame,
    right: bigframes.dataframe.DataFrame,
    how: typing.Literal["inner", "left", "outer", "right", "cross"] = "inner",
    on: typing.Optional[str] = None,
    *,
    left_on: typing.Optional[str] = None,
    right_on: typing.Optional[str] = None,
    sort: bool = False,
    suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame

Merge DataFrame objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Parameters
Name Description
on

Columns to join on. It must be found in both DataFrames. Either on or left_on + right_on must be passed in.

left_on

Columns to join on in the left DataFrame. Either on or left_on + right_on must be passed in.

right_on

Columns to join on in the right DataFrame. Either on or left_on + right_on must be passed in.

read_csv

read_csv(
    filepath_or_buffer: typing.Union[str, typing.IO[bytes]],
    *,
    sep: typing.Optional[str] = ",",
    header: typing.Optional[int] = 0,
    names: typing.Optional[
        typing.Union[
            typing.MutableSequence[typing.Any],
            numpy.ndarray[typing.Any, typing.Any],
            typing.Tuple[typing.Any, ...],
            range,
        ]
    ] = None,
    index_col: typing.Optional[
        typing.Union[
            int,
            str,
            typing.Sequence[typing.Union[str, int]],
            bigframes.enums.DefaultIndexKind,
            typing.Literal[False],
        ]
    ] = None,
    usecols: typing.Optional[
        typing.Union[
            typing.MutableSequence[str],
            typing.Tuple[str, ...],
            typing.Sequence[int],
            pandas.core.series.Series,
            pandas.core.indexes.base.Index,
            numpy.ndarray[typing.Any, typing.Any],
            typing.Callable[[typing.Any], bool],
        ]
    ] = None,
    dtype: typing.Optional[typing.Dict] = None,
    engine: typing.Optional[
        typing.Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: typing.Optional[str] = None,
    **kwargs
) -> bigframes.dataframe.DataFrame

Loads data from a comma-separated values (csv) file into a DataFrame.

The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
filepath_or_buffer

A local or Google Cloud Storage (gs://) path with engine="bigquery" otherwise passed to pandas.read_csv.

sep

the separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support sep=" " to specify tab character as separator. Default engine supports having any number of spaces as separator by specifying sep="\s+". Separators longer than 1 character are interpreted as regular expressions by the default engine. BigQuery engine only supports single character separators.

header

row number to use as the column names. - None: Instructs autodetect that there are no headers and data should be read starting from the first row. - 0: If using engine="bigquery", Autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using default engine, pandas assumes the first row contains column names unless the names argument is specified. If names is provided, then the first row is ignored, second row is read as data, and column names are inferred from names. - N > 0: If using engine="bigquery", Autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using default engine, pandas will skip N rows and assumes row N+1 contains column names unless the names argument is specified. If names is provided, row N+1 will be ignored, row N+2 will be read as data, and column names are inferred from names.

names

a list of column names to use. If the file contains a header row and you want to pass this parameter, then header=0 should be passed as well so the first (header) row is ignored. Only to be used with default engine.

index_col

column(s) to use as the row labels of the DataFrame, either given as string name or column index. index_col=False can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the index_col. Neither engine supports having a multi-column index.

usecols

List of column names to use): The BigQuery engine only supports having a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in usecols can be defined to correspond to column names provided with the names parameter (ignoring the document's header row of column names). The order of the column indices/names in usecols is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names that evaluate to True by the callable function will be in the resulting dataframe.

dtype

Data type for data or columns. Only to be used with default engine.

engine

Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_csv.

encoding

encoding the character encoding of the data. The default encoding is UTF-8 for both engines. The default engine acceps a wide range of encodings. Refer to Python documentation for a comprehensive list, https://docs.python.org/3/library/codecs.html#standard-encodings The BigQuery engine only supports UTF-8 and ISO-8859-1.

read_gbq

read_gbq(
    query_or_table: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    configuration: typing.Optional[typing.Dict] = None,
    max_results: typing.Optional[int] = None,
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = (),
    use_cache: typing.Optional[bool] = None,
    col_order: typing.Iterable[str] = ()
) -> bigframes.dataframe.DataFrame

Loads a DataFrame from BigQuery.

BigQuery tables are an unordered, unindexed data source. To add support pandas-compatibility, the following indexing options are supported via the index_col parameter:

  • (Empty iterable, default) A default index. Behavior may change. Explicitly set index_col if your application makes use of specific index values.

    If a table has primary key(s), those are used as the index, otherwise a sequential index is generated.

  • (<xref uid="bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64">bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64</xref>) Add an arbitrary sequential index and ordering. Warning This uses an analytic windowed operation that prevents filtering push down. Avoid using on large clustered or partitioned tables.
  • (Recommended) Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-like outer join behavior.

If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

If the input is a table ID:

>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

Read table path with wildcard suffix and filters:

df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])

Preserve ordering in a query input.

>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

Reading data with columns and filters parameters:

>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
...             "bigquery-public-data.baseball.games_wide",
...             columns=columns,
...             filters=filters,
...         )
>>> df.head(1)
  pitcherFirstName pitcherLastName  year  pitchSpeed
0             John            Gant  2016          95
<BLANKLINE>
[1 rows x 4 columns]
Parameters
Name Description
query_or_table

A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of project.dataset.tablename or dataset.tablename. Can also take wildcard table name, such as project.dataset.table_prefix*. In tha case, will read all the matched table as one DataFrame.

index_col

Name of result column(s) to use for index in results DataFrame. If an empty iterable, such as (), a default index is generated. Do not depend on specific index values in this case. New in bigframes version 1.3.0: If index_cols is not set, the primary key(s) of the table are used as the index. New in bigframes version 1.4.0: Support bigframes.enums.DefaultIndexKind to override default index behavior.

columns

List of BigQuery column names in the desired order for results DataFrame.

configuration

Query config parameters for job processing. For example: configuration = {'query': {'useQueryCache': False}}. For more information see BigQuery REST API Reference https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query__.

max_results

If set, limit the maximum number of rows to fetch from the query results.

filters

To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. If using wildcard table suffix in query_or_table, can specify '_table_suffix' pseudo column to filter the tables to be read into the DataFrame.

use_cache

Caches query results if set to True. When None, it behaves as True, but should not be combined with useQueryCache in configuration to avoid conflicts.

col_order

Alias for columns, retained for backwards compatibility.

read_gbq_function

read_gbq_function(function_name: str, is_row_processor: bool = False)

Loads a BigQuery function from BigQuery.

Then it can be applied to a DataFrame or Series.

BigQuery Utils provides many public functions under the bqutil project on Google Cloud Platform project (See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs). You can checkout Community UDFs to use community-contributed functions. (See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Use the cw_lower_case_ascii_only function from Community UDFs.

>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")

You can run it on scalar input. Usually you would do so to verify that it works as expected before applying to all values in a Series.

>>> func('AURÉLIE')
'aurÉlie'

You can apply it to a BigQuery DataFrames Series.

>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
   id       name
0   1    AURÉLIE
1   2  CÉLESTINE
2   3     DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]

>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
   id       name   new_name
0   1    AURÉLIE    aurÉlie
1   2  CÉLESTINE  cÉlestine
2   3     DAPHNÉ     daphnÉ
<BLANKLINE>
[3 rows x 3 columns]

You can even use a function with multiple inputs. For example, cw_regexp_replace_5 from Community UDFs.

>>> func = bpd.read_gbq_function("bqutil.fn.cw_regexp_replace_5")
>>> func('TestStr123456', 'Str', 'Cad$', 1, 1)
'TestCad$123456'

>>> df = bpd.DataFrame({
...     "haystack" : ["TestStr123456", "TestStr123456Str", "TestStr123456Str"],
...     "regexp" : ["Str", "Str", "Str"],
...     "replacement" : ["Cad$", "Cad$", "Cad$"],
...     "offset" : [1, 1, 1],
...     "occurrence" : [1, 2, 1]
... })
>>> df
           haystack regexp replacement  offset  occurrence
0     TestStr123456    Str        Cad$       1           1
1  TestStr123456Str    Str        Cad$       1           2
2  TestStr123456Str    Str        Cad$       1           1
<BLANKLINE>
[3 rows x 5 columns]
>>> df.apply(func, axis=1)
0       TestCad$123456
1    TestStr123456Cad$
2    TestCad$123456Str
dtype: string
Parameters
Name Description
function_name

The function's name in BigQuery in the format project_id.dataset_id.function_name, or dataset_id.function_name to load from the default project, or function_name to load from the default project and the dataset associated with the current session.

is_row_processor

Whether the function is a row processor. This is set to True for a function which receives an entire row of a DataFrame as a pandas Series.

read_gbq_model

read_gbq_model(model_name: str)

Loads a BigQuery ML model from BigQuery.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read an existing BigQuery ML model.

>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)
Parameter
Name Description
model_name

the model's name in BigQuery in the format project_id.dataset_id.model_id, or just dataset_id.model_id to load from the default project.

read_gbq_query

read_gbq_query(
    query: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    configuration: typing.Optional[typing.Dict] = None,
    max_results: typing.Optional[int] = None,
    use_cache: typing.Optional[bool] = None,
    col_order: typing.Iterable[str] = (),
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = ()
) -> bigframes.dataframe.DataFrame

Turn a SQL query into a DataFrame.

Note: Because the results are written to a temporary table, ordering by ORDER BY is not preserved. A unique index_col is recommended. Use row_number() over () if there is no natural unique index or you want to preserve ordering.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Simple query input:

>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')

Preserve ordering in a query input.

>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

See also: Session.read_gbq.

read_gbq_table

read_gbq_table(
    query: str,
    *,
    index_col: typing.Union[
        typing.Iterable[str], str, bigframes.enums.DefaultIndexKind
    ] = (),
    columns: typing.Iterable[str] = (),
    max_results: typing.Optional[int] = None,
    filters: typing.Union[
        typing.Iterable[
            typing.Tuple[
                str,
                typing.Literal[
                    "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                ],
                typing.Any,
            ]
        ],
        typing.Iterable[
            typing.Iterable[
                typing.Tuple[
                    str,
                    typing.Literal[
                        "in", "not in", "<",><=", "="=" ,"="" "!=", ">=", ">", "LIKE"
                    ],
                    typing.Any,
                ]
            ]
        ],
    ] = (),
    use_cache: bool = True,
    col_order: typing.Iterable[str] = ()
) -> bigframes.dataframe.DataFrame

Turn a BigQuery table into a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).

>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")

See also: Session.read_gbq.

read_json

read_json(
    path_or_buf: typing.Union[str, typing.IO[bytes]],
    *,
    orient: typing.Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: typing.Optional[typing.Dict] = None,
    encoding: typing.Optional[str] = None,
    lines: bool = False,
    engine: typing.Literal["ujson", "pyarrow", "bigquery"] = "ujson",
    **kwargs
) -> bigframes.dataframe.DataFrame

Convert a JSON string to DataFrame object.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
   id   name
0   1  Alice
1   2    Bob
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
path_or_buf

A local or Google Cloud Storage (gs://) path with engine="bigquery" otherwise passed to pandas.read_json.

orient

If engine="bigquery" orient only supports "records". Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is: - 'split' : dict like {{index -> [index], columns -> [columns], data -> [values]}} - 'records' : list like [{{column -> value}}, ... , {{column -> value}}] - 'index' : dict like {{index -> {{column -> value}}}} - 'columns' : dict like {{column -> {{index -> value}}}} - 'values' : just the values array

dtype

If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all orient values except 'table', default is True.

encoding

The encoding to use to decode py3 bytes.

lines

Read the file as a json object per line. If using engine="bigquery" lines only supports True.

engine

Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_json.

read_pandas

read_pandas(
    pandas_dataframe: typing.Union[
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
        pandas.core.indexes.base.Index,
    ]
)

Loads DataFrame from a pandas DataFrame.

The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]
Parameter
Name Description
pandas_dataframe

a pandas DataFrame/Series/Index object to be loaded.

read_parquet

read_parquet(
    path: typing.Union[str, typing.IO[bytes]], *, engine: str = "auto"
) -> bigframes.dataframe.DataFrame

Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")
Parameters
Name Description
path

Local or Cloud Storage path to Parquet file.

engine

One of 'auto', 'pyarrow', 'fastparquet', or 'bigquery'. Parquet library to parse the file. If set to 'bigquery', order is not preserved. Default, 'auto'.

read_pickle

read_pickle(
    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,
)

Load pickled BigFrames object (or any object) from file.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
Parameters
Name Description
filepath_or_buffer

String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.

compression

For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.

storage_options

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

remote_function

remote_function(
    input_types: typing.Union[None, type, typing.Sequence[type]] = None,
    output_type: typing.Optional[type] = None,
    dataset: typing.Optional[str] = None,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,
    name: typing.Optional[str] = None,
    packages: typing.Optional[typing.Sequence[str]] = None,
    cloud_function_service_account: typing.Optional[str] = None,
    cloud_function_kms_key_name: typing.Optional[str] = None,
    cloud_function_docker_repository: typing.Optional[str] = None,
    max_batching_rows: typing.Optional[int] = 1000,
    cloud_function_timeout: typing.Optional[int] = 600,
    cloud_function_max_instances: typing.Optional[int] = None,
    cloud_function_vpc_connector: typing.Optional[str] = None,
    cloud_function_memory_mib: typing.Optional[int] = 1024,
    cloud_function_ingress_settings: typing.Literal[
        "all", "internal-only", "internal-and-gclb"
    ] = "all",
)

Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.

  1. Have the below APIs enabled for your project:

    • BigQuery Connection API
    • Cloud Functions API
    • Cloud Run API
    • Cloud Build API
    • Artifact Registry API
    • Cloud Resource Manager API

    This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

    Or from the gcloud CLI:

    $ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com

  2. Have following IAM roles enabled for you:

    • BigQuery Data Editor (roles/bigquery.dataEditor)
    • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
    • Cloud Functions Developer (roles/cloudfunctions.developer)
    • Service Account User (roles/iam.serviceAccountUser) on the service account [email protected]
    • Storage Object Viewer (roles/storage.objectViewer)
    • Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
  3. Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:

    1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
    2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function

      Alternatively, the IAM could also be setup via the gcloud CLI:

      $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker".

Parameters
Name Description
input_types

For scalar user defined function it should be the input type or sequence of input types. For row processing user defined function, type Series should be specified.

output_type

Data type of the output in the user defined function.

dataset

Dataset in which to create a BigQuery remote function. It should be in <project_id>.<dataset_name> or <dataset_name> format. If this parameter is not provided then session dataset id is used.

bigquery_connection

Name of the BigQuery connection. You should either have the connection already created in the location you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.

reuse

Reuse the remote function if already exists. True by default, which will result in reusing an existing remote function and corresponding cloud function that was previously created (if any) for the same udf. Please note that for an unnamed (i.e. created without an explicit name argument) remote function, the BigQuery DataFrames session id is attached in the cloud artifacts names. So for the effective reuse across the sessions it is recommended to create the remote function with an explicit name. Setting it to False would force creating a unique remote function. If the required remote function does not exist then it would be created irrespective of this param.

name

Explicit name of the persisted BigQuery remote function. Use it with caution, because more than one users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. When an explicit name is provided, any session specific clean up ( bigframes.session.Session.close/ bigframes.pandas.close_session/ bigframes.pandas.reset_session/ bigframes.pandas.clean_up_by_session_id) does not clean up the function, and leaves it for the user to manage the function and the associated cloud function directly.

packages

Explicit name of the external package dependencies. Each dependency is added to the requirements.txt as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.

cloud_function_service_account

Service account to use for the cloud functions. If not provided then the default service account would be used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration.

cloud_function_kms_key_name

Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key.

cloud_function_docker_repository

Docker repository created with the same encryption key as cloud_function_kms_key_name to store encrypted artifacts created to support the cloud function. This is of the format projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY_NAME. For more details see https://cloud.google.com/functions/docs/securing/cmek#before_you_begin.

max_batching_rows

The maximum number of rows to be batched for processing in the BQ remote function. Default value is 1000. A lower number can be passed to avoid timeouts in case the user code is too complex to process large number of rows fast enough. A higher number can be used to increase throughput in case the user code is fast enough. None can be passed to let BQ remote functions service apply default batching. See for more details https://cloud.google.com/bigquery/docs/remote-functions#limiting_number_of_rows_in_a_batch_request.

cloud_function_timeout

The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. See for more details https://cloud.google.com/functions/docs/configuring/timeout. Please note that even though the cloud function (2nd gen) itself allows seeting up to 60 minutes of timeout, BigQuery remote function can wait only up to 20 minutes, see for more details https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. None can be passed to let the cloud functions default timeout take effect.

cloud_function_max_instances

The maximumm instance count for the cloud function created. This can be used to control how many cloud function instances can be active at max at any given point of time. Lower setting can help control the spike in the billing. Higher setting can help support processing larger scale data. When not specified, cloud function's default setting applies. For more details see https://cloud.google.com/functions/docs/configuring/max-instances.

cloud_function_vpc_connector

The VPC connector you would like to configure for your cloud function. This is useful if your code needs access to data or service(s) that are on a VPC network. See for more details https://cloud.google.com/functions/docs/networking/connecting-vpc.

cloud_function_memory_mib

The amounts of memory (in mebibytes) to allocate for the cloud function (2nd gen) created. This also dictates a corresponding amount of allocated CPU for the function. By default a memory of 1024 MiB is set for the cloud functions created to support BigQuery DataFrames remote function. If you want to let the default memory of cloud functions be allocated, pass None. See for more details https://cloud.google.com/functions/docs/configuring/memory.

cloud_function_ingress_settings

Ingress settings controls dictating what traffic can reach the function. By default all will be used. It must be one of: all, internal-only, internal-and-gclb. See for more details https://cloud.google.com/functions/docs/networking/network-settings#ingress_settings.

reset_session

reset_session()

Start a fresh session the next time a function requires a session.

Closes the current session if it was already started, deleting any temporary tables that were created.

to_datetime

This function converts a scalar, array-like or Series to a datetime object.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Converting a Scalar to datetime:

>>> scalar = 123456.789
>>> bpd.to_datetime(scalar, unit = 's')
Timestamp('1970-01-02 10:17:36.789000')

Converting a List of Strings without Timezone Information:

>>> list_str = ["01-31-2021 14:30", "02-28-2021 15:45"]
>>> bpd.to_datetime(list_str, format="%m-%d-%Y %H:%M", utc=True)
0    2021-01-31 14:30:00+00:00
1    2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]

Converting a Series of Strings with Timezone Information:

>>> series_str = bpd.Series(["01-31-2021 14:30+08:00", "02-28-2021 15:45+00:00"])
>>> bpd.to_datetime(series_str, format="%m-%d-%Y %H:%M%Z", utc=True)
0    2021-01-31 06:30:00+00:00
1    2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]
Parameters
Name Description
arg

The object to convert to a datetime.

utc

Control timezone-related parsing, localization and conversion. If True, the function always returns a timezone-aware UTC-localized timestamp or series. If False (default), inputs will not be coerced to UTC.

format

The strftime to parse time, e.g. "%d/%m/%Y".

unit

The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number.