
Schema design best practices


This page contains information about Bigtable schema design. Before you read this page, you
should be familiar with the overview of Bigtable (/bigtable/docs/overview). The following topics
are covered on this page:

General concepts (#general-concepts): Basic concepts to keep in mind as you design your
schema.

Best practices (#best-practices): Design guidelines that apply to most use cases, broken
down by table component.

Special use cases (#special-use-cases): Recommendations for some specific use cases
and data patterns.

General concepts
Designing a Bigtable schema is different from designing a schema for a relational database. A
Bigtable schema is defined by application logic rather than by a schema definition object or
file. You can add column families to a table when you create or update the table, but columns
and row key patterns are defined by the data that you write to the table.

In Bigtable, a schema is a blueprint or model of a table, including the structure of the following
table components:

Row keys

Column families, including their garbage collection policies

Columns

Key Point: Design your schema for the queries that you plan to use.

In Bigtable, schema design is driven primarily by the queries, or read requests, that you plan to
send to the table. Because reading a row range (/bigtable/docs/reads#row-range) is the fastest
way to read your Bigtable data, the recommendations on this page are designed to help you
optimize for row range reads. In most cases, that means sending a query based on row key
prefixes.

A secondary consideration is the avoidance of hotspots. To prevent hotspots, you need to consider write patterns and how you can avoid accessing a small key space in a short amount of time.

The following general concepts apply to Bigtable schema design:

Bigtable is a key/value store, not a relational store. It does not support joins, and
transactions are supported only within a single row.

Each table has only one index, the row key. There are no secondary indexes. Each row
key must be unique.

Rows are sorted lexicographically by row key, from the lowest to the highest byte string.
Row keys are sorted in big-endian byte order (sometimes called network byte order), the
binary equivalent of alphabetical order, as illustrated in the sketch after this list.

Column families are not stored in any specific order.

Columns are grouped by column family and sorted in lexicographic order within the
column family. For example, in a column family called SysMonitor with column qualifiers
of ProcessName, User, %CPU, ID, Memory, DiskRead, and Priority, Bigtable stores the
columns in this order:

SysMonitor

%CPU DiskRead ID Memory Priority ProcessName User

The intersection of a row and column can contain multiple timestamped cells. Each cell
contains a unique, timestamped version of the data for that row and column.

All operations are atomic at the row level. An operation affects either an entire row or
none of the row.

Ideally, both reads and writes should be distributed evenly across the row space of a
table.

Bigtable tables are sparse. A column doesn't take up any space in a row that doesn't use
the column.
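
To make the sorting behavior concrete, here is a minimal, self-contained Python sketch (the keys are illustrative, not from a real table) showing that row keys compare byte by byte, so a key sorts immediately before its own extensions:

keys = [b"tablet#a0b81f74", b"phone#4c410523#20200501", b"phone#4c410523"]
sorted(keys)
# [b'phone#4c410523', b'phone#4c410523#20200501', b'tablet#a0b81f74']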

Best practices


A good schema results in excellent performance and scalability, and a poorly designed schema
can lead to a poorly performing system. Every use case is different and requires its own
design, but the following best practices apply to most use cases. Exceptions are noted.

Starting at the table level and working down to the row key level, the following sections
describe the best practices for schema design:

Tables (#tables)

Column families (#column-families)

Columns (#columns)

Rows (#rows)

Cells (#cells)

Row keys (#row-keys)

All table elements, especially row keys, should be designed with planned read requests in
mind. Check quotas and limits (/bigtable/quotas#limits-data-size) for recommended and hard size
limits for all table elements.

Because all tables in an instance are stored on the same tablets
(/bigtable/docs/overview#architecture), a schema design that results in hotspots in one table can
affect the latency of other tables in the same instance. Hotspots are caused by frequently
accessing one part of the table in a short period of time.

Tables
Store datasets with similar schemas in the same table, rather than in separate tables.

In other database systems, you might choose to store data in multiple tables based on the
subject and number of columns. In Bigtable, however, it's usually better to store all your data in
one table. You can assign a unique row key prefix to use for each dataset, so that Bigtable
stores the related data in a contiguous range of rows that you can then query by row key prefix.
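
As a rough illustration of how a prefix maps to a contiguous row range, the following Python helper (a sketch under stated assumptions, not part of any Bigtable client library) computes the start and end keys that bound every row key beginning with a given prefix:

def prefix_range(prefix: bytes):
    """Return (start_key, end_key) bounding all row keys that begin with prefix."""
    p = bytearray(prefix)
    # Drop trailing 0xff bytes, then increment the last byte to get the
    # first key that sorts after every key carrying this prefix.
    while p and p[-1] == 0xFF:
        p.pop()
    if not p:
        return prefix, b""  # an all-0xff prefix: scan to the end of the table
    p[-1] += 1
    return prefix, bytes(p)

# prefix_range(b"dataset1#") == (b"dataset1#", b"dataset1$")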

Bigtable has a limit of 1,000 tables per instance, but usually you should have far fewer tables.
Avoid creating a large number of tables for the following reasons:

Sending requests to many different tables can increase backend connection overhead,
resulting in increased tail latency.


Having multiple tables of different sizes can disrupt the behind-the-scenes load
balancing that makes Bigtable function well.

You might justifiably want a separate table for a different use case that requires a different
schema, but you shouldn't use separate tables for similar data. For example, you shouldn't
create a new table because it's a new year or you have a new customer.

Column families
Put related columns in the same column family. When a row contains multiple values that are
related to one another, it's a good practice to group the columns that contain those values in
the same column family. Group data as closely as you can so that your most frequent read
requests return just the information that you need, but no more, without requiring complex
filters.

Create up to about 100 column families per table. Creating more than 100 column families
may cause performance degradation.

Choose short names for your column families. Names are included in the data that is
transferred for each request.

Put columns that have different data retention needs in different column families. This
practice is important if you want to limit storage costs. Garbage collection policies are set at
the column family level, not at the column level. For example, if you only need to keep the most
recent version of a particular piece of data, don't store it in a column family that is set to store
1,000 versions of something else. Otherwise you're paying to store 999 cells of data that you
don't need.
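
As an example, here is a hedged sketch using the google-cloud-bigtable Python client (the project, instance, table, and column family IDs are hypothetical) that gives two column families different garbage collection policies:

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("my-table")

# Keep only the most recent cell per column in this family.
table.column_family("current_state",
                    gc_rule=column_family.MaxVersionsGCRule(1)).create()

# Keep up to 1,000 versions in a family that needs history.
table.column_family("history",
                    gc_rule=column_family.MaxVersionsGCRule(1000)).create()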

Columns
(Optional) Treat column qualifiers as data. Since you have to store a column qualifier for every
column, you can save space by naming the column with a value. As an example, consider a
table that stores data about friendships in a Friends column family. Each row represents a
person and all their friendships. Each column qualifier can be the ID of a friend. Then the value
for each column in that row can be the social circle the friend is in. In this example, rows might
look like this:

Row key   Column qualifier:value   Column qualifier:value   Column qualifier:value

Jose      Fred:book-club           Gabriel:work             Hiroshi:tennis

Sofia     Hiroshi:work             Seo Yoon:school          Jakob:chess-club

Contrast this schema with a schema for the same data that doesn't treat column qualifiers as
data and instead has the same columns in every row:

Row key   Column qualifier:value   Column qualifier:value

Jose#1    Friend:Fred              Circle:book-club

Jose#2    Friend:Gabriel           Circle:work

Jose#3    Friend:Hiroshi           Circle:tennis

Sofia#1   Friend:Hiroshi           Circle:work

Sofia#2   Friend:Seo Yoon          Circle:school

Sofia#3   Friend:Jakob             Circle:chess-club

The second schema design causes the table to grow much faster.
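
Under the first schema, writing one person's friendships is a single-row mutation. Here is a hedged sketch with the Python client, reusing the table object from the earlier sketch (the row key, family, and values come from the example above):

row = table.direct_row(b"Jose")
row.set_cell("Friends", b"Fred", b"book-club")
row.set_cell("Friends", b"Gabriel", b"work")
row.set_cell("Friends", b"Hiroshi", b"tennis")
row.commit()  # one atomic write covering all of Jose's friendships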

If you're using column qualifiers to store data (#columns), give column qualifiers short but
meaningful names. This approach reduces the amount of data that is transferred for each
request. The maximum size of a column qualifier is 16 KB.

Create as many columns as you need in the table. Bigtable tables are sparse, and there's no
space penalty for a column that is not used in a row. You can have millions of columns in a
table, as long as no row exceeds the maximum limit of 256 MB per row.

Avoid using too many columns in any single row. Even though a table can have millions of
columns, a row shouldn't. A few factors contribute to this best practice:

It takes time for Bigtable to process each cell in a row.

Each cell adds some overhead to the amount of data that's stored in your table and sent
over the network. For example, if you're storing 1 KB (1,024 bytes) of data, it's much more
space-efficient to store that data in a single cell, rather than spreading the data across
1,024 cells that each contain 1 byte.

If your dataset logically requires more columns per row than Bigtable can process efficiently,
consider storing the data as a protobuf in a single column (#query-flux).


Rows
Keep the size of all values in a single row under 100 MB. Make sure that data in a single row
doesn't exceed 256 MB. Rows that exceed this limit can result in reduced read performance.

Keep all information for an entity in a single row. For most use cases, avoid storing data that
you must read atomically, or all at once, in more than one row to avoid inconsistencies. For
example, if you update two rows in a table, it's possible that one row will be updated
successfully and the other update will fail. Make sure your schema does not require more than
one row to be updated at the same time for related data to be accurate. This practice ensures
that if part of a write request fails or must be sent again, that piece of data is not temporarily
incomplete.
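
To sketch the difference in Python (reusing the table object from earlier sketches; the keys, family, and values are hypothetical): a single commit to one row is atomic, while the same data split across two rows needs two commits that can fail independently:

# Atomic: both cells land together or not at all.
row = table.direct_row(b"account#1234")
row.set_cell("balances", b"checking", b"100")
row.set_cell("balances", b"savings", b"250")
row.commit()

# Not atomic: two rows means two commits, and the second can fail after
# the first succeeds, leaving related data temporarily inconsistent.
row_a = table.direct_row(b"account#1234#checking")
row_a.set_cell("balances", b"amount", b"100")
row_a.commit()
row_b = table.direct_row(b"account#1234#savings")
row_b.set_cell("balances", b"amount", b"250")
row_b.commit()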

Exception: If keeping an entity in a single row results in rows that are hundreds of MB, you
should split the data across multiple rows.

Store related entities in adjacent rows, to make reads more efficient.

Cells
Don't store more than 10 MB of data in a single cell. Recall that a cell is the data stored for a
given row and column with a unique timestamp, and that multiple cells can be stored at the
intersection of that row and column. The number of cells retained in a column is governed by
the garbage collection policy (/bigtable/docs/garbage-collection) that you set for the column
family that contains that column.

Row keys
Design your row key based on the queries you will use to retrieve the data. Well-designed row
keys get the best performance out of Bigtable. The most efficient Bigtable queries retrieve data
using one of the following:

Row key

Row key prefix

Range of rows defined by starting and ending row keys

Other types of queries trigger a full table scan, which is much less efficient. By choosing the
correct row key now, you can avoid a painful data migration process later.


Keep your row keys short. A row key must be 4 KB or less. Long row keys take up additional
memory and storage and increase the time it takes to get responses from the Bigtable server.

Store multiple delimited values in each row key. Because the best way to query Bigtable
efficiently is by row key, it's often useful to include multiple identifiers in your row key. When
your row key includes multiple values, it's especially important to have a clear understanding of
how you use your data.

Row key segments are usually separated by a delimiter, such as a colon, slash, or hash symbol.
The first segment or set of contiguous segments is the row key prefix, and the last segment or
set of contiguous segments is the row key suffix.

Well-planned row key prefixes let you take advantage of Bigtable's built-in sorting order to store
related data in contiguous rows. Storing related data in contiguous rows lets you access
related data as a range of rows, rather than running inefficient table scans.

If your data includes integers that you want to store or sort numerically, pad the integers with
leading zeroes. Bigtable stores data lexicographically (#general-concepts). For example,
lexicographically, 3 > 20 but 20 > 03. Padding the 3 with a leading zero ensures that the
numbers are sorted numerically. This tactic is important for timestamps where range-based
queries are used.
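
A quick Python illustration of why the padding matters under lexicographic (string) comparison:

sorted(["3", "20"])    # ['20', '3']   -- unpadded, so 3 sorts after 20
sorted(["03", "20"])   # ['03', '20']  -- padded, so numeric order is preserved
f"{3:04d}"             # '0003'        -- pad to a fixed width before writing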

It's important to create a row key that makes it possible to retrieve a well-defined range of
rows. Otherwise, your query requires a table scan, which is much slower than retrieving
specific rows.

For example, if your application tracks mobile device data, you can have a row key that
consists of device type, device ID, and the day the data is recorded. Row keys for this data
might look like this:

phone#4c410523#20200501
phone#4c410523#20200502
tablet#a0b81f74#20200501
tablet#a0b81f74#20200502

This row key design lets you retrieve data with a single request for:

A device type

A combination of device type and device ID
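
For example, here is a hedged sketch with the Python client (project, instance, and table IDs are hypothetical) that reads all rows for one device with a single range request. Because '$' is the byte that follows '#', the half-open range below covers every key with the prefix phone#4c410523#:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("mobile-device-data")

for row in table.read_rows(start_key=b"phone#4c410523#",
                           end_key=b"phone#4c410523$"):
    print(row.row_key)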

This row key design would not be optimal if you want to retrieve all data for a given day.
Because the day is stored in the third segment, or the row key suffix, you cannot just request a
range of rows based on the suffix or a middle segment of the row key. Instead, you have to
send a read request with a filter that scans the entire table looking for the day value.

Use human-readable string values in your row keys whenever possible. This practice makes it
easier to use the Key Visualizer tool (/bigtable/docs/keyvis-overview) to troubleshoot issues with
Bigtable.

Often, you should design row keys that start with a common value and end with a granular
value. For example, if your row key includes a continent, country, and city, you can create row
keys that look like the following so that they automatically sort first by values with lower
cardinality:

asia#india#bangalore
asia#india#mumbai
asia#japan#okinawa
asia#japan#sapporo
southamerica#bolivia#cochabamba
southamerica#bolivia#lapaz
southamerica#chile#santiago
southamerica#chile#temuco

Note: In production, you would probably use identifiers that take up less storage, like AS for Asia or 591 for
Bolivia.

Row keys to avoid

Some types of row keys can make it difficult to query your data, and some result in poor
performance. This section describes some types of row keys that you should avoid using in
Bigtable.

Row keys that start with a timestamp. This pattern causes sequential writes to be pushed onto
a single node, creating a hotspot
(/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting). If you put a
timestamp in a row key, precede it with a high-cardinality value like a user ID to avoid hotspots.

Row keys that cause related data to not be grouped. Avoid row keys that cause related data to
be stored in non-contiguous row ranges, which are inefficient to read together.

Sequential numeric IDs. Suppose that your system assigns a numeric ID to each of your
application's users. You might be tempted to use the user's numeric ID as the row key for your
table. However, because new users are more likely to be active users, this approach is likely to
push most of your traffic to a small number of nodes.

A safer approach is to use a reversed version of the user's numeric ID, which spreads traffic
more evenly across all of the nodes for your Bigtable table.
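
One way to reverse an ID, as a hedged Python sketch (the fixed width is an assumption so that all keys stay the same length):

def reversed_id(user_id: int, width: int = 10) -> str:
    """Zero-pad, then reverse the digits: 1234 -> '4321000000'."""
    return f"{user_id:0{width}d}"[::-1]

# Sequential IDs now scatter across the key space:
# reversed_id(1000001) -> '1000001000'
# reversed_id(1000002) -> '2000001000'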

Frequently updated identifiers. Avoid using a single row key to identify a value that must be
updated frequently. For example, if you store memory-usage data for a number of devices once
per second, do not use a single row key for each device that is made up of the device ID and
the metric being stored, such as 4c410523#memusage, and update the row repeatedly. This type
of operation overloads the tablet that stores the frequently used row. It can also cause a row to
exceed its size limit, because a column's previous values take up space until the cells are
removed during garbage collection.

Instead, store each new reading in a new row. Using the memory usage example, each row key
can contain the device ID, the type of metric, and a timestamp, so the row keys are similar to
4c410523#memusage#1423523569918. This strategy is efficient because in Bigtable, creating a
new row takes no more time than creating a new cell. In addition, this strategy lets you quickly
read data from a specific date range by calculating the appropriate start and end keys.

For values that change frequently, such as a counter that is updated hundreds of times each
minute, it's best to keep the data in memory, at the application layer, and write new rows to
Bigtable periodically.

Hashed values. Hashing a row key removes your ability to take advantage of Bigtable's natural
sorting order, making it impossible to store rows in a way that is optimal for querying. For the
same reason, hashing values makes it challenging to use the Key Visualizer tool to
troubleshoot issues with Bigtable. Use human-readable values instead of hashed values.


Values expressed as raw bytes rather than human-readable strings. Raw bytes are fine for
column values, but for readability and troubleshooting, use string values in row keys.

Special use cases


You may have a unique dataset that requires special consideration when designing a schema
to store it in Bigtable. This section describes some, though not all, of the common types of
Bigtable data, along with suggested tactics for storing each in the most optimal way.

Time-based data
Include a timestamp as part of your row key if you often retrieve data based on the time when
it was recorded.

For example, your application might record performance-related data, such as CPU and
memory usage, once per second for many machines. Your row key for this data could combine
an identifier for the machine with a timestamp for the data (for example,
machine_4223421#1425330757685). Keep in mind that row keys are sorted lexicographically
(#row-keys).

Don't use a timestamp by itself or at the beginning of a row key, because this will cause
sequential writes to be pushed onto a single node, creating a hotspot. In this case, you need to
consider write patterns as well as read patterns.

If you usually retrieve the most recent records first, you can use a reversed timestamp in the
row key by subtracting the timestamp from your programming language's maximum value for
long integers (in Java, java.lang.Long.MAX_VALUE). With a reversed timestamp, the records
will be ordered from most recent to least recent.
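
A minimal Python sketch of this technique (the machine ID matches the example above; padding the result to 19 digits is an assumption so that lexicographic and numeric order agree):

import time

JAVA_LONG_MAX = 2**63 - 1  # java.lang.Long.MAX_VALUE

def reversed_ts_micros() -> int:
    now_micros = int(time.time() * 1_000_000)
    return JAVA_LONG_MAX - now_micros

row_key = f"machine_4223421#{reversed_ts_micros():019d}"
# Newer readings produce smaller reversed values, so they sort first.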

Note: A timestamp is usually the number of microseconds since 1970-01-01 00:00:00 UTC.

For information specifically about working with time series data, see Schema design for time
series data (/bigtable/docs/schema-design-time-series).

Multi-tenancy


Row key prefixes provide a scalable solution for a "multi-tenancy" use case, a scenario in which
you store similar data, using the same data model, on behalf of multiple clients. Using one
table for all tenants is the most efficient way to store and access multi-tenant data.

For example, say you store and track purchase histories on behalf of many companies. You
can use your unique ID for each company as a row key prefix. All data for a tenant is stored in
contiguous rows in the same table, and you can query or filter using the row key prefix. Then,
when a company is no longer your customer and you need to delete the purchase history data
you were storing for the company, you can drop the range of rows
(/bigtable/docs/reference/admin/rpc/google.bigtable.admin.v2#google.bigtable.admin.v2.BigtableTableAdmin.DropRowRange)
that use that customer's row key prefix.

For example, if you are storing mobile device data for customers altostrat and
examplepetstore, you can create row keys like the following. Then, if altostrat is no longer
your customer, you drop all rows with the row key prefix altostrat.

altostrat#phone#4c410523#20190501
altostrat#phone#4c410523#20190502
altostrat#tablet#a0b41f74#20190501
examplepetstore#phone#4c410523#20190502
examplepetstore#tablet#a6b81f79#20190501
examplepetstore#tablet#a0b81f79#20190502
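
With the Python client, dropping a former tenant's rows is a single admin call, sketched below under the assumption that the table object was created with an admin client, as in earlier sketches:

# Delete every row whose key starts with the former customer's prefix.
table.drop_by_prefix(b"altostrat")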

In contrast, if you store data on behalf of each company in its own table, you can experience
performance and scalability issues. You are also more likely to inadvertently reach Bigtable's
limit of 1,000 tables per instance. After an instance reaches this limit, Bigtable prevents you
from creating more tables in the instance.

Privacy
Unless your use case demands it, avoid using personally identifiable information (PII) or user
data in row keys or column family IDs. The values in row keys and column families are both
customer data and service data, and applications that use them, like encryption or logging, can
inadvertently expose them to users who shouldn't have access to private data.

For more information about how service data is handled, see the Google Cloud Privacy Notice
(/terms/cloud-privacy-notice).


Domain names
Wide range of domain names

If you're storing data about entities that can be represented as domain names, consider using
a reverse domain name (for example, com.company.product) as the row key. Using a reverse
domain name is an especially good idea if each row's data tends to overlap with adjacent rows.
In this case, Bigtable can compress your data more efficiently.

In contrast, standard domain names that are not reversed can cause rows to be sorted in such
a way that related data is not grouped together in one place, which can result in less efficient
compression and less efficient reads.

This approach works best when your data is spread across many different reverse domain
names.

To illustrate this point, consider the following domain names, automatically sorted in
lexicographic order by Bigtable:

drive.google.com
en.wikipedia.org
maps.google.com

This is undesirable for the use case where you want to query all rows for google.com. In
contrast, consider the same rows where the domain names have been reversed:

com.google.drive
com.google.maps
org.wikipedia.en

In the second example, the related rows are automatically sorted in a way that makes it easy to
retrieve them as a range of rows.
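
A minimal Python helper for producing reversed keys (an illustrative sketch, not part of any client library):

def reverse_domain(domain: str) -> str:
    """maps.google.com -> com.google.maps"""
    return ".".join(reversed(domain.split(".")))

sorted(reverse_domain(d) for d in
       ["drive.google.com", "en.wikipedia.org", "maps.google.com"])
# ['com.google.drive', 'com.google.maps', 'org.wikipedia.en']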

Few domain names

If you expect to store a lot of data for only one or a small number of domain names, consider
other values for your row key. Otherwise, you might push writes to a single node in your cluster,
resulting in hotspots, or your rows might grow too large.

Changing or uncertain queries


If you don't always run the same queries on your data, or you are unsure what your queries will
be, one option is to store all the data for a row in one column instead of multiple columns. With
this approach, you use a format that makes it easy to extract the individual values later, such
as the protocol buffer (https://developers.google.com/protocol-buffers/docs/overview) binary
format or JSON.

The row key is still carefully designed to make sure you can retrieve the data you need, but
each row typically has only one column that contains all the data for the row in a single
protobuf.
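
Here is a hedged sketch of the pattern with the Python client, reusing the table object from earlier sketches and using JSON instead of a protobuf so the example stays self-contained (the family and qualifier names are hypothetical):

import json

reading = {"cpu": 0.42, "memory_mb": 2048, "disk_read_kbps": 310}

row = table.direct_row(b"machine_4223421#1425330757685")
# One column holds the whole serialized record; readers deserialize it.
row.set_cell("data", b"payload", json.dumps(reading).encode("utf-8"))
row.commit()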

Storing data as a protobuf message in one column instead of spreading the data into multiple
columns has advantages and disadvantages. Advantages include the following:

The data takes up less space, so it costs you less to store it.

You maintain a certain amount of flexibility by not committing to column families and
column qualifiers.

Your reading application does not need to "know" what your table schema is.

Some disadvantages are the following:

You have to deserialize the protobuf messages after they are read from Bigtable.

You lose the option to query the data in protobuf messages using filters.

You can't use BigQuery to run federated queries on fields within protobuf messages after
reading them from Bigtable.

What's next
Learn how to design a schema for time-series data
(/bigtable/docs/schema-design-time-series).

Review the steps involved in planning a schema (/bigtable/docs/schema-design-steps).

Review the types of write requests (/bigtable/docs/writes) you can send to Bigtable.


Review the applicable quotas and limits (/bigtable/quotas).

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License
(https://creativecommons.org/licenses/by/4.0/), and code samples are licensed under the Apache 2.0 License
(https://www.apache.org/licenses/LICENSE-2.0). For details, see the Google Developers Site Policies
(https://developers.google.com/site-policies). Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-03-13 UTC.
