UUID or GUID as Primary Keys? Be Careful!
Tom Harrison
Feb 12, 2017
I just read a post on ways to scale your database that hit home with me —
the author suggests the use of UUIDs (similar to GUIDs) as the primary
key (PK) of database tables.
Don’t be naive
A naive use of a UUID, which might look like
70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat it as a string,
e.g. varchar(36). Don’t do that!!
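To make the cost concrete, here is a minimal sketch comparing the text form of the example UUID with its binary form. Python and its standard `uuid` module are my choice for illustration; the article itself names no language.

```python
import uuid

# The example UUID from the text, parsed into its 128-bit value.
u = uuid.UUID("70E2E8DE-500E-4630-B3CB-166131D35C21")

text_form = str(u)      # canonical 36-character string (hex digits plus hyphens)
binary_form = u.bytes   # the same value as 16 raw bytes

print(len(text_form))                      # 36
print(len(binary_form))                    # 16
print(len(text_form) / len(binary_form))   # 2.25

# The conversion is lossless in both directions.
assert uuid.UUID(bytes=binary_form) == u
```

In a single-byte character set the string form costs 36 bytes per key before any index overhead, and every foreign key and index entry that references it pays the same price.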
“Oh, pshaw”, you say, “no one would ever do such a thing.”
Things got really bad in one company where they had originally decided
to use the Latin-1 character set. When we converted to UTF-8, several of
the compound-key indexes were not big enough to contain the larger
strings. Doh!
Not just on disk but during joins and sorts these keys need to live in
memory. Memory is getting cheaper, but whether disk or RAM, it’s
limited. And neither is free.
The original issue with simple auto-incrementing values is that they are
easily guessable as I noted above. Botnets will just keep guessing until
they find one. (And they may keep guessing if you use UUIDs, but the
chance of a correct guess is astronomically lower).
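A back-of-the-envelope sketch of "astronomically lower". The figures for issued IDs and attacker guesses are invented for illustration, and the model assumes random version-4 UUIDs (122 random bits) with uniformly random guesses. Sequential integers are in practice even easier to hit than this model suggests.

```python
# Hypothetical numbers, chosen only for illustration.
issued_ids = 1_000_000_000        # IDs that actually exist in the system
guesses = 1_000_000_000_000       # guesses an attacker manages to make

int32_space = 2 ** 31             # positive range of a signed 4-byte integer
uuid4_space = 2 ** 122            # a version-4 UUID carries 122 random bits


def expected_hits(space: int) -> float:
    """Expected number of guesses that land on an existing ID, assuming
    both the IDs and the guesses are uniformly random over `space`."""
    return guesses * issued_ids / space


print(expected_hits(int32_space))   # roughly 4.7e11, close to half of all guesses
print(expected_hits(uuid4_space))   # roughly 1.9e-16, effectively never
```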
Arguably it would be a fool’s errand to try to guess a UUID. However,
Microsoft warns against using newsequentialid() because, by mitigating
the clustering issue, it makes the key more guessable.
When would you need to change keys? As it happens, we’re doing a data
migration this week, because who knew in 2003 when the company
started that we would now have 13 massive SQL Server databases and
growing fast?
Never say “never”. I have been there and done that, and it has happened
several times just for me. It’s easy to manage up front. It’s way harder to
fix when you’re counting things in the trillions.
This way, if you ever do have to change your internal primary keys, you
can be sure it’s scoped only to one database. (Note: this is just plain
wrong, as Chris observed)
In another case, we would generate a “slug” of text (e.g. in blog posts like
this one) that would make the URL a little more human friendly. If we
had a duplicate, we would just append a hashed value.
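A minimal sketch of the slug idea. The text does not say what was hashed, so hashing the record's ID is an assumption here, and the function names are hypothetical.

```python
import hashlib
import re


def slugify(title: str) -> str:
    """Lowercase the title and join runs of letters/digits with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def unique_slug(title: str, record_id: int, taken: set[str]) -> str:
    """Return a slug for `title`; on a clash, append a short hash.

    Assumption: the hash is derived from the record's ID.
    """
    slug = slugify(title)
    if slug in taken:
        suffix = hashlib.sha256(str(record_id).encode()).hexdigest()[:12]
        slug = f"{slug}-{suffix}"
    taken.add(slug)
    return slug


taken: set[str] = set()
print(unique_slug("UUID or GUID as Primary Keys? Be Careful!", 1, taken))
# uuid-or-guid-as-primary-keys-be-careful
print(unique_slug("UUID or GUID as Primary Keys? Be Careful!", 2, taken))
# the same slug followed by a hyphen and a 12-character hex suffix
```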
Even as a “secondary primary key”, a naive use of UUIDs in string form
is wrong: use the built-in database mechanisms, which store the value in
a native binary form. (Note: I originally expected 8-byte integers; a
UUID is 128 bits, so a native type needs 16 bytes.)
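One way to picture the “secondary primary key” arrangement is the sketch below, using Python's built-in `sqlite3` module. SQLite has no native UUID type, so a 16-byte BLOB stands in for one; the table and column names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id        INTEGER PRIMARY KEY,   -- compact internal key, used for joins
        public_id BLOB NOT NULL UNIQUE,  -- 16-byte UUID, the only ID shown outside
        name      TEXT NOT NULL
    )
    """
)


def add_customer(name: str) -> uuid.UUID:
    """Insert a row and return the UUID that may be exposed externally."""
    public_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO customer (public_id, name) VALUES (?, ?)",
        (public_id.bytes, name),
    )
    return public_id


def find_customer(public_id: uuid.UUID):
    """Look a row up by its external UUID; returns (internal id, name) or None."""
    return conn.execute(
        "SELECT id, name FROM customer WHERE public_id = ?",
        (public_id.bytes,),
    ).fetchone()


pid = add_customer("Ada")
print(find_customer(pid))   # (1, 'Ada')
```

Foreign keys in other tables would reference the small integer `id`, while URLs and API responses carry only `public_id`.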
Then I came across an interesting post by Starr, which got me thinking
that his advice might have unintended outcomes. So I googled and learned
way more about UUIDs than I knew before, and changed my fundamental
understanding and disposition about how and when to use them.
Halfway through writing this, I sent an email to the team leads at my
company asking whether we had considered one of the topics I discussed.
Hopefully we’re OK, but I think we may have avoided at least one
surprise in code scheduled for release this week.
Responses
Conversation between Chris Russell and Tom Harrison.
Chris Russell
Jun 10, 2017
Thank you for taking the time to share your experiences… this is a valuable topic.
There are a couple of points which I think should be made.
1. On security: UUIDs are intended to be unique, not secure. They are not
considered `hard-to-guess`. Harder to manually increment? Sure… harder for a
script? Maybe. But, depending on the…
Tom Harrison
Jun 11, 2017
Conversation between Bit-Booster! and Jim E. Rustle.
Bit-Booster!
Jun 9, 2017
The 4th reason UUID / GUID’s are good is really important: it allows users to change
their names, their SIN numbers, their email addresses, and your system can handle
that if you key with UUID / GUID.
Yes, don’t store them as Strings in a DB. Store them as BINARY(16) or VARBINARY if
your db supports those column types.
Jim E. Rustle
Jun 9, 2017
“it allows users to change their names, their SIN numbers, their email addresses, and
your system can handle that if you key with UUID / GUID.”
Not sure I follow, what do names, email addresses, or in other words, non-primary key
columns, have to do with the primary key?
Conversation between Jonathan Garbee and Tom Harrison.
Jonathan Garbee
Oct 12, 2017
If this leads to a data leak, it means the web application isn’t engineered properly for
security in the first place. An access control list should be in place to prevent
unauthorized access to resources. There is no security vulnerability using PKs in
public if your application is securely engineered.
Tom Harrison
Oct 12, 2017
Hi Jonathan —
Thanks for your response. To be sure, hiding a sequence id behind a GUID is merely
one step that is really not much more than “security by obscurity” and proper security
is foundational. I think the security case here is more around information leaking —
that is, exposing the internals of your application may…
Conversation between Nicolas Grilly and Tom Harrison.
Nicolas Grilly
Jun 9, 2017
the 9x cost
Tom Harrison
Jun 11, 2017
Conversation between Nicolas Froidure and Tom Harrison.
Nicolas Froidure
Feb 19, 2017
By combining UUIDs and Auto Incremented Primary Keys, you lose the advantage of
generating ids on the client side (to perform PUTs instead of POSTs) which is the
main reason why people use UUIDs.
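To illustrate the point about PUTs, here is a small sketch with an in-memory dictionary standing in for the server. The `put_resource` function and the store are hypothetical, not any real API.

```python
import uuid

# Hypothetical in-memory stand-in for a server-side resource collection.
store: dict[uuid.UUID, dict] = {}


def put_resource(resource_id: uuid.UUID, body: dict) -> str:
    """Create or replace the resource at a client-chosen ID (PUT semantics)."""
    status = "updated" if resource_id in store else "created"
    store[resource_id] = body
    return status


# The client picks the ID itself, with no round trip to ask the server for one.
new_id = uuid.uuid4()
print(put_resource(new_id, {"title": "draft"}))   # created
# Retrying the same request (say, after a timeout) does not create a duplicate.
print(put_resource(new_id, {"title": "draft"}))   # updated
print(len(store))                                  # 1
```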
Tom Harrison
Feb 20, 2017
At first this is what I thought, and am quite possibly wrong, but as I looked at it, SQL
Server, at least, implements the sequential UUID as a function, which I assumed
means you could call the function (e.g. select newsequentialid()) then use the
returned value in subsequent inserts. Because it is still a GUID … right? … it should be
guaranteed…
Nicolas Froidure
Feb 20, 2017
By client I meant the HTTP API client, an additional round trip to the server before
generating a resource is probably not an option (apart from maintaining a pool of
directly usable ids in the frontend application but it adds a lot of complexity).
Thank you too for your helpful posts ;).
Alec Zopf
Jun 10, 2017
It is probable that there are similar solutions for other databases, certainly
PostgreSQL, MySQL and likely the rest.
Ricardo Puerto
Sep 8, 2017
2 billion
Jens Alfke
Jun 10, 2017
Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely
on collation rules.
Where did you get “9x cost in size” from? Hex strings are 2x as large as binary. The
unnecessary hyphens add a little bit, but it’s still only 36 bytes when it could be 16,
which is 2.25x.
I don’t think there’s much overhead for string vs numeric sorting if you’re using a
default collation rule; maybe if your db uses Unicode…
Conversation between Alex Maslakov and Tom Harrison.
Alex Maslakov
Jun 10, 2017
Don’t!
I will!
Tom Harrison
Jun 11, 2017
Pierre Phaneuf
May 3, 2018
This problem can be an advantage, when using a distributed database, like Cosmos DB
or Cloud Spanner, where you’d want your primary keys to be uniformly distributed, to
avoid a partition becoming a “hot spot”.
Bruno Brant
Jun 10, 2017
Sorry.
João Almeida
Sep 8, 2017
Like when people don’t want to move their data to another database, until they do. I’m
a big, big fan of not putting any business meaning on top primary keys. Couldn’t agree
more with this section.
Conversation between 張旭 and Tom Harrison.
張旭
Jun 10, 2017
we would generate a “slug” of text (e.g. in blog posts like this one) that would make
the URL a little more human friendly.
“/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439”
Tom Harrison
Jun 11, 2017
Exactly :-)
Alex Nishikawa
Dec 13, 2017
I keep coming back to this article (and the comments section). Thanks for writing this
Tom.
Conversation between Julien Desrosiers and Tom Harrison.
Julien Desrosiers
Jun 9, 2017
When would you need to change keys? As it happens, we’re doing a data migration
this week, because who knew in 2003 when the company started that we would now
have 13 massive SQL Ser...
I’m curious. What kind of data migration did you do for you to need to change the
columns’ UUIDs in that particular case? Why not just dump the data and re-INSERT
it elsewhere while keeping the same UUIDs?
Thanks
Tom Harrison
Jun 10, 2017
The problem is simple: we don’t use UUIDs. If we had used them, it would be as trivial
as you say. Because we use auto-incremented integer (or bigint) values for keys the
process of moving is impossibly complex. (Sigh)
Conversation between Mingwei Zhang and Tom Harrison.
Mingwei Zhang
Jun 9, 2017
I enjoyed this section in particular. Always wondering why should people blog about
anything.
Thank you for sharing!
Tom Harrison
Jun 11, 2017
Most other bloggers do it for the fame, or just the amazing flood of cash it generates.
I haven’t yet achieved these heights of excellence (but I have only been trying for a
decade or so, so I ain’t givin’ up).
For now, I settle for the benefits of forcing me to think through problems that are…
Conversation between Carlos Henrique Romano and Tom Harrison.
Thanks for posting. I wonder what the problem is with an approach like YouTube's.
You would use an internal (as in "not exposed") alphabet and convert from an integer
to a string (which you can make public) and vice versa. Let's say the integer ID 12345
would be converted to something like aXpct; when you receive a request with…
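A rough sketch of the kind of mapping described in this comment: an integer is written in base 62 using a privately shuffled alphabet. This is not YouTube's actual scheme, and the seed is a made-up placeholder for a secret.

```python
import random
import string

# A private, shuffled alphabet. The seed is a made-up stand-in for a secret
# that would be kept out of source control.
_chars = list(string.ascii_letters + string.digits)
random.Random(20170611).shuffle(_chars)
ALPHABET = "".join(_chars)
BASE = len(ALPHABET)   # 62


def encode(n: int) -> str:
    """Write a non-negative integer in base 62 using the private alphabet."""
    if n < 0:
        raise ValueError("only non-negative IDs are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(token: str) -> int:
    """Inverse of encode(); raises ValueError on characters outside the alphabet."""
    n = 0
    for ch in token:
        n = n * BASE + ALPHABET.index(ch)
    return n


token = encode(12345)
print(token, decode(token))   # a 3-character token, then 12345
```

This is obfuscation rather than encryption: consecutive integers still produce tokens that differ only in their last characters, so it hides the raw number without making neighbouring IDs hard to find.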
Tom Harrison
Jun 11, 2017
I think it’s important to distinguish how a value is represented versus how it is stored.
Of the UUID solutions I researched, all store the value internally as an integer which
can be encoded and decoded as a hash-like value, then store the hash as its decimal
numeric value. The utility of hashing is that the algorithm can be forced to produce
an…
Do read the comments (and today’s clarifications) about the risk of exposing UUIDs,
or even hashed values both from a security or sharding standpoint
No doubt internal information should not be exposed, you explained the reasons why;
but you need some way to map the external value to the internal one, and what I'm
saying here is that the YouTube approach sounds like a pretty good (and simple!) one.
Conversation between Clifford Hammerschmidt and Tom Harrison.
Clifford Hammerschmidt
May 3, 2018
Tom Harrison
May 4, 2018
I like the approach of pg_tuid — you get natural clustering from the timestamp
component and it’s atomic.
But I wonder if the notion of data locality in the index is really a problem these days? I
think SSD drives changed this equation dramatically.
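For readers unfamiliar with timestamp-prefixed IDs, here is a generic sketch of the idea. It is not pg_tuid's actual layout, which is not described here; the 48-bit millisecond prefix plus 80 random bits is an assumption for illustration, and the result does not set the standard UUID version bits.

```python
import os
import time
import uuid


def time_prefixed_id() -> uuid.UUID:
    """128-bit ID: 48-bit millisecond timestamp followed by 80 random bits."""
    millis = time.time_ns() // 1_000_000
    prefix = (millis & ((1 << 48) - 1)).to_bytes(6, "big")
    return uuid.UUID(bytes=prefix + os.urandom(10))


first = time_prefixed_id()
time.sleep(0.05)                     # make sure the millisecond prefix advances
second = time_prefixed_id()

# Later IDs compare greater byte-for-byte, so new rows land at the end of an index.
print(first.bytes < second.bytes)    # True
```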
Conversation between Allan Wind and Tom Harrison.
Allan Wind
Jun 11, 2017
The definition of a primary key is an identity that never changes. If the int
primary key changes, then presumably so would the uuid key, which is what you
expose externally and still problematic. What am I missing? To me it means you have
a schema problem: either the wrong key or possibly the wrong table layout. One system
I used has a redirect feature…
Tom Harrison
Jun 11, 2017
Agree on definition of PK — never changes. And yes, systems like WordPress that use
slug-based URLs implement the redirect with a secondary table as you suggest (and so
did a system a company I was at some years ago). This allows the author to change the
title while ensuring all copies still point to the correct page. Most modern systems,
including…
Conversation with Tom Harrison.
Danielo Rodríguez
Sep 9, 2017
I think you are restricting your story to relational SQL databases, but anyway.
MongoDB has its own implementation of UUIDs, the BSON type which is used as
default for the “primary-key” _id. It can be represented as a string, sure, but it is not
saved as a string, and it is easily sortable. In fact, it includes some extra…
Tom Harrison
Sep 10, 2017
Wow, totally great perspective! Yes, this conversation is completely focused on the
relational DB front, and yes, the problems that RDBMS solve create a far different set
of design constraints than document-oriented databases like Mongo.
I wrote this article when I first understood how the company I was with had an…
Conversation between Pierre Phaneuf and Tom Harrison.
Pierre Phaneuf
May 3, 2018
Not just on disk but during joins and sorts these keys need to live in memory.
If you’re really, REALLY going to scale, then these just won’t fit in memory anyway.
Tom Harrison
May 4, 2018
I could perhaps have changed my language to say that there’s a performance benefit to
having the index in memory vs. spilled to disk, even if not always possible.
Today, there’s almost no case now where more memory is not an option — we’re
running jobs on Hadoop that have access to several terabytes of memory as needed…
Conversation with Tom Harrison.
Alexey Migutsky
Jun 11, 2017
Tom Harrison
Jun 11, 2017
I am sure I agree — thanks for the video post which I’ll check out later.
I think the size, and the few bytes, and string versus native type all only begin to
matter at scale. We have a database that is highly normalized — pretty much all
repeating data is extracted to a separate table and referenced by FK in the parent…
Conversation between Malcolm Hall and Tom Harrison.
Malcolm Hall
May 8, 2017
I’ve tried combining them but there were issues with join overhead, cascade deletes,
enforcing uniqueness of the identifier. After fighting the database I ended up going
back to using a string key.
Tom Harrison
Sep 10, 2017
Wow, interesting. So do you mean having some strings, some native UUID type, some
integer?
The uniqueness constraint is funny, in a way. Here’s a value whose scope is proximate
to infinity, but also expected to have a trivially small case of key collisions. A trillion or
so ain’t what it used to be!
Conversation with Tom Harrison.
Nate Bessa
Apr 30, 2018
So, in reality, the right solution is probably: use UUIDs for keys, and don’t ever
expose them
Tom Harrison
May 4, 2018
It’s a year or so after I have written the article, but I think my recommendations would
be:
if you have a large database or some other reason to physically segment or relocate
data to a separate database instance (or will have this), you should use a GUID as
the PK.
For security it’s best not to…
Conversation with Tom Harrison.
Kayvan Arianpour
May 24, 2018
Tom Harrison
May 24, 2018
“over-normalize” … when I was learning the theory and practice of 3rd normal form,
the rule I tended to follow was “normalize first, de-normalize only when needed”. At
my current company we have a database schema that scrupulously followed this
approach, and I can say after 15 years, this strategy holds up well every day for us.
Conversation with Tom Harrison.
Jens Alfke
Jun 10, 2017
because UUIDs are random, they have no natural ordering so cannot be used for
clustering.
This is a very weird statement, considering that using UUIDs as keys is extremely
common in NoSQL databases that use clustering, for example Couchbase Server. It’s
typical in such a database, as in a DHT, to map records to shards by hashing the key,
so there’s no need for any “natural ordering”. In fact, ordering would mess up the
distribution by…
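The hash-based placement described in this comment can be sketched as follows. The shard count and the choice of SHA-256 are assumptions for illustration and are not taken from Couchbase or any other specific database.

```python
import hashlib
import uuid
from collections import Counter

NUM_SHARDS = 8   # hypothetical cluster size


def shard_for(key: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard by hashing its bytes, ignoring any ordering in the key."""
    digest = hashlib.sha256(key.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


counts = Counter(shard_for(uuid.uuid4()) for _ in range(80_000))
print(sorted(counts.items()))   # each shard receives roughly 10,000 keys
```

Because placement depends only on the hash, the same scheme spreads sequential keys just as evenly; the key's own ordering plays no part.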
Tom Harrison
Jun 11, 2017
Conversation with Tom Harrison.
Fedor Losev
Apr 23, 2018
The article talks about giving up global scalability and development simplicity.
Instead, some small space and performance gains are suggested by adding a non-
trivial layer of complexity. No reason was convincing in our days of very cheap storage
space and very expensive human resources. If anything, it was convincing for exactly
the opposite, go…
Tom Harrison
Aug 19, 2018
I think I generally agree with your points. Most databases will not hit the scale or size
where GUIDs have value, so yes, a premature optimization (and complication, etc.) in
most cases.
I don’t agree that the cost is approximately the same. The data system that I am
working with now has hundreds of tables, over 10’s of…
Conversation with Tom Harrison.
Tom Harrison
Aug 19, 2018
If you have the space, a trigger that generates a UUID for storage in a secondary
column is a simple idea. This feels like “future-proofing” which could either turn out
to be an act of prescience (in 5 or 10 years) or an act of premature optimization :-).
Your approach seems efficient and simple, though, so if you think it may be
warranted, it’s probably a good approach.
Conversation with Tom Harrison.
Jeremy Solarz
Jun 29, 2017
Tom Harrison
Jun 30, 2017
Thanks for your question. In our case we defined a value for each customer that is
unique across the company, and thus across databases. That was a business choice (a
good one) rather than an auto-increment. Thus, adding it assures uniqueness. I have
clarified the text.
Conversation with 張旭.
張旭
Jun 10, 2017
using a naive use of UUIDs in string form is wrong: use the built-in database
mechanisms.
張旭
Jun 10, 2017
Conversation with Tom Harrison.
Phil Walsh
Aug 18, 2018
Hi Tom — I keep coming back to this post, but never quite resolving the question I
have in my head — maybe you’d like to comment/advise?
We are currently using uuid’s as our primary keys, but:
we are generating them in python and passing them into the database (rather than
using mysql’s UUID()…
Tom Harrison
Aug 19, 2018
Hi Phil —
I don’t see any reason why it would be better or different to generate UUIDs
programmatically rather than letting the DB do it. The number of possible values is so
vast that collision is improbable.
The database probably has a UUID or GUID type (which is probably just a 128-bit
number), and…
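To put a number on "improbable", here is a sketch using the standard birthday-bound approximation. It assumes random version-4 UUIDs (122 random bits) drawn from a sound random source.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday-bound approximation: p is about 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** random_bits
    return -math.expm1(-n * (n - 1) / (2 * space))


for n in (10**6, 10**9, 10**12):
    print(f"{n:>17,d} UUIDs -> p ~ {collision_probability(n):.1e}")
```

Even at a trillion generated values the probability of any collision stays around one in ten trillion under these assumptions, which is why it does not matter whether the application or the database generates them.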
Conversation with Tom Harrison.
Jonathan Garbee
Oct 14, 2017
With *proper security* set up via an access control list, random guesses won’t matter.
That’s why having proper security built into the application as it is designed is far
superior to any obfuscation technique you can think of to refuse access.
Tom Harrison
Oct 15, 2017
Thanks for the comment, but I think my point is different. We all agree that
security-by-obscurity is insufficient. Proper security focuses on many other aspects
of hardware, software, database and systems design, and is an entire separate discussion.
Conversation with Tom Harrison.
Tom Lo
May 29, 2018
I have a question. I heard a person argue that using a UUID as the primary key decreases
the chance of database contention, because a UUID is random enough to spread B+ tree
inserts around, so that the tree will not become unbalanced when inserting and
deleting. Is it correct? I have never been able to find any material related to this argument.
Tom Harrison
Aug 19, 2018
I think this is true: the UUID address space is so huge that the hashing algorithm results
in very few clusters in the btree. I have no idea how this might affect performance,
resource usage, or efficiency.
Good question!
Bit-Booster!
The 4th reason UUID / GUID’s are good is really important: it allows users to change their
names… Yes, don’t store them as Strings in a DB. Store them as BINARY(16) or VARBINARY if
your db supports those column types.
Tom Harrison
Jun 10, 2017
I think we’re on the same page. We agree that a properly normalized database is a
good thing, and allows the kind of change you refer to. The question is, what value do
you choose for those primary (and thus, foreign) keys. A usual choice is integer, or
perhaps bigint in the case that you’ll have more than a few billion records.
Conversation with Tom Harrison.
Wout Mertens
Mar 19, 2018
I try to use natural keys whenever I can. Natural keys are intrinsic values of an object
that uniquely identify it. In cases where they are almost unique, I “slugify” them like
you do by adding a suffix.
It makes working with database data a lot more intuitive, at the cost of some bytes.
Just thought I’d mention that as an option when selecting a primary key.
Tom Harrison
Aug 19, 2018
Conversation with Tom Harrison.
optimuspaul
Jun 11, 2018
8-byte integers
A uuid is 16 bytes though, so it would need to be spread across two columns if stored
as bigints.
Tom Harrison
Aug 19, 2018
I don’t think that’s necessarily true — if the database has a type for the internal storage
of UUID, it presumably creates a single column of that type having size large enough
to contain it, in this case, 128 bits (16 bytes).
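Both views in this exchange can be checked with a short sketch: the value is a single 128-bit quantity, and it can also be split losslessly into two signed 64-bit integers if only BIGINT columns are available. The helper names are hypothetical.

```python
import struct
import uuid


def uuid_to_bigints(u: uuid.UUID) -> tuple[int, int]:
    """Split a UUID into two signed 64-bit integers (high half, low half)."""
    return struct.unpack(">qq", u.bytes)


def bigints_to_uuid(high: int, low: int) -> uuid.UUID:
    """Reassemble the UUID from the two signed 64-bit halves."""
    return uuid.UUID(bytes=struct.pack(">qq", high, low))


u = uuid.UUID("70e2e8de-500e-4630-b3cb-166131d35c21")
high, low = uuid_to_bigints(u)

print(len(u.bytes) * 8)                  # 128
print(bigints_to_uuid(high, low) == u)   # True
```

A native UUID or BINARY(16) column avoids the two-column split and keeps the key in one indexable value.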
Tom Harrison
I think it’s important to distinguish how a value is represented versus how it is stored. Do read
the comments (and today’s clarifications) about the risk of exposing UUIDs, or even hashed
values both from a security or sharding…
Well, that's not what I said. The hash would never be stored. It is just a value that
passed to a function would return the ID that you can use to find the record… f(hash)
= ID
Conversation with Tom Harrison.
Bob Wakefield
Sep 9, 2017
I actually rarely use GUIDs or UUIDs. My challenge has always been MDM issues like
generating customer numbers. They should be actual numbers for the reasons you
brought up about strings. They also have to be unique enterprise wide. Do you have any
advice, tips, or algorithms you like to use to generate customer numbers? I’ve written
one that takes…
Tom Harrison
Sep 10, 2017
Hey Bob — thanks for the response. I am mos’ def’ not an expert on this stuff, but I can
offer what I have seen done. UUIDs are pretty much bad for everything except what
they are designed for, which is to count really high without having to care what your
next number is. Whether UUIDs are represented as a hex value or a decimal, they’re
both numbers…
Conversation with Tom Harrison.
zaraguo
Jun 22, 2017
Tom Harrison
Jun 23, 2017
There’s no real thing called “secondary primary key” — it’s just something I made up :-)
What I meant is that you would have a column having the same semantics as a
primary key (unique, immutable, not null, indexed, etc.) but having the UUID value
rather than the sequential integer value. It’s secondary in that it is not…
Conversation with Tom Harrison.
Christopher Smith
Jun 11, 2017
There’s a presumption that UUID’s must be random, which isn’t true. Type 1 UUID’s
are particularly useful because they naturally sort by time and also can be grouped by
machine (well, by MAC address). That can be very helpful in a number of situations.
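A short sketch of what a version-1 UUID carries, using Python's `uuid.uuid1()`. The node value is a made-up placeholder rather than a real MAC address, and `created_at` is a hypothetical helper.

```python
import datetime
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
NODE = 0x0123456789AB   # made-up node ID standing in for a MAC address


def created_at(u: uuid.UUID) -> datetime.datetime:
    """Recover the creation time embedded in a version-1 UUID."""
    if u.version != 1:
        raise ValueError("only version-1 UUIDs embed a timestamp")
    unix_ticks = u.time - UUID_EPOCH_OFFSET
    return datetime.datetime.fromtimestamp(unix_ticks / 1e7, tz=datetime.timezone.utc)


ids = [uuid.uuid1(node=NODE, clock_seq=0) for _ in range(3)]

print(all(u.node == NODE for u in ids))              # True: grouped by machine
print(sorted(ids, key=lambda u: u.time) == ids)      # True: ordered by creation time
print(created_at(ids[0]))
```

Note that sorting the raw bytes or the text form does not follow creation time, because the low-order timestamp bits come first in the version-1 layout; the sort key has to be the extracted timestamp.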
Tom Harrison
Sep 10, 2017
Yes, you’re right, this wasn’t clear. Randomness is not important to the idea that you
can use something sufficiently likely to be unique that you don’t have to check the
constraint, even across database instances.
So most definitely having a predictable pattern in a value has huge utility. The use of a
MAC as an identifier…
Conversation with Tom Harrison.
Renan Le Caro
Jun 9, 2017
Tom Harrison
Jun 11, 2017
Andrey Zharikov
Oct 5, 2017
UUIDs are much better for log searches during debugging / investigations. Say we
make a todo service, and we have some todo list item object with key “10” and want to
find recent log entries for it. It’s very easy to find a gazillion records, because
many log entries will mention 10 for whatever reason. With UUIDs, a log search only
gives…
Terence Alderson
Nov 2, 2018
We got lazy at my company and just passed 50 keys out to each client, reserved to
be used as primary keys. A lot of people thought we were crazy but we used a big int
and if you do the math on that … it’s pretty safe. If you have a million users who do a
billion transactions that produce a new key every day that would still last you
something in…
Josef Hovad
Feb 9
Very nice post! As a Postgres user, I guess the pgcrypto function gen_random_uuid() looks
secure enough (at least as far as I investigated). I like your approach combining short
int for internal usage and uuid for any external …
Chris Seufert
Nov 17, 2017
The external/internal thing is probably best left to things like friendly-url treatments,
and then (as Medium does) with a hashed value tacked on the end. Thanks Chris!
The slug part of a Medium URL is not canonical; it’s just there to assist with SEO. You
can delete any part of it and Medium will figure out which article you want from the
id at the end. I would not be surprised if this is just the last 6 bytes (12 chars) of the
UUID for the article.
Great article!
Dzintars Klavins
Mar 18
Does this apply to event sourced distributed systems where every domain service owns
its own database?
Alexey Migutsky
Jun 11, 2017
Actually, it is “just a few bytes”. Here is a good argument from Greg Young on the cost
of data: https://youtu.be/JHGkaShoyNs?t=645
It’s worth watching the whole video to get some ideas about the tradeoffs and their
relation to business domain and chosen architecture.
Adam Arold
Mar 11
You might want to add “…in my opinion”. For example, I don’t think they are a pain; I
memorize UUIDs much better than numbers.
Alexey Migutsky
Jun 11, 2017
How would you manage this when you need to expose some kind of ID which will be
unguessable by an attacker?
Thomas Riedel
Mar 11
But never, ever use GUIDs as internal references inside the database system.
Apparently, it is recommended a lot.
Zax
Sep 10, 2017
Jonathan Garbee
Oct 12, 2017
if you ever need to change keys, all your external references are broken
UUIDs shouldn’t require the keys ever be changed, so this is puzzling. That aside, it is
entirely feasible to build a system to store the old IDs for the routing and forward
them to the new.
Chris Seufert
Nov 17, 2017
If our goal is to scale, and I mean really scale let’s first acknowledge that an int is not
big enough in many cases, maxing out at around 2 billion, which needs 4 bytes. We
have way...
What about storing the UUID as a BINARY(16)? Then it’s only 2x larger than a
BIGINT. Surely storing it as a pretty-formatted hex string is the worst way possible
in terms of storage size; even removing the dashes would be a significant
improvement (11%).