ULID: Universally Unique Lexicographically Sortable Identifier (github.com/ulid)
238 points by brunoluiz on Dec 27, 2018 | hide | past | favorite | 129 comments


See also Firebase push IDs:

https://firebase.googleblog.com/2015/02/the-2120-ways-to-ens...

Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency. I've run into the issue of not having them and having to settle for some kind of <autoincrement_id><UUID> key and it's a huge PITA. How this wasn't considered database 101 decades ago just blows my mind.

I'd like to see a spec included in this for synchronizing clocks or using RAFT/Paxos for generating ULIDs with strong guarantees on sort order.

Also a minor gripe - I wish that the ULID spec checked for microsecond collisions instead of millisecond, because that would be more useful for realtime networked gaming and simulations.


> Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency.

wat?

This doesn't make any sense to me. Requiring that events be partially ordered if not totally ordered is a consistency requirement, sure, but I'm failing to understand how a database's consistency depends on the data you're putting in the database.


And now you need every host to have an atomic clock too.

Maybe CRDTs would be better.


I use ULID in a distributed system where some clocks are off by a bit; it still works well until the skew is way off. The time can be extracted from the ULID for verification before INSERT.


Won't removing the time bits increase the chance of collisions? UUIDs rely on a lot of randomness to avoid the issue, but in ULID the random part is much smaller, and the timestamp fills in for some of it, afaiu


We don't remove the time bits, we inspect them for reasonable values. Eg: never older than 2014-01-01, and anything more than 24h off from our clock gets extra logging+metadata.


I think they're talking about data locality. If a single processing step accesses data that are "neighbours", in some systems that data is likely served from a single server. This means a) better performance and b) higher likelihood of consistency.

Of course the guarantee is still merely eventual consistency, but data locality does reduce the chance this becomes visible.


> Also a minor gripe - I wish that the ULID spec checked for microsecond collisions instead of millisecond, because that would be more useful for realtime networked gaming and simulations.

If you can manage the difficult task of getting your clocks synced that precisely, it's probably easy to shift two characters over from random to timestamp.


> If you can manage the difficult task of getting your clocks synced that precisely

It's only a matter of time (in both senses I guess). See e.g. https://blogs.technet.microsoft.com/networking/2018/07/18/to...


> How this wasn't considered database 101 decades ago just blows my mind.

Decades ago they didn't have distributed data stores.


What is a database then? They've existed for close to 40 years now.


Until recently (and even recently, for the most part) databases have been, almost entirely, centralized (though often shared) stores, not distributed stores.


At a certain point you'll end up wanting fancy clocks too. I believe this is what Google does with Spanner.


the Go oklog/ulid package now supports that


Supports what?


> I wish that the ULID spec checked for microsecond collisions instead of millisecond

Most computers don't have wall clocks with even millisecond precision, much less microsecond. The wall clocks are generally OS-tick precision (~5-10ms ballpark), and their accuracy is poorer still.

Seems like a niche use case. Maybe an extension?


Do you know of any ID systems that have that behavior?


Unfortunately no, but some (many?) REST APIs turn the problem on its head by using private autoincrement IDs in the database and public UUIDs with either a direct 1:1 mapping between the two or a mapping that is unique to each user. This is mostly done for security reasons but I'm not sure if it's considered a best practice because it can introduce additional complexity at the sharding and caching levels (among others):

https://softwareengineering.stackexchange.com/questions/3714...

https://stackoverflow.com/questions/15360245/when-using-uuid...

https://softwareengineering.stackexchange.com/questions/1394...

https://softwareengineering.stackexchange.com/questions/3469...


I'm pretty sure Google's Spanner uses something like Paxos, but you should fact check that, it might just be for replication.


Did anyone else find it ironically irritating that firebase example json was an unordered dictionary instead of a properly ordered list?


Not just that, you can use range queries like this: https://redis.io/commands/zrangebylex

To find events between two specific points.

It can also be used in Dynamodb which is what we use in adtech.
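The trick works because the fixed-width timestamp prefix dominates the lexicographic order. A minimal sketch of the idea in plain Python (with Redis you'd pass the same two boundary strings to ZRANGEBYLEX; the helper names here are made up):

```python
import bisect

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ms_to_prefix(ms: int) -> str:
    """Encode a millisecond timestamp as the 10-character ULID time prefix."""
    chars = []
    for _ in range(10):
        chars.append(ALPHABET[ms % 32])
        ms //= 32
    return "".join(reversed(chars))

def events_between(sorted_ulids, start_ms, end_ms):
    """Return all ULIDs whose timestamps fall in [start_ms, end_ms]."""
    lo = ms_to_prefix(start_ms) + "0" * 16   # smallest possible ULID at start_ms
    hi = ms_to_prefix(end_ms) + "Z" * 16     # largest possible ULID at end_ms
    i = bisect.bisect_left(sorted_ulids, lo)
    j = bisect.bisect_right(sorted_ulids, hi)
    return sorted_ulids[i:j]
```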


We have used ULIDs in production for over a year now-- and have generated millions of these.

First, the main benefit of ULID is that you can generate the IDs within your own software rather than rely on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.

Secondly, being able to sort ULIDs is a nice plus, although not that big of a deal. It makes it relatively easy to shard or partition databases, and it provides a convenient sort if you're not looking for extreme accuracy.

ULIDs are also shorter and slightly more user friendly than UUIDs.

In some circumstances we found the actual implementations to be slightly lacking. For example, the JS library for ULID once returned a 25 character string rather than the standard 26 characters, causing a big ruckus that we had to manually resolve.


> First, the main benefit of ULID is that you can generate the IDs within your own software rather than rely on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.

No you cannot, unless you're running a single threaded server process on a single machine. What you can do is _gamble_ that you probably won't have a collision, which is the same thing you could do with regular UUIDs and you'd be (nearly) guaranteed to never hit a conflict with UUID4 (or probably UUID1/2 if you trusted your mac address uniqueness).

You may find this gamble acceptable and many people do, but you should be aware that pre-generation of UUIDs on independent systems without coordination is not a solvable problem - all attempts to do so rely on the extreme unlikelihood of a collision to feel good about it, or use some coordinated information (like a guaranteed-unique mac address).

(Again, if it's good enough for you, right on - but it isn't theoretically safe)


> No you cannot, unless you're running a single threaded server process on a single machine. What you can do is _gamble_ that you probably won't have a collision

This seems like a pointless distinction.

If I did the math right, you can generate 1,000,000 ULIDs per second (1000 per millisecond) for around 50 million years before you can expect to hit your first collision.

I don't know about you, but I'm pretty sure any system I build won't be running 50 million years from now. Not to mention that the timestamp portion of the ULID will overflow in a mere 9000 years.
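One way to sanity-check that figure (a rough birthday-bound sketch; only IDs sharing the same millisecond can collide, since the timestamp prefix differs otherwise):

```python
SPACE = 2 ** 80   # size of the random part of a ULID
RATE = 1000       # IDs generated per millisecond

# Birthday approximation: chance of a collision among RATE draws in one ms.
p_per_ms = RATE * (RATE - 1) / 2 / SPACE

# Expected milliseconds until the first collision, converted to years.
ms_per_year = 1000 * 60 * 60 * 24 * 365
years = (1 / p_per_ms) / ms_per_year

print(f"{years:.1e} years")  # on the order of tens of millions of years
```

That lands in the same tens-of-millions-of-years ballpark; the exact figure depends on which collision approximation you use.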


I did the math too after reading your comment and got the same result: on average, with 1,000,000 ULIDs per second, I'll wait 50 million years before hitting the first collision.

Here is the Python code I used: https://gist.github.com/ngrilly/565bd27f4ad63244f72578844bca...

But I'm curious to know how you computed this?


I agree that collision is not an issue.

On the other hand, your consistency assumptions may be broken if you rely on the sort order for causality relationship.

You can have ULID1 < ULID2 and yet the event associated with ULID1 may be caused by the ULID2 event.


Does your math rely on perfect entropy on machines you don't control?


Yes. I did the math and got the same result, relying on a perfect entropy hypothesis.


The point was that this was not an advantage of ULID over UUID. You need 2.7 * 10^18 v4 UUIDs for a 50% collision chance.
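That 2.7 * 10^18 figure follows from the birthday bound n ≈ 1.177·√N for ~50% collision odds over N equally likely values, with N = 2^122 for v4's random bits; a quick check, assuming that approximation:

```python
import math

N = 2 ** 122                     # values a v4 UUID's random bits can take
n_half = 1.1774 * math.sqrt(N)   # draws needed for ~50% collision odds
print(f"{n_half:.1e}")           # ≈ 2.7e18
```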


Did you account for long tail events?


I don't see how it is at all a gamble if you understand the trade-offs.

If there is no fatal flaw in the inaccurate/non-deterministic sorting that the wall-clock prefix provides, and one is utilizing optimistic concurrency (strict create on PK, abort/retry on conflict), I don't see how it is unsafe.


The gamble is the removal of a central key registry. If you are assigning these UUID/ULIDs as... UIDs then you need to be able to assign them uniquely... if multiple servers/threads/whatevers can generate and assign them independently then there is a chance you'll suffer from ID collision... it's incredibly rare though.


But with optimistic concurrency, why is this a gamble? It's an overhead that comes as a result of the trade-offs, sure, but it's a finite cost. Collision can be handled without breaking anything at the cost of implementation complexity.


> What you can do is _gamble_ that you probably won't have a collision, which is the same thing you could do with regular UUIDs and you'd be (nearly) guaranteed to never hit a conflict with UUID4 (or probably UUID1/2 if you trusted your mac address uniqueness).

UUID4 is no safer than ULID, since they both rely on randomness.


There are definitely systems out there whose clocks are not accurate to the millisecond. It's not healthy for systems to encourage false assumptions (e.g., that ids monotonically increase).


I was surprised to find nothing about clock synchronization or quality in the README...

> Monotonic sort order (correctly detects and handles the same millisecond)

EDIT: None of these concerns are directly related to the data format, but would be something I'd explain before users make false assumptions.


This guarantees that ULIDs generated by the same machine are monotonic. Of course, the guarantee doesn't apply to different machines.


The failure on overflow is weird.

Since the random part starts at a random point and isn't allowed to wrap, even if you generate only a small number of IDs per millisecond you have a constant ~1/2^79 chance of failing. The chance is small but reachable for a large network. (bitcoin does 2^88 hashes a day)

It could have just wrapped with no problem, because there's no possible node that could generate 2^80 IDs by itself. And if you have multiple nodes it doesn't help there either.


Bitcoin does 2^88 a day but from what I can tell the overflow issue means you'd have to do that much in a millisecond. Seems unlikely?


The value starts at a random point and can't overflow. This means that in a particular millisecond you might start at ZZZZZZZZZZZZZHJ and hit an error after a few hundred IDs. Your chance of hitting an issue is (roughly) proportional to the total number of IDs you generate.
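That proportionality is easy to verify exhaustively at a smaller bit-width (16 bits standing in for ULID's 80): exactly k-1 of the possible starting points leave too little headroom for k IDs in one millisecond.

```python
BITS = 16   # stand-in for ULID's 80 random bits
k = 100     # IDs generated within a single millisecond

# A millisecond "fails" if the random starting value is so close to the
# maximum that k - 1 increments would overflow the field.
fails = sum(1 for start in range(1 << BITS) if start + (k - 1) >= (1 << BITS))

# Failure probability is exactly (k - 1) / 2^BITS, i.e. proportional to k.
assert fails == k - 1
```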


They also say "Within the same millisecond, sort order is not guaranteed", but the carry behavior seems intended to preserve order even within a millisecond. I guess they mean order is not guaranteed between different ULID generators with different random parts.


> (bitcoin does 2^88 hashes a day)

Does it? At 50M TH/s (https://www.blockchain.com/en/charts/hash-rate) that's 10^8 TH/s, or 10^8 × 10^12 = 10^20 hashes per second. There are fewer than 10^5 seconds in a day, so surely that's only 10^25 per day. If I've messed up each of these numbers by an order of magnitude, that would still leave it well under 10^30.

2^88 is absolutely enormous.


Okay, I checked what I calculated earlier. I accidentally entered the scientific notation wrong so that was actually the number of hashes in 100 days.

10^25 is close to right for one day, and 10^25 is roughly 2^83.

2^88 is both absolutely enormous and a real number that bitcoin hits in less than a year.

So the point stands that a large deployment could rarely hit a 2^79 random failure case.


Fair enough, I'd mentally worked out the 10^ -> 2^ conversion entirely wrong, sorry about that.


One of the constraints is that lexicographic sorting must order the IDs by generation time. This precludes wrapping. It's a little weird that they didn't just reserve or add some bits for sorting within a millisecond, before or instead of having to increment the randomness.


Yeah, they could have constrained the random value to always start with the highest bit set to 0. Then they would be guaranteed to be able to generate 2^79 values per millisecond.

But that's sort of an implementation detail. You could write an implementation that did that and still be compatible with all other implementations.


> One of the constraints is that lexicographic sorting must order the nodes by generation time. This precludes wrapping.

I suppose, if you're particularly worried about the one-node situation.

> It's a little weird that they didn't just reserve or add some bits for sorting within milliseconds before or instead of having to increment the randomness.

There's no particular reason to make the fields separate. If you mask out the top bit of the random number then they can share and also make overflow effectively impossible.

It's also worth noting that there are two bits that go completely unused. They could eat overflow if the layout was slightly rearranged.


> I suppose, if you're particularly worried about the one-node situation

This algorithm doesn't work if you have multiple nodes, so I don't think any other situation is relevant here.

> There's no particular reason to make the fields separate. If you mask out the top bit of the random number then they can share and also make overflow effectively impossible

Yes, this is what I meant by reserving some bits.

> It's also worth noting that there are two bits that go completely unused. They could eat overflow if the layout was slightly rearranged.

Yeah, it's super weird they didn't do that. I guess they just really like having a power of two number of bits.


> This algorithm doesn't work if you have multiple nodes

Then what's the random part for?

> Yes, this is what I meant by reserving some bits.

What I'm saying is, if you separate it out into an increment-field and a random-field, then you need a lot of bits and you need to fundamentally change how it works.

If you merely make sure your random number starts below some threshold, you only need 1 bit, or a small fraction of a bit. You would still increment the random number, but you wouldn't have to worry about hitting the max value if you always start between 0 and 2^79.9, for example.


> Then what's the random part for?

It's for multiple nodes, which is why this algorithm doesn't make any sense for their use case.


Regarding multiple nodes, wouldn't each node generate a different random number? The spec mentions not requiring a MAC address, but it should use some way to seed the RNG differently per node, right? However, this would defeat the lexicographic sorting.


Yes, each node would have different random values, but sorting works because time is at the start of the ULID, so "more significant", and sort order within a millisecond is not guaranteed.


How many milliseconds of downtime per day would be 99th percentile likely to occur in the scenario you describe?

What percentage of milliseconds in the day would that be?


With a code path that's never been tested? It could potentially be a huge problem.


Assuming the coded algorithm works as specified, it's possible to state in probability terms what the likelihood of an algorithm outage is.

It’s turtles all the way down from there, but it’s still valuable to scope “best case scenario” with probability math regardless of potential bugs in code.


It's possible to do the analysis, but instead of trusting that it gets done I'd rather use a slightly different method that only has a success path.


It's worth noting that, since the ID within the same millisecond is incremented, it may be vulnerable to enumeration attacks. So it should not be used to generate one-time tokens or object IDs used in URL addresses.


> Uses Crockford's base32 for better efficiency and readability

Ugh. Crockford's base32 character set doesn't actually solve any of the problems it sets out to solve. Using it suggests to me some uncritical thinking.

It[0] says things like L is excluded, because "[uppercase] L Can be confused with 1". Ignoring the part where that is wildly inaccurate for any font that I've ever seen, why not then also remove G, 6, B, 8, Z, 2, S, or 5?

Reducing 1/I/i/L/l to just 1 does little to resolve visual ambiguity for users, because a user could just as easily read l or I instead of 1 or O instead of 0, because users don't know your made-up rules, which causes real problems because you often don't control both sides of the channel.

[0] - https://www.crockford.com/wrmg/base32.html


Crockford Base32 folds those ambiguous characters on reading: l, L, I, and i are treated as 1; o and O are treated as 0. The user doesn’t need to know the rules.

From the page you quote:

> ”When decoding, upper and lower case letters are accepted, and i and l will be treated as 1 and o will be treated as 0. When encoding, only upper case letters are used.”
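A minimal sketch of that decoding rule in Python (folding table as described on Crockford's page; error handling omitted):

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Ambiguous characters fold onto their canonical digit when decoding.
FOLD = {"I": "1", "L": "1", "O": "0"}

def crockford_decode(s: str) -> int:
    value = 0
    for ch in s.upper():
        value = value * 32 + ALPHABET.index(FOLD.get(ch, ch))
    return value
```

So 1, l, L, I, and i all decode to the same value, which is why only one of them needs to appear in the encoding alphabet.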


G/6, B/8, Z/2, and S/5 are much more likely to actually be visually ambiguous than L/1.


I have never typo'd a G for a 6, but I sure am glad my password manager shows digits in a different color from letters, because I confuse l/1/I and O/0 all the time.


I have confused these when they were handwritten, but not in typed form in a reasonable font.


1/l is visually ambiguous. Since the encoding is case-insensitive, upper-case L must also be excluded. For the same reason lower-case L is not included in Base58.


All codes are encoded only to uppercase in crockford32. If you actually want to get rid of the ambiguity, get rid of 1 and keep L.


>All codes are encoded only to uppercase.

False. ULID specification clearly says that these identifiers are case-insensitive. There's no "uppercase" requirement anywhere.


crockford32 accepts any case, but the canonical output is uppercase


You didn't read very carefully. 1/I/i/L/l are all interpreted as 1, so it doesn't matter which one they are misreading it as. Do you contest that l can be misread as 1 or I? Because that's what the table means. The code is case-insensitive.


> Do you contest that l can be misread as 1 or I?

I declare that 5 can be misread as S and yet there they are. (And that L looks like 1 or I far less frequently than 5 looks like S)


I'm pretty sure l looks like I more frequently than 5 looks like S. You have to draw the line somewhere.


  it's a1so very different in monospaced fonts where these
  identifiers are 1ike1y to be read.  'l' 1ooks much more
  1ike '1' in these fonts.


Crockford also excludes the letter U, because of potential profanity, which I find incredibly odd. He's afraid of a rare FU embedded somewhere in an ID? Why would we care about that?


You seem upset.

Aren't your complaints solved by simply having the parser resolve inputs 1/I/i/L/l to the symbol 1, and inputs 0/O to the symbol 0?


cuid removes some of the randomness and adds a counter and a fingerprint: https://github.com/ericelliott/cuid

The default id in MongoDB does about the same. I always thought the MongoDB identifiers worked well for a lot of use cases.

It's also worth mentioning that incrementing integer ids can scale just fine if you reserve them in large blocks, though they are then no longer guaranteed to match insertion order, e.g.: https://github.com/pingcap/docs/blob/master/sql/mysql-compat...


I like the idea, just also feel 1.21e+24 unique ULIDs per millisecond seems kind of defeated by the millisecond accuracy. This means there are effectively two tolerance values for time at play in the design of this spec that conflict with each other. If we want users to be able to generate ULIDs on such a short timescale (implying it's a realistic use case), then it would seem they should also be able to get comparable accuracy on the timestamp itself.


It would sure be nice if git commit IDs could use something like this. It would be really convenient if you could look at two commit IDs and know which one is older.


What problem does this solve? Why is it necessary to sort a unique id?


Since the ids are generated chronologically, it means you can sort by id instead of created_at. A nice convenience for sure.


It can help operations if you know when an id came from. So that when you get a request id from a user, you don't also have to get a when.


> when you get a request id from a user

Just a public service reminder that "never trust the client" still applies, in case you were imagining user agents generating their own ULIDs to relieve servers of the duty or something. Nothing prevents them from sending duplicate or out-of-order IDs.


I've always been intrigued by having clients provide the id. I think I understand why that choice is made, but it does seem unusually fragile.


This comes out of Twitter Snowflake. The "prior art" links at the bottom are pretty informative.


Using a ULID in a Clustered index?


My experience is that you don't want human-readable, user-visible IDs to be sortable, but to have as unique a prefix (and, with experienced workers, postfix) as possible. So this is certainly useful, but specifying the human-readable representation is somewhat redundant.

Another issue is that there are cases when you want to represent the ID as barcode of reasonable size and readability, which invariably leads to decimal-only Code128 with at most ~30 digits.



And another (in Erlang, 128-bits, but uses the MAC address): https://github.com/boundary/flake


And https://github.com/rs/xid

What would be more interesting to me is a benchmark of how using those as primary keys in Postgres would affect performance.


If the libraries produce 128-bit values, then they could use Postgresql's UUID type. In fact, Postgresql should accept anything as a UUID in text mode as long as the value is in hex with a length of either 32 or 36 (32 + 4 dashes), though using binary mode would probably be faster if the UUID library and your driver support that.

We've been using sortable epoch-based UUIDs as primary key for two different software products, storing them in Postgresql with the builtin UUID type, utilising binary mode. Performance is good.


Yeah - xid, ksuid are all smaller than that. Maybe I could pad with zeros and use the UUID type?


Forgive my ignorance. I'm more of a social scientist than a programmer. Questions:

1. Why not go for 16-character strings (instead of 26 or 36), with each character representing 8 bits?

Sure, you'd need 256 possible characters, but it's almost 2019 and Unicode has been with us for decades now. Surely we could be more cosmopolitan than Americentric ASCII and curate 256 characters for an 8-bit encoding?

With a 16-byte string, we could compare and process strings much faster, particularly with SIMD instructions like Intel/AMD's SSE 4.2 string comparison instructions. They're optimized for 16-byte strings and were introduced many years ago in the Nehalem architecture. That's a couple of generations before Sandy Bridge, so any server today is going to support it.

2. What does it mean to be "user-friendly" when it comes to these sorts of IDs? What are some scenarios where users interact with them or communicate or share them with someone or some authority? Crockford wanted his 32 character set to be easy to convey on a telephone, which seems like an expiring use case today. It seems like we should be able to use all sorts of non-ASCII characters now, without resorting to the Unicode Klingon or Tengwar blocks. Do we really need to be able to pronounce them all like Crockford anticipated?

NOTE: Unicode characters beyond the Basic Latin block take two or more bytes each, so we wouldn't be able to use them encoded as Unicode. What I'm advocating is a 256 character set with each character encoded in one byte, strictly for the purposes of generating these sorts of unique IDs represented by compact 16-character strings. Call it Duarte's Base256. All these other BaseN systems seem orthogonal to character encodings, or they just assume ASCII. I guess my idea would require both a character set and an encoding scheme. The latter would be similar to ISO/IEC 8859-15 and Windows 1252, but more complete with 256 printable characters. A lot of them could probably be emoji.

How good or terrible is this idea?


The string representation is for display purposes/information exchange only. I'm sure most implementations would internally store the data in a 16-byte form (eg the C implementation uses __uint128_t), where the data is essentially a 128-bit number.

Given that, inventing a new character set seems pointless, since you'd compare the data using 128-bit binary operations already anyways (as opposed to lexicographical string comparisons). Which leads to the question - how is ULID different from UUID in practice?


The difference is that time is at the front, so if you need to sort by milliseconds the ID was created, you can.


Thanks, that helps. By the way, why are they time stamped at the beginning, or at all?


How does this compare to Twitter's Snowflake? https://blog.twitter.com/engineering/en_us/a/2010/announcing...


From the obvious things: ULID relies more on randomness while Twitter's Snowflake relies on some assigned worker IDs


There's a trade-off between assigning IDs up front or randomly generating IDs inside a large space. Randomly generating IDs can be done without a central arbiter, but doesn't provide any real guarantee against collisions. People punt on that problem by making their random IDs larger and hoping for the best.


The usual argument: the probability of a collision only needs to be as low as the probability of a single-bit error anywhere on the path. That's the best you could possibly do.


How do you usually estimate the probability of a single-bit error?


Most errors are detected by checksums or hashes, so measure that across all your hardware (client requests, network hops, server RAM) and estimate how often your checksum should have collided and let an error slip by.

Granted, it's pretty rare to work on a big enough system to have solid data on this.


More precisely: to avoid collisions with high probability, random IDs need to be twice the size of sequential IDs for a fixed number of IDs generated, thanks to the birthday problem. You don’t need to “hope for the best”; the probabilities in question can be precisely quantified: https://en.wikipedia.org/wiki/Birthday_problem.


You still need to hope for the best because probabilities that aren't zero or one aren't a guarantee of anything except for the asymptotic limit of frequency as the number of trials approaches infinity.

If your system isn't resilient to collisions (and distributed ID schemes are usually the chosen to enable a system which isn't), you are gambling. Perhaps with very good odds, but gambling.


They solve roughly the same problem. Snowflake is able to create 64-bit unique IDs at the cost of needing a central coordinating service, this format uses 128 bits for IDs which lets them be created independently as long as you have the correct time.


What does 128-bit compatibility with UUID mean? I would expect it to be interoperability at the binary level (e.g. allowing ULIDs to be stored in database columns of UUID type), but UUID has type information encoded in it; how can ULID address this requirement?


UUID has version and variant information encoded, but there are no fixed bits. So all 128-bit numbers are potentially valid UUIDs. For forward compatibility, a database UUID type will set the correct bits when generating a UUID (v4), but many implementations will not check the version/variant flags while processing UUIDs.


edit: you're right. my bad. <strike>UUID4 is just a 128 bit random number.</strike>

That just means they also generate 128 bits.


You are wrong; please check the specification. UUID4 is a 122-bit random number plus 6 bits for variant and version.


So you need good clocks (and timesync) and good entropy. Especially for the lexicographic sorting part, you'll need really good clocks. That's fine, if you can get them. But it's not enough, since you get no origin ID, 1ms is a very long time, and you can't sort events occurring in the same ms.


It's not always a good idea to expose timestamps to third parties. You would then need to obfuscate the url with an API-id, and in that case all the properties of readability and url-compatibility are mostly redundant as the ULID only circulates internally in your cluster.


This seems like it could be useful, but

> Cryptographically secure source of randomness, if possible

I don't think this should be a goal. If this is for IDs, you usually want to optimize for speed of generation and evenness of distribution (beyond merely reducing collisions). The top answer at the link below has a good list of hash algorithms. None of them are cryptographic.

https://softwareengineering.stackexchange.com/questions/4955...


I suspect they want a "good" random generator, and a cryptographically secure generator is a known good generator. Without one you don't have a definition of random and you may end up with an unacceptably high chance of collisions.


IMHO, in most cases you need neither performance nor "security": if you're simply generating IDs for documents in a DB, generating the ID is totally negligible compared with storing them on disk. If it does become a performance bottleneck, you'll quickly find out; it's easy to grab a flamegraph and spot the crypto generator in it. But if you need security, people are quite likely to miss that until something goes wrong...

It's certainly a compromise, but I like their choice


It seems kinda like their way of saying "if your random sections collide, then you're boned", which of course isn't the same as what they said.


What are you securing? It's a cryptographic hash of what?


It's not a hash; you probably want to prevent bad guys from engineering ID collisions against you.


I'd disagree with you on both. ID generated with a secure source of randomness can be used as an opaque access token, e.g. to share some temporary object via e-mailed link.

At the same time, optimizing the speed of generation that much is not necessary in many use cases, when objects are created much less often than accessed - performance impact won't be noticeable. It will likely be faster than another common choice for the source of IDs - a sequence in remote database.


I agree with your point in general (regarding CSPRNG), but ULID turns out to be not suitable as opaque access token since it increments the random part if it's generated in the same millisecond range. So, if someone gets this token, they can try incrementing it and checking if they can get access to something using this derived token, and will get access if another token was generated at the same millisecond. In fact, this is dangerous and should be mentioned in the documentation.


Good point, thank you! This problem can be mitigated by using random increments. If the initial random value is 79-bit and increments are in the range [M, 2^n], where M > 1 and n <= 78, then by varying M and n you can find the right balance between density and security. This way the server can also detect brute-force attacks on IDs with a specific millisecond value (by simply catching the misses) and progressively increase response time to make them ineffective.
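That mitigation could be sketched as a monotonic generator whose step is random rather than 1 (this is the commenter's idea, not part of the ULID spec; the parameter choices below are illustrative):

```python
import secrets

RANDOM_BITS = 80

class MonotonicRandomPart:
    """ULID-style 80-bit random part that steps by a random increment
    (rather than 1) when called again within the same millisecond."""

    def __init__(self, min_step: int = 2, max_step: int = 2 ** 16):
        self.min_step = min_step
        self.max_step = max_step
        self.last_ms = -1
        self.value = 0

    def next_random(self, now_ms: int) -> int:
        if now_ms != self.last_ms:
            # New millisecond: fresh 79-bit start leaves headroom for steps.
            self.last_ms = now_ms
            self.value = secrets.randbits(RANDOM_BITS - 1)
        else:
            # Same millisecond: advance by an unpredictable amount, so a
            # leaked ID cannot be turned into its neighbour by adding 1.
            self.value += self.min_step + secrets.randbelow(self.max_step - self.min_step)
        return self.value
```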


github.com/oklog/ulid does something similar: https://godoc.org/github.com/oklog/ulid#Monotonic


We use a similar approach at my company, with a few extra bits reserved for a machine identifier.


If you need lexicographically sortable uuids, you have a very different problem.


Well, in a lot of cases you don't actually need it, but it's a nice-to-have feature. The typical case is just giving IDs to items you store in a database: you can either give them UUIDs, and enjoy being able to generate the ID in the application without waiting for the database to create related entities; or you choose sequential IDs generated by the database, and the IDs also carry some meaning: you can, for example, sort by ID in your queries by default, as it's roughly equivalent to creation date.


To me this is 'just' an optimization. If sorting by ID means 'sort by creation time', then you are just optimizing away a timestamp in an entity. Overloading the ID, IOW. As for creating the ID in the app: yes, but that is orthogonal to ULID, AFAIK.


What's wrong with just using a creation date if that's what you actually want?


Different than what? Is wanting sortable unique IDs not a legitimate use case for some people?


This is a "holy grail" problem in distributed systems (which this library doesn't solve on its own), but suggesting that it's not a valid problem to solve is confusing. Have you not encountered the use cases or do you prefer a different solution? With either of those objections spelled out this could be an interesting thread!


Information-security can be oversimplified as one big "holy grail" problem unto itself - and when someone presents some software and proclaims it is "the" solution to that problem, but in fact has presented only a partial solution, then we call it "security theatre" and denounce it with extreme prejudice.

I believe the comment you replied to, may be making an analogous argument here about distributed systems.


Any reference to this "holy grail" problem?


We have been doing something similar for a long time, works out great. Glad to see more industry adoption around this!

We also wrote a decentralized clock sync algorithm that can be used where NTP fails, check out https://github.com/amark/gun/blob/master/nts.js !

I find it a little odd they didn't use a separator symbol, so the timestamp wouldn't have to overflow after a certain year. Also, then you could have microsecond precision or beyond where it is supported.

Overall good progress getting people onboard with this! Solves a lot of problems before they even start.


> Each component is encoded with the Most Significant Byte first (network byte order).

This seems a surprising choice. Even the PowerPC now supports little endian. I would guess that 95%+ of all software is running in a little endian system and that any software that would use ULID is going to run in a little endian system. Other than for historical compatibility, I don't think there is any reason to use big endian today, and definitely not for greenfield protocols.


Surely if it were the other way around, the timestamp being LE would mean that the resulting strings no longer sort lexicographically in timestamp order?



