ULID: Universally Unique Lexicographically Sortable Identifier (github.com/ulid)
238 points by brunoluiz on Dec 27, 2018 | hide | past | favorite | 129 comments


See also Firebase push IDs:

https://firebase.googleblog.com/2015/02/the-2120-ways-to-ens...

Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency. I've run into the issue of not having them and having to settle for some kind of <autoincrement_id><UUID> key and it's a huge PITA. How this wasn't considered database 101 decades ago just blows my mind.

I'd like to see a spec included in this for synchronizing clocks or using RAFT/Paxos for generating ULIDs with strong guarantees on sort order.

Also a minor gripe - I wish that the ULID spec checked for microsecond collisions instead of millisecond, because that would be more useful for realtime networked gaming and simulations.


> Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency.

wat?

This doesn't make any sense to me. Requiring that events be partially ordered if not totally ordered is a consistency requirement, sure, but I'm failing to understand how a database's consistency depends on the data you're putting in the database.


And now you need every host to have an atomic clock too.

Maybe CRDTs would be better.


I use ULID in a distributed system where some clocks are off by a bit; it still works well until the skew is way off. The time can be extracted from the ULID for verification before INSERT.


Won't removing the time bits increase the chance of collisions? UUIDs rely on a lot of randomness to avoid the issue, but in ULID the random part is much smaller, and the timestamp fills in for some of it, afaiu


We don't remove the time bits, we inspect them for reasonable values. Eg: never older than 2014-01-01, and anything more than 24h off from our clock gets extra logging+metadata.


I think they're talking about data locality. If a single processing step accesses data that are "neighbours", in some systems that data is likely served from a single server. This means a) better performance and b) higher likelihood of consistency.

Of course the guarantee is still merely eventual consistency, but data locality does reduce the chance this becomes visible.


> Also a minor gripe - I wish that the ULID spec checked for microsecond collisions instead of millisecond, because that would be more useful for realtime networked gaming and simulations.

If you can manage the difficult task of getting your clocks synced that precisely, it's probably easy to shift two characters over from random to timestamp.


> If you can manage the difficult task of getting your clocks synced that precisely

It's only a matter of time (in both senses I guess). See e.g. https://blogs.technet.microsoft.com/networking/2018/07/18/to...


> How this wasn't considered database 101 decades ago just blows my mind.

Decades ago they didn't have distributed data stores.


What is a database then? They've existed for close to 40 years now.


Until recently (and even recently, for the most part) databases have been, almost entirely, centralized (though often shared) stores, not distributed stores.


At a certain point you'll end up wanting fancy clocks too. I believe this is what Google does with Spanner.


the Go oklog/ulid package now supports that


Supports what?


> I wish that the ULID spec checked for microsecond collisions instead of millisecond

Most computers don't have wall clocks with even millisecond precision, much less microsecond. The wall clocks are generally OS-tick precision (~5-10ms ballpark), and their accuracy is poorer still.

Seems like a niche use case. Maybe an extension?


Do you know of any ID systems that have that behavior?


Unfortunately no, but some (many?) REST APIs turn the problem on its head by using private autoincrement IDs in the database and public UUIDs with either a direct 1:1 mapping between the two or a mapping that is unique to each user. This is mostly done for security reasons but I'm not sure if it's considered a best practice because it can introduce additional complexity at the sharding and caching levels (among others):

https://softwareengineering.stackexchange.com/questions/3714...

https://stackoverflow.com/questions/15360245/when-using-uuid...

https://softwareengineering.stackexchange.com/questions/1394...

https://softwareengineering.stackexchange.com/questions/3469...


I'm pretty sure Google's Spanner uses something like Paxos, but you should fact check that, it might just be for replication.


Did anyone else find it ironically irritating that firebase example json was an unordered dictionary instead of a properly ordered list?


Not just that, you can use range queries like this: https://redis.io/commands/zrangebylex

To find events between two specific points.

It can also be used in Dynamodb which is what we use in adtech.
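The trick works because the fixed-width timestamp prefix dominates the lexicographic order. A minimal sketch of the idea in plain Python (with Redis you'd pass the same two boundary strings to ZRANGEBYLEX; the helper names here are made up):

```python
import bisect

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ms_to_prefix(ms: int) -> str:
    """Encode a millisecond timestamp as the 10-character ULID time prefix."""
    chars = []
    for _ in range(10):
        chars.append(ALPHABET[ms % 32])
        ms //= 32
    return "".join(reversed(chars))

def events_between(sorted_ulids, start_ms, end_ms):
    """Return all ULIDs whose timestamps fall in [start_ms, end_ms]."""
    lo = ms_to_prefix(start_ms) + "0" * 16   # smallest possible ULID at start_ms
    hi = ms_to_prefix(end_ms) + "Z" * 16     # largest possible ULID at end_ms
    i = bisect.bisect_left(sorted_ulids, lo)
    j = bisect.bisect_right(sorted_ulids, hi)
    return sorted_ulids[i:j]
```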


We have used ULIDs in production for over a year now-- and have generated millions of these.

First, the main benefit of ULID is that you can generate the IDs within your own software rather than rely on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.

Secondly, being able to sort ULIDs is a nice plus, although not that big of a deal. It makes it relatively easy to shard or partition databases, and it provides a convenient sort if you're not looking for extreme accuracy.

ULIDs are also shorter and slightly more user friendly than UUIDs.

In some circumstances we found the actual implementations to be slightly lacking. For example, the JS library for ULID once returned a 25 character string rather than the standard 26 characters, causing a big ruckus that we had to manually resolve.


> First, the main benefit of ULID is that you can generate the IDs within your own software rather than rely on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.

No you cannot, unless you're running a single threaded server process on a single machine. What you can do is _gamble_ that you probably won't have a collision, which is the same thing you could do with regular UUIDs and you'd be (nearly) guaranteed to never hit a conflict with UUID4 (or probably UUID1/2 if you trusted your mac address uniqueness).

You may find this gamble acceptable and many people do, but you should be aware that pre-generation of UUIDs on independent systems without coordination is not a solvable problem - all attempts to do so rely on the extreme unlikelihood of a collision to feel good about it, or use some coordinated information (like a guaranteed-unique mac address).

(Again, if it's good enough for you, right on - but it isn't theoretically safe)


> No you cannot, unless you're running a single threaded server process on a single machine. What you can do is _gamble_ that you probably won't have a collision

This seems like a pointless distinction.

If I did the math right, you can generate 1,000,000 ULIDs per second (1000 per millisecond) for around 50 million years before you can expect to hit your first collision.

I don't know about you, but I'm pretty sure any system I build won't be running 50 million years from now. Not to mention that the timestamp portion of the ULID will overflow in a mere 9000 years.
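One way to sanity-check that figure (a rough birthday-bound sketch; only IDs sharing the same millisecond can collide, since the timestamp prefix differs otherwise):

```python
SPACE = 2 ** 80   # size of the random part of a ULID
RATE = 1000       # IDs generated per millisecond

# Birthday approximation: chance of a collision among RATE draws in one ms.
p_per_ms = RATE * (RATE - 1) / 2 / SPACE

# Expected milliseconds until the first collision, converted to years.
ms_per_year = 1000 * 60 * 60 * 24 * 365
years = (1 / p_per_ms) / ms_per_year

print(f"{years:.1e} years")  # on the order of tens of millions of years
```

That lands in the same tens-of-millions-of-years ballpark; the exact figure depends on which collision approximation you use.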


I did the math too after reading your comment and got the same result: on average, with 1,000,000 ULIDs per second, I'll wait 50 million years before hitting the first collision.

Here is the Python code I used: https://gist.github.com/ngrilly/565bd27f4ad63244f72578844bca...

But I'm curious to know how you computed this?


I agree that collision is not an issue.

On the other hand, your consistency assumptions may be broken if you rely on the sort order for causality relationship.

You can have ULID1 < ULID2 and yet the event associated with ULID1 may be caused by the ULID2 event.


Does your math rely on perfect entropy on machines you don't control?


Yes. I did the math and got the same result, relying on a perfect entropy hypothesis.


The point was that this was not an advantage of ULID over UUID. You need 2.7 * 10^18 v4 UUIDs for a 50% collision chance.
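That 2.7 * 10^18 figure follows from the birthday bound n ≈ 1.177·√N for ~50% collision odds over N equally likely values, with N = 2^122 for v4's random bits; a quick check, assuming that approximation:

```python
import math

N = 2 ** 122                     # values a v4 UUID's random bits can take
n_half = 1.1774 * math.sqrt(N)   # draws needed for ~50% collision odds
print(f"{n_half:.1e}")           # ≈ 2.7e18
```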


Did you account for long tail events?


I don't see how it is at all a gamble if you understand the trade-offs.

If there is no fatal flaw in the inaccurate/non-deterministic sorting that the wall-clock prefix provides, and one is utilizing optimistic concurrency (strict create on PK, abort/retry on conflict), I don't see how it is unsafe.


The gamble is the removal of a central key registry. If you are assigning these UUID/ULIDs as... UIDs then you need to be able to assign them uniquely... if multiple servers/threads/whatevers can generate and assign them independently then there is a chance you'll suffer from ID collision... it's incredibly rare though.


But with optimistic concurrency, why is this a gamble? It's an overhead that comes as a result of the trade-offs, sure, but it's a finite cost. Collision can be handled without breaking anything at the cost of implementation complexity.


> What you can do is _gamble_ that you probably won't have a collision, which is the same thing you could do with regular UUIDs and you'd be (nearly) guaranteed to never hit a conflict with UUID4 (or probably UUID1/2 if you trusted your mac address uniqueness).

UUID4 is no safer than ULID, since they both rely on randomness.


There are definitely systems out there whose clocks are not accurate to the millisecond. It's not healthy for systems to encourage false assumptions (e.g., that ids monotonically increase).


I was surprised to find nothing about clock synchronization or quality in the README...

> Monotonic sort order (correctly detects and handles the same millisecond)

EDIT: None of these concerns are directly related to the data format, but would be something I'd explain before users make false assumptions.


This guarantees that ULIDs generated by the same machine are monotonic. Of course, the guarantee doesn't apply to different machines.


The failure on overflow is weird.

Since the random part starts at a random point and isn't allowed to wrap, even if you generate only a small number of IDs per millisecond you have a constant ~1/2^79 chance of failing. The chance is small but reachable for a large network. (bitcoin does 2^88 hashes a day)

It could have just wrapped with no problem, because there's no possible node that could generate 2^80 IDs by itself. And if you have multiple nodes it doesn't help there either.


Bitcoin does 2^88 a day but from what I can tell the overflow issue means you'd have to do that much in a millisecond. Seems unlikely?


The value starts at a random point and can't overflow. This means that in a particular millisecond you might start at ZZZZZZZZZZZZZHJ and hit an error after a few hundred IDs. Your chance of hitting an issue is (roughly) proportional to the total number of IDs you generate.
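That proportionality is easy to verify exhaustively at a smaller bit-width (16 bits standing in for ULID's 80): exactly k-1 of the possible starting points leave too little headroom for k IDs in one millisecond.

```python
BITS = 16   # stand-in for ULID's 80 random bits
k = 100     # IDs generated within a single millisecond

# A millisecond "fails" if the random starting value is so close to the
# maximum that k - 1 increments would overflow the field.
fails = sum(1 for start in range(1 << BITS) if start + (k - 1) >= (1 << BITS))

# Failure probability is exactly (k - 1) / 2^BITS, i.e. proportional to k.
assert fails == k - 1
```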


They also say "Within the same millisecond, sort order is not guaranteed", but the carry behavior seems intended to preserve order even within a millisecond. I guess they mean order is not guaranteed between different ULID generators with different random parts.


> (bitcoin does 2^88 hashes a day)

Does it? At 50M TH/s (https://www.blockchain.com/en/charts/hash-rate) that's 10^8 TH/s, or 10^8 × 10^12 = 10^20 hashes per second. There are fewer than 10^5 seconds in a day, so surely that's only 10^25 per day. If I've messed up each of these numbers by an order of magnitude, that would still leave it well under 10^30.

2^88 is absolutely enormous.


Okay, I checked what I calculated earlier. I accidentally entered the scientific notation wrong so that was actually the number of hashes in 100 days.

10^25 is close to right for one day, and 10^25 is roughly 2^83.

2^88 is both absolutely enormous and a real number that bitcoin hits in less than a year.

So the point stands that a large deployment could rarely hit a 2^79 random failure case.


Fair enough, I'd mentally worked out the 10^ -> 2^ conversion entirely wrong, sorry about that.


One of the constraints is that lexicographic sorting must order the IDs by generation time. This precludes wrapping. It's a little weird that they didn't just reserve or add some bits for sorting within a millisecond, before or instead of having to increment the randomness.


Yeah, they could have constrained the random value to always start with the highest bit set to 0. Then they would be guaranteed to be able to generate 2^79 values per millisecond.

But that's sort of an implementation detail. You could write an implementation that did that and still be compatible with all other implementations.


> One of the constraints is that lexicographic sorting must order the nodes by generation time. This precludes wrapping.

I suppose, if you're particularly worried about the one-node situation.

> It's a little weird that they didn't just reserve or add some bits for sorting within milliseconds before or instead of having to increment the randomness.

There's no particular reason to make the fields separate. If you mask out the top bit of the random number then they can share and also make overflow effectively impossible.

It's also worth noting that there are two bits that go completely unused. They could eat overflow if the layout was slightly rearranged.


> I suppose, if you're particularly worried about the one-node situation

This algorithm doesn't work if you have multiple nodes, so I don't think any other situation is relevant here.

> There's no particular reason to make the fields separate. If you mask out the top bit of the random number then they can share and also make overflow effectively impossible

Yes, this is what I meant by reserving some bits.

> It's also worth noting that there are two bits that go completely unused. They could eat overflow if the layout was slightly rearranged.

Yeah, it's super weird they didn't do that. I guess they just really like having a power of two number of bits.


> This algorithm doesn't work if you have multiple nodes

Then what's the random part for?

> Yes, this is what I meant by reserving some bits.

What I'm saying is, if you separate it out into an increment-field and a random-field, then you need a lot of bits and you need to fundamentally change how it works.

If you merely make sure your random number starts below some threshold, you only need 1 bit, or a small fraction of a bit. You would still increment the random number, but you wouldn't have to worry about hitting the max value if you always start between 0 and 2^79.9, for example.


> Then what's the random part for?

It's for multiple nodes, which is why this algorithm doesn't make any sense for their use case.


Regarding multiple nodes, wouldn't each node generate a different random number? The spec mentions not requiring a MAC address, but it should use some way to seed the RNG differently per node, right? However, this would defeat the lexicographic sorting.


Yes, each node would have different random values, but sorting works because time is at the start of the ULID, so "more significant", and sort order within a millisecond is not guaranteed.


How many milliseconds of downtime per day would be 99th percentile likely to occur in the scenario you describe?

What percentage of milliseconds in the day would that be?


With a code path that's never been tested? It could potentially be a huge problem.


Assuming the coded algorithm works as specified, it's possible to state in probability terms what the likelihood of an algorithm outage is.

It’s turtles all the way down from there, but it’s still valuable to scope “best case scenario” with probability math regardless of potential bugs in code.


It's possible to do the analysis, but instead of trusting that it gets done I'd rather use a slightly different method that only has a success path.


It's worth noting that, since the ID within the same millisecond is incremented, it may be vulnerable to enumeration attacks. So it should not be used to generate one-time tokens or object IDs used in URL addresses.


> Uses Crockford's base32 for better efficiency and readability

Ugh. Crockford's base32 character set doesn't actually solve any of the problems it sets out to solve. Using it suggests to me some uncritical thinking.

It[0] says things like L is excluded, because "[uppercase] L Can be confused with 1". Ignoring the part where that is wildly inaccurate for any font that I've ever seen, why not then also remove G, 6, B, 8, Z, 2, S, or 5?

Reducing 1/I/i/L/l to just 1 does little to resolve visual ambiguity for users, because a user could just as easily read l or I instead of 1 or O instead of 0, because users don't know your made-up rules, which causes real problems because you often don't control both sides of the channel.

[0] - https://www.crockford.com/wrmg/base32.html


Crockford Base32 folds those ambiguous characters on reading: l, L, I, and i are treated as 1; o and O are treated as 0. The user doesn’t need to know the rules.

From the page you quote:

> ”When decoding, upper and lower case letters are accepted, and i and l will be treated as 1 and o will be treated as 0. When encoding, only upper case letters are used.”
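A minimal sketch of that decoding rule in Python (folding table as described on Crockford's page; error handling omitted):

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Ambiguous characters fold onto their canonical digit when decoding.
FOLD = {"I": "1", "L": "1", "O": "0"}

def crockford_decode(s: str) -> int:
    value = 0
    for ch in s.upper():
        value = value * 32 + ALPHABET.index(FOLD.get(ch, ch))
    return value
```

So 1, l, L, I, and i all decode to the same value, which is why only one of them needs to appear in the encoding alphabet.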


G/6, B/8, Z/2, and S/5 are much more likely to actually be visually ambiguous than L/1.


I have never typo'd a G for a 6, but I sure am glad my password manager shows digits in a different color from letters, because I confuse l/1/I and O/0 all the time.


I have confused these when they were handwritten, but not in typed form in a reasonable font.


1/l is visually ambiguous. Since the encoding is case-insensitive, upper-case L must also be excluded. For the same reason lower-case L is not included in Base58.


All codes are encoded only to uppercase in crockford32. If you actually want to get rid of the ambiguity, get rid of 1 and keep L.


>All codes are encoded only to uppercase.

False. ULID specification clearly says that these identifiers are case-insensitive. There's no "uppercase" requirement anywhere.


crockford32 accepts any case, but the canonical output is uppercase


You didn't read very carefully. 1/I/i/L/l are all interpreted as 1, so it doesn't matter which one they are misreading it as. Do you contest that l can be misread as 1 or I? Because that's what the table means. The code is case-insensitive.


> Do you contest that l can be misread as 1 or I?

I declare that 5 can be misread as S and yet there they are. (And that L looks like 1 or I far less frequently than 5 looks like S)


I'm pretty sure l looks like I more frequently than 5 looks like S. You have to draw the line somewhere.


  it's a1so very different in monospaced fonts where these
  identifiers are 1ike1y to be read.  'l' 1ooks much more
  1ike '1' in these fonts.


Crockford also excludes the letter U, because of potential profanity, which I find incredibly odd. He's afraid of a rare FU embedded somewhere in an ID? Why would we care about that?


You seem upset.

Aren't your complaints solved by simply having the parser resolve inputs 1/I/i/L/l to the symbol 1, and inputs 0/O to the symbol 0?


cuid removes some of the randomness and adds a counter and a fingerprint: https://github.com/ericelliott/cuid

The default id in MongoDB does about the same. I always thought the MongoDB identifiers worked well for a lot of use cases.

It's also worth mentioning that incrementing integer ids can scale just fine if you reserve them in large blocks, though they are then no longer guaranteed to match insertion order, e.g.: https://github.com/pingcap/docs/blob/master/sql/mysql-compat...


I like the idea, just also feel 1.21e+24 unique ULIDs per millisecond seems kind of defeated by the millisecond accuracy. This means there are effectively two tolerance values for time at play in the design of this spec that conflict with each other. If we want users to be able to generate ULIDs on such a short timescale (implying it's a realistic use case), then it would seem they should also be able to get comparable accuracy on the timestamp itself.


It would sure be nice if git commit IDs could use something like this. It would be really convenient if you could look at two commit IDs and know which one is older.


What problem does this solve? Why is it necessary to sort a unique id?


Since the ids are generated chronologically, it means you can sort by id instead of created_at. A nice convenience for sure.


It can help operations if you know when an id came from. So that when you get a request id from a user, you don't also have to get a when.


> when you get a request id from a user

Just a public service reminder that "never trust the client" still applies, in case you were imagining user agents generating their own ULIDs to relieve servers of the duty or something. Nothing prevents them from sending duplicate or out-of-order IDs.


I've always been intrigued by having clients provide the id. I think I understand why that choice is made, but it does seem unusually fragile.


This comes out of Twitter Snowflake. The "prior art" links at the bottom are pretty informative.


Using a ULID in a Clustered index?


My experience is that you don't want human-readable, user-visible IDs to be sortable, but to have as unique a prefix (and, with experienced workers, postfix) as possible. So this is certainly useful, but specifying the human-readable representation is somewhat redundant.

Another issue is that there are cases when you want to represent the ID as barcode of reasonable size and readability, which invariably leads to decimal-only Code128 with at most ~30 digits.



And another (in Erlang, 128-bits, but uses the MAC address): https://github.com/boundary/flake


And https://github.com/rs/xid

What would be more interesting to me is a benchmark of how using those as primary keys in Postgres would affect performance.


If the libraries produce 128-bit values, then they could use Postgresql's UUID type. In fact, Postgresql should accept anything as a UUID in text mode as long as the value is in hex with a length of either 32 or 36 (32 + 4 dashes), though using binary mode would probably be faster if the UUID library and your driver support that.

We've been using sortable epoch-based UUIDs as primary key for two different software products, storing them in Postgresql with the builtin UUID type, utilising binary mode. Performance is good.


Yeah - xid, ksuid are all smaller than that. Maybe I could pad with zeros and use the UUID type?


Forgive my ignorance. I'm more of a social scientist than a programmer. Questions:

1. Why not go for 16-character strings (instead of 26 or 36), with each character representing 8 bits?

Sure, you'd need 256 possible characters, but it's almost 2019 and Unicode has been with us for decades now. Surely we could be more cosmopolitan than Americentric ASCII and curate 256 characters for an 8-bit encoding?

With a 16-byte string, we could compare and process strings much faster, particularly with SIMD instructions like Intel/AMD's SSE 4.2 string comparison instructions. They're optimized for 16-byte strings and were introduced many years ago in the Nehalem architecture. That's a couple of generations before Sandy Bridge, so any server today is going to support it.

2. What does it mean to be "user-friendly" when it comes to these sorts of IDs? What are some scenarios where users interact with them or communicate or share them with someone or some authority? Crockford wanted his 32 character set to be easy to convey on a telephone, which seems like an expiring use case today. It seems like we should be able to use all sorts of non-ASCII characters now, without resorting to the Unicode Klingon or Tengwar blocks. Do we really need to be able to pronounce them all like Crockford anticipated?

NOTE: Unicode characters beyond the Basic Latin block take two or more bytes each, so we wouldn't be able to use them encoded as Unicode. What I'm advocating is a 256 character set with each character encoded in one byte, strictly for the purposes of generating these sorts of unique IDs represented by compact 16-character strings. Call it Duarte's Base256. All these other BaseN systems seem orthogonal to character encodings, or they just assume ASCII. I guess my idea would require both a character set and an encoding scheme. The latter would be similar to ISO/IEC 8859-15 and Windows 1252, but more complete with 256 printable characters. A lot of them could probably be emoji.

How good or terrible is this idea?


The string representation is for display purposes/information exchange only. I'm sure most implementations would internally store the data in a 16-byte form (eg the C implementation uses __uint128_t), where the data is essentially a 128-bit number.

Given that, inventing a new character set seems pointless, since you'd compare the data using 128-bit binary operations already anyways (as opposed to lexicographical string comparisons). Which leads to the question - how is ULID different from UUID in practice?


The difference is that time is at the front, so if you need to sort by milliseconds the ID was created, you can.


Thanks, that helps. By the way, why are they time stamped at the beginning, or at all?


How does this compare to Twitter's Snowflake? https://blog.twitter.com/engineering/en_us/a/2010/announcing...


From the obvious things: ULID relies more on randomness while Twitter's Snowflake relies on some assigned worker IDs


There's a trade-off between assigning IDs up front or randomly generating IDs inside a large space. Randomly generating IDs can be done without a central arbiter, but doesn't provide any real guarantee against collisions. People punt on that problem by making their random IDs larger and hoping for the best.


The usual argument: the probability of a collision only needs to be as low as the probability of a single-bit error anywhere on the path. That's the best you could possibly do.


How do you usually estimate the probability of a single-bit error?


Most errors are detected by checksums or hashes, so measure that across all your hardware (client requests, network hops, server RAM) and estimate how often your checksum should have collided and let an error slip by.

Granted, it's pretty rare to work on a big enough system to have solid data on this.


More precisely: to avoid collisions with high probability, random IDs need to be twice the size of sequential IDs for a fixed number of IDs generated, thanks to the birthday problem. You don’t need to “hope for the best”; the probabilities in question can be precisely quantified: https://en.wikipedia.org/wiki/Birthday_problem.


You still need to hope for the best because probabilities that aren't zero or one aren't a guarantee of anything except for the asymptotic limit of frequency as the number of trials approaches infinity.

If your system isn't resilient to collisions (and distributed ID schemes are usually the chosen to enable a system which isn't), you are gambling. Perhaps with very good odds, but gambling.


They solve roughly the same problem. Snowflake is able to create 64-bit unique IDs at the cost of needing a central coordinating service, this format uses 128 bits for IDs which lets them be created independently as long as you have the correct time.


What does 128-bit compatibility with UUID mean? I would expect it to be interoperability at the binary level (e.g. allowing ULIDs to be stored in database columns of UUID type), but UUID has type information encoded in it; how can ULID address this requirement?


UUID has version and variant information encoded, but there are no fixed bits. So all 128-bit numbers are potentially valid UUIDs. For forward compatibility, a database UUID type will set the correct bits when generating a UUID (v4), but many implementations will not check the version/variant flags while processing UUIDs.


edit: you're right. my bad. <strike>UUID4 is just a 128 bit random number.</strike>

That just means they also generate 128 bits.


You are wrong; please check the specification. UUID4 is a 122-bit random number plus 6 bits for variant and version.


So you need good clocks (and timesync) and good entropy. Especially for the lexicographic sorting part, you'll need really good clocks. That's fine, if you can get them. But it's not enough, since you get no origin ID, 1ms is a very long time, and you can't sort events occurring in the same ms.


It's not always a good idea to expose timestamps to third parties. You would then need to obfuscate the url with an API-id, and in that case all the properties of readability and url-compatibility are mostly redundant as the ULID only circulates internally in your cluster.


This seems like it could be useful, but

> Cryptographically secure source of randomness, if possible

I don't think this should be a goal. If this is for IDs, you usually want to optimize for speed of generation and evenness of distribution (beyond merely reducing collisions). The top answer at the link below has a good list of hash algorithms. None of them are cryptographic.

https://softwareengineering.stackexchange.com/questions/4955...


I suspect they want a "good" random generator, and a cryptographically secure generator is a known good generator. Without one you don't have a definition of random and you may end up with an unacceptably high chance of collisions.


IMHO, in most cases you need neither performance nor "security": if you're simply generating IDs for documents in a DB, generating the ID is totally negligible compared with storing them on disk. If it does become a performance bottleneck, you'll quickly find out; it's easy to grab a flamegraph and spot the crypto generator in it. But if you need security, people are quite likely to miss that until something goes wrong...

It's certainly a compromise, but I like their choice


It seems kinda like their way of saying "if your random sections collide, then you're boned", which of course isn't the same as what they said.


What are you securing? It's a cryptographic hash of what?


It's not a hash; you probably want to prevent bad guys from engineering ID collisions against you.


I'd disagree with you on both. ID generated with a secure source of randomness can be used as an opaque access token, e.g. to share some temporary object via e-mailed link.

At the same time, optimizing the speed of generation that much is not necessary in many use cases, when objects are created much less often than accessed - performance impact won't be noticeable. It will likely be faster than another common choice for the source of IDs - a sequence in remote database.


I agree with your point in general (regarding CSPRNG), but ULID turns out to be not suitable as opaque access token since it increments the random part if it's generated in the same millisecond range. So, if someone gets this token, they can try incrementing it and checking if they can get access to something using this derived token, and will get access if another token was generated at the same millisecond. In fact, this is dangerous and should be mentioned in the documentation.


Good point, thank you! This problem can be mitigated by using random increments. If the initial random value is 79-bit and increments are in the range [M, 2^n], where M > 1 and n <= 78, then by varying M and n you can find the right balance between density and security. This way the server can also detect brute-force attacks on IDs with a specific millisecond value (by simply catching the misses) and progressively increase response time to make them ineffective.
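That mitigation could be sketched as a monotonic generator whose step is random rather than 1 (this is the commenter's idea, not part of the ULID spec; the parameter choices below are illustrative):

```python
import secrets

RANDOM_BITS = 80

class MonotonicRandomPart:
    """ULID-style 80-bit random part that steps by a random increment
    (rather than 1) when called again within the same millisecond."""

    def __init__(self, min_step: int = 2, max_step: int = 2 ** 16):
        self.min_step = min_step
        self.max_step = max_step
        self.last_ms = -1
        self.value = 0

    def next_random(self, now_ms: int) -> int:
        if now_ms != self.last_ms:
            # New millisecond: fresh 79-bit start leaves headroom for steps.
            self.last_ms = now_ms
            self.value = secrets.randbits(RANDOM_BITS - 1)
        else:
            # Same millisecond: advance by an unpredictable amount, so a
            # leaked ID cannot be turned into its neighbour by adding 1.
            self.value += self.min_step + secrets.randbelow(self.max_step - self.min_step)
        return self.value
```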


github.com/oklog/ulid does something similar: https://godoc.org/github.com/oklog/ulid#Monotonic


We use a similar approach at my company, with a few extra bits reserved for a machine identifier.


If you need lexicographically sortable uuids, you have a very different problem.


Well, in a lot of cases you don't actually need it, but it's a nice-to-have feature. The typical case is just giving IDs to items you store in a database: you can either give them UUIDs, and enjoy being able to generate the ID in the application without waiting for the database to create related entities; or you choose sequential IDs generated by the database, and the IDs also carry some meaning: you can, for example, sort by ID in your queries by default, as it's roughly equivalent to creation date.


To me this is 'just' an optimization. If sorting by ID means 'sort by creation time', then you are just optimizing away a timestamp in an entity. Overloading the ID, IOW. As for creating the ID in the app: yes, but that is orthogonal to ULID, AFAIK.


What's wrong with just using a creation date if that's what you actually want?


Different than what? Is wanting sortable unique IDs not a legitimate use case for some people?


This is a "holy grail" problem in distributed systems (which this library doesn't solve on its own), but suggesting that it's not a valid problem to solve is confusing. Have you not encountered the use cases or do you prefer a different solution? With either of those objections spelled out this could be an interesting thread!


Information-security can be oversimplified as one big "holy grail" problem unto itself - and when someone presents some software and proclaims it is "the" solution to that problem, but in fact has presented only a partial solution, then we call it "security theatre" and denounce it with extreme prejudice.

I believe the comment you replied to, may be making an analogous argument here about distributed systems.


Any reference to this "holy grail" problem?


We have been doing something similar for a long time, works out great. Glad to see more industry adoption around this!

We also wrote a decentralized clock sync algorithm that can be used where NTP fails, check out https://github.com/amark/gun/blob/master/nts.js !

I find it a little odd they didn't use a separator symbol, so the timestamp wouldn't have to overflow after a certain year. Also, then you could have microsecond precision or beyond where it is supported.

Overall good progress getting people onboard with this! Solves a lot of problems before they even start.


> Each component is encoded with the Most Significant Byte first (network byte order).

This seems a surprising choice. Even the PowerPC now supports little endian. I would guess that 95%+ of all software is running in a little endian system and that any software that would use ULID is going to run in a little endian system. Other than for historical compatibility, I don't think there is any reason to use big endian today, and definitely not for greenfield protocols.


Surely if it were the other way around, the timestamp being LE would mean that the resulting strings no longer sort lexicographically in timestamp order?



