Unicode is not hard. What's hard is converting between all these different systems. Unicode is simple enough to be done flawlessly as long as you stick to Unicode for everything.
If you only need to receive, store, and send text, Unicode is easy enough and you can just treat it as a byte stream. Once you get into things like manipulating text, comparisons and searches, or displaying text, things get hairy and all kinds of fun algorithms from the various Unicode Technical References and Notes make their appearance. Those parts are the ones that increase complexity.
Also, a major reason why Unicode is large and complex is that languages and scripts are large and complex. Unless we all agree to use simple, computer-friendly languages and scripts, that complexity is not going to change, and the need to work with older scripts (e.g. for historians and researchers) still requires something like Unicode. Unicode is the kind of thing that emerges from a messy world, and unsurprisingly it's messy as well.
Unicode is still _way_ less hard than anything else for manipulating text. Global human written language is complicated; Unicode is a pretty ingeniously designed standard, and it's got solutions that work pretty darn well for almost any common manipulation you'd want to do. Now, everything isn't always implemented or easily accessible on every platform, and people don't always understand what to do with it -- because global human written language is complicated -- but Unicode is a pretty amazing accomplishment, quite successful in various meanings of 'successful'.
It's hard, because there's a lot more to learn and to do than if you stick to (say) ASCII and ignore the problems ASCII can't handle.
It's easy, because if you want to solve a sizable fraction of all the problems ASCII just gives up on, Unicode's remarkably simple.
In the eyes of a monoglot Brit who just wants the Latin alphabet and the pound sign, unicode probably seems like a lot of moving parts for such a simple goal.
Something as simple as moving the insertion point in response to an arrow key requires a big table of code point attributes and changes with every new version of Unicode. Seemingly simple questions like "how long is this string?" or "are these two strings equal?" have multiple answers and often the answer you need requires those big version-dependent tables.
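A quick Python sketch of those multiple answers (the string contents are just illustrative):

```python
import unicodedata

s = "e\u0301"   # "é" as 'e' + U+0301 COMBINING ACUTE ACCENT (decomposed)
t = "\u00e9"    # "é" as the single precomposed code point U+00E9

# "How long is this string?" depends on what you count:
print(len(s), len(t))   # 2 1 -- code points, not user-perceived characters

# "Are these two strings equal?" depends on normalization:
print(s == t)                                # False: different code point sequences
print(unicodedata.normalize("NFC", s) == t)  # True after normalizing to NFC
```

And neither answer involves grapheme clusters yet, which is where the big version-dependent tables come in.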
I think Unicode is about as simple as it can possibly be given the complexity of human language, but that doesn't make it simple.
A Brit hoping to encode the Queen's English in ASCII is, I'm afraid, somewhat naïve. An American could, of course, be perfectly happy with the ASCII approximation of "naive", but wouldn't that be a rather barbaric solution? ;)
For anything resembling sanely typeset text you’d also want apostrophes, proper “quotes” — as well as various forms of dashes and spaces. Plus, many non-trivial texts contain words in more than one language. I’d rather not return to the times of in-band codepage switching, or embedding foreign words as images.
But comparing something to something else and finding it easier doesn't, by itself, make it easy.
Paraphrasing the joke about new standards: we had a problem, so we created a beautiful abstraction. Now we have more problems. One of the new problems being normalization.
It doesn't undermine the good that Unicode brought, but you can't claim to just include some unilib.h and use its functions without understanding all the Unicode quirks and its encodings, because some of the parameters wouldn't even make sense to you, like those same normalization forms.
1. Either you restrict yourself to the kind of text CP437/MCS/ASCII can handle (to name the three codecs in the blog posting). In that case unicode normalisation is a no-op, and you can use unicode without understanding all its quirks.
2. Or you don't restrict the input, in which case unicode may be hard, but using CP437/MCS/ASCII will be incomparably harder.
Unicode IS hard. It's hard because concepts that exist in ASCII don't really extend to Unicode, and many of them depend on what locale you're operating in. Things like case conversion (in Turkish, ToUpper("i") should be "İ", not "I"), comparison (where do you put é, ê, and e?), what constitutes a character, word, whitespace, what direction do you write text in, how many spaces do characters take up in the terminal, etc.
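Python's built-in case conversion illustrates a couple of these traps; it is deliberately locale-independent, so the Turkish rule above is exactly the kind of thing it gets wrong:

```python
# Case conversion is not a 1:1, locale-free mapping:
print("ß".upper())       # "SS" -- one character becomes two
print(len("İ".lower()))  # 2    -- lowercasing can add a combining mark
print("i".upper())       # "I"  -- always, even where Turkish needs "İ"
```

Locale-aware case mapping needs something like ICU; the standard library alone can't do it correctly for Turkish.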
Some of these concepts exist when limited to ASCII.
For example, in olden times, or when restricted to ASCII, the Nordic letter "å" is written "aa", but it is still sorted at the end of the alphabet — "Aarhus" will be close to the end of a list of towns.
In Welsh there are several digraphs, single letters written with two symbols. The town "Llanelli" has 6 letters in Welsh. (There are ligatures, but I don't think they're often used: Ỻaneỻi.)
Indeed, collation, case-insensitive string matching, and probably a bunch of other things must be used with an appropriate locale. That was the case before Unicode and is still the case with Unicode. The only difference is that the tables for how to do it are slightly larger now, but the operation itself isn't (much) more complex.
I would edit that to say "as long as you stick with UTF-8 for everything." Unicode defines more than one encoding, not just UTF-8, but also UTF-16 and UTF-32.
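The three encodings represent the same code point as different byte sequences, e.g.:

```python
s = "é"  # U+00E9
print(s.encode("utf-8"))      # b'\xc3\xa9'
print(s.encode("utf-16-le"))  # b'\xe9\x00'
print(s.encode("utf-32-le"))  # b'\xe9\x00\x00\x00'
```

Which is why "stick to Unicode" isn't quite enough: two programs can both use Unicode and still disagree at the byte level.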
It's hard in command line tools like coreutils since there is no setting (afaict) for making sure all string comparisons are normalized. So comparing files whose names use decomposed vs precomposed glyphs is painful. E.g. with make: if the generated files use decomposed glyphs but you type precomposed glyphs into your makefile, then nothing will work, despite the filenames appearing to be the same.
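A sketch of that make problem in Python ("café.txt" is a hypothetical filename); normalizing both sides to the same form is what makes the comparison work:

```python
import unicodedata

on_disk = "cafe\u0301.txt"  # decomposed, as e.g. macOS filesystems return it
typed   = "caf\u00e9.txt"   # precomposed, as typed into the makefile

print(on_disk == typed)  # False, though both display as "café.txt"

# Normalize both sides before comparing:
print(unicodedata.normalize("NFC", on_disk) ==
      unicodedata.normalize("NFC", typed))   # True
```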