Re: Whitespace characters in Unicode

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Mon, 8 Aug 2016 16:07:59 +0900

On 2016/08/08 08:08, Sean Leonard wrote:
> On 8/6/2016 11:30 AM, Doug Ewell wrote:
>> Additionally, in UTF-8, either LS or PS actually takes more bytes than
>> CR plus LF, so the "increased text size" argument also discouraged use
>> of the new controls.
>
> That is true, it takes 3 bytes. However, the original UTF-8 proposal

The term "original UTF-8 proposal" is quite misleading, because that
proposal was never labeled as UTF-8. "FSS-UTF draft version" would be
much better.

> encoded U+0080 - U+207F in two octets:
> https://en.wikipedia.org/wiki/UTF-8 :
> |10xxxxxx| |1xxxxxxx|
>
>
> So, the space block /just barely makes it/. Was this intentional during
> the original design of UTF-8, or just a coincidence? I think it was more
> than a coincidence.

Just a coincidence, I'd say. When designing such schemes, trying to be
compact is obviously one of the goals. But "how can I design it so that
these two characters still make it as two bytes" isn't.
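
For anyone who wants to check the arithmetic behind the quoted range, here
is a small sketch. It takes the draft byte pattern and the U+0080 - U+207F
range exactly as quoted above, and uses the well-known two-byte form of
final UTF-8 for comparison. The helper name payload_bits is made up for
illustration only.

```python
# Payload bits are the 'x' positions in a byte pattern.
def payload_bits(pattern: str) -> int:
    return pattern.count("x")

DRAFT_TWO_BYTE = "10xxxxxx 1xxxxxxx"   # draft pattern, as quoted above
FINAL_TWO_BYTE = "110xxxxx 10xxxxxx"   # two-byte form in final UTF-8

draft_capacity = 2 ** payload_bits(DRAFT_TWO_BYTE)   # 13 bits
final_capacity = 2 ** payload_bits(FINAL_TWO_BYTE)   # 11 bits

# Number of code points in the quoted draft range U+0080..U+207F.
draft_range_size = 0x207F - 0x0080 + 1

LS, PS = 0x2028, 0x2029

# Do LS and PS fall inside each scheme's two-byte range?
in_draft_two_byte = all(0x0080 <= cp <= 0x207F for cp in (LS, PS))
in_final_two_byte = all(0x0080 <= cp <= 0x07FF for cp in (LS, PS))

print(draft_capacity, draft_range_size, final_capacity)
print(in_draft_two_byte, in_final_two_byte)
```

The 13 payload bits account for exactly the size of the quoted range, so
the upper limit follows from the bit layout rather than from any
particular character near it.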

> It is regrettable that the space block was too high
> to work in the final version of UTF-8...maybe it should have gone below
> U+07FF.

There aren't too many line breaks (and usually even fewer paragraph
breaks) in a text, so the overall effect of the encoding length of LS
or PS was really not that much of an issue. The main reason they
didn't spread was that everybody was already dealing with several
variants of line breaks and didn't want more of them, even with the
prospect of (potentially, eventually, in the very, very long run maybe)
having only a single one.
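
To put a rough number on the size effect, here is a minimal sketch. The
sample text (100 lines of 70 ASCII characters each) is purely
hypothetical, and the function names utf8_len and relative_growth are
invented for this illustration.

```python
def utf8_len(s: str) -> int:
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

def relative_growth(text: str) -> float:
    """Fractional UTF-8 size increase if every CR LF is replaced by LS."""
    before = utf8_len(text)
    after = utf8_len(text.replace("\r\n", "\u2028"))
    return (after - before) / before

# Per-terminator cost in bytes.
for name, term in (("CR LF", "\r\n"), ("LS", "\u2028"), ("PS", "\u2029")):
    print(name, utf8_len(term))

# Hypothetical sample: 100 lines of 70 ASCII characters, CR LF terminated.
sample = ("x" * 70 + "\r\n") * 100
print(f"{relative_growth(sample):.2%}")
```

With lines of that length the difference is one extra byte per line,
which is small next to the text itself. Shorter lines would raise the
percentage, longer lines would lower it.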

Regards, Martin.
Received on Mon Aug 08 2016 - 02:09:33 CDT
