Re: Whitespace characters in Unicode

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Sun, 7 Aug 2016 16:08:58 -0700

On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard
> <lists+unicode_at_seantek.com> wrote:
>
> > What makes a character a "whitespace" in Unicode, e.g., why are
> > ZWSP and ZWNBSP not "whitespace" even though they clearly say
> > "SPACE" in them?
>
>
> I think "white space" basically wants to have an advance width (occupy
> space) but no ink (all white, no black) :-)

Yes, that is the thought that I had as well: whitespace characters
always generate blank space between graphemes, whether horizontal or
vertical.

>
> ZWSP and ZWNBSP affect word and line breaking but have no advance width.

I suppose that these are "SPACE" characters, but not "WHITE space"
characters, since there is no white in them. :)
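
To make the name-versus-property distinction concrete, here is a minimal
Python sketch. The White_Space set below is transcribed by hand from
PropList.txt as I understand it (an assumption; check it against the
current Unicode Character Database), and the helper name is made up for
illustration.

```python
import unicodedata

# White_Space code points, hand-transcribed from PropList.txt.
# ASSUMPTION: verify against the current UCD before relying on this.
WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))       # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))     # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def named_space_but_not_white_space(code_points):
    """Return (code point, name) pairs whose name contains SPACE
    but which do not have the White_Space property."""
    result = []
    for cp in code_points:
        name = unicodedata.name(chr(cp), "")
        if "SPACE" in name and cp not in WHITE_SPACE:
            result.append((cp, name))
    return result


samples = [0x0020, 0x00A0, 0x2003, 0x200B, 0xFEFF]
for cp, name in named_space_but_not_white_space(samples):
    print(f"U+{cp:04X} {name}")
```

With those five samples, only ZWSP (U+200B) and ZWNBSP (U+FEFF) should be
reported: "SPACE" in the name, but no White_Space property.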

>
> Note that character names can be misleading, plain wrong, or even just
> misspelled, but they cannot be changed. Best to read the
> documentation. The charts are a good start:
> http://www.unicode.org/charts/PDF/U2000.pdf
> http://www.unicode.org/charts/PDF/UFE70.pdf
>
> In particular, don't build sets of Unicode characters just based on
> character name patterns. Use character properties as much as possible.
>
> > What are "Unicode-y" ways to compute word boundaries?
>
>
> http://www.unicode.org/reports/tr29/#Word_Boundaries
>
> > Related to prior question--I suppose ZWSP is not "whitespace", but
> > like whitespace, it separates words. I suppose that since it is
> > not printable, it is "confusing", and therefore should be avoided
> > in contexts where the printed representation of Unicode code
> > points matters.
>
>
> Depends on what you do.
>
> Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping
> and line breaking in a browser or text field/editor.
>
> They are not allowed in identifiers, and removed from domain names
> (UTS #46).
>
> > Why is Pattern_White_Space significantly disjoint from
> > White_Space, namely, why does Pattern_White_Space include LRM and
> > RLM (and notably LS and PS) yet omit the spaces U+1680 and in the
> > U+2000 range?
>
>
> We wanted a simple, immutable definition for rule and pattern strings
> that programmers write and maintain. We included LRM and RLM so that
> they can be used (and will be ignored) in rules, for example collation
> rule strings, to keep them moderately readable when they contain RTL
> characters. Typographic spaces are unnecessary in this context, and
> could be confusing.
>
> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)

I like the premise of LS and PS: one (well, two) unambiguous characters
to rule them all. But the execution was lacking, to put it mildly. And
there aren't two keys on a common keyboard to distinguish between line
and paragraph separation.
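
For reference, the overlap between the two properties can be sketched in
a few lines of Python. Both sets are transcribed by hand from
PropList.txt as I understand it (an assumption; check them against the
current data files), and fmt is just a hypothetical display helper.

```python
# Both sets are hand-transcribed from PropList.txt.
# ASSUMPTION: verify against the current UCD before relying on them.
WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))       # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))     # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)
PATTERN_WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))       # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085]
    + [0x200E, 0x200F]                # LRM, RLM
    + [0x2028, 0x2029]                # LS, PS
)


def fmt(code_points):
    """Render a set of code points as sorted U+XXXX strings."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE
shared = WHITE_SPACE & PATTERN_WHITE_SPACE

print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:        ", fmt(only_white))
print("In both:                 ", fmt(shared))
```

Under these assumptions, only LRM and RLM are in Pattern_White_Space
without being White_Space, while LS and PS are in both; the typographic
spaces (U+00A0, U+1680, U+2000..U+200A, U+202F, U+205F, U+3000) are
White_Space only.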

On 8/6/2016 11:30 AM, Doug Ewell wrote:
> Additionally, in UTF-8, either LS or PS actually takes more bytes than
> CR plus LF, so the "increased text size" argument also discouraged use
> of the new controls.

That is true: LS and PS each take 3 bytes in UTF-8, versus 2 for CR plus
LF. However, the original UTF-8 proposal (FSS-UTF, as described at
https://en.wikipedia.org/wiki/UTF-8) encoded U+0080 - U+207F in two octets:
    10xxxxxx 1xxxxxxx

So, the space block /just barely makes it/. Was this intentional during
the original design of UTF-8, or just a coincidence? I think it was more
than a coincidence. It is regrettable that the space block sits too high
for two octets in the final version of UTF-8, where two octets reach only
up to U+07FF...maybe it should have been placed at or below U+07FF.
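
The byte counts are easy to check. A minimal Python sketch follows; the
two-octet range for the original proposal is taken from the figures
quoted above (U+0080 - U+207F), and the helper names are hypothetical.

```python
def utf8_len(text):
    """Number of bytes the text occupies in (final) UTF-8."""
    return len(text.encode("utf-8"))


def fits_two_octets_original(cp):
    """True if the code point fell in the two-octet range of the
    original proposal, ASSUMING the range U+0080..U+207F quoted above."""
    return 0x0080 <= cp <= 0x207F


LS, PS = "\u2028", "\u2029"

print("CR LF:", utf8_len("\r\n"), "bytes")
print("LS:   ", utf8_len(LS), "bytes")
print("PS:   ", utf8_len(PS), "bytes")

# Final UTF-8: two octets stop at U+07FF.
print("U+07FF:", utf8_len("\u07ff"), "bytes")
print("U+0800:", utf8_len("\u0800"), "bytes")

# Original proposal: the spaces at U+2000..U+200A, plus LS and PS,
# would all have fit in two octets under the assumed range.
for cp in (0x2000, 0x200A, 0x2028, 0x2029, 0x3000):
    print(f"U+{cp:04X} two octets originally: {fits_two_octets_original(cp)}")
```

Note that U+3000 (IDEOGRAPHIC SPACE) falls outside the two-octet range
either way.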

(More motivation for my whitespace question in following message...)

Sean
Received on Sun Aug 07 2016 - 18:28:29 CDT
