Re: Whitespace characters in Unicode

From: Leonardo Boiko <leoboiko_at_namakajiri.net>
Date: Thu, 4 Aug 2016 17:17:04 -0300

What Mark Davis said; also, depending on what you need, consider taking a
look at the definitions used by Unicode regular expressions, at
http://unicode.org/reports/tr18/ .

2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode_at_seantek.com>:

> Hi Unicode Folks:
>
> I am trying to come up with a sensible set of characters that are
> considered whitespace or newlines in Unicode, and to understand the
> relative stability policy with respect to them. (This is for a formal
> syntax where the definition of "whitespace" matters, e.g., to separate
> identifiers, and I want to be as conservative as possible.) Please let me
> know if the stuff below is correct, or needs work.
>
> The following characters / sequences are considered line breaking
> characters, per UAX #14 and Section 5.8 of UNICODE:
>
> CRLF CR LF FF VT NEL LS PS
>
> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination
> U+000D U+000A (treated as one line break). These characters / sequences are
> called "newlines".
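
[Illustration: a minimal Python sketch of splitting text on the "newlines"
listed above, with U+000D U+000A treated as a single break. The names
NEWLINE and split_lines are hypothetical, and the pattern only mirrors the
list in this message; it is not a full UAX #14 implementation.]

```python
import re

# CRLF comes first in the alternation so that U+000D U+000A is consumed
# as one line break rather than two.
# The character class covers LF, VT, FF, CR, NEL, LS and PS.
NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at every newline sequence listed in the message."""
    return NEWLINE.split(text)


if __name__ == "__main__":
    sample = "a\r\nb\rc\nd\x85e\u2028f"
    print(split_lines(sample))
```

Putting the two-character CRLF alternative before the single-character
class is the design choice that matters here; with the order reversed, CR
would match alone and LF would produce an extra empty line.
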
>
> There will not be any additional code points that are assigned to be line
> breaks. (Correct?)
>
> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
> These are distinguished from other codes (above) that also mean line
> breaks, mainly because of historical and widespread use of them.
>
> There are several formatting characters that affect word wrapping and line
> breaking, as discussed in those documents...but they are not line breaking
> characters.
>
> ****
>
> The following characters are whitespaces: characters (code points) with
> the property WSpace=Y (or White_Space). This is:
>
> newlines
> U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
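
[Illustration: a minimal Python sketch of a White_Space membership test and
a token splitter built on it. The ranges are hard-coded from the White_Space
entries in PropList.txt as I understand them (U+0009 TAB is part of the
U+0009..U+000D range there); please verify against the current data file.
The names WHITE_SPACE, is_white_space and split_on_white_space are
hypothetical.]

```python
# White_Space code point ranges (inclusive), assumed from PropList.txt.
_RANGES = [
    (0x0009, 0x000D),  # TAB, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]

WHITE_SPACE = frozenset(
    chr(cp) for lo, hi in _RANGES for cp in range(lo, hi + 1)
)


def is_white_space(ch):
    """Return True if the single character ch has White_Space=Y."""
    return ch in WHITE_SPACE


def split_on_white_space(text):
    """Split text into tokens separated by runs of White_Space characters."""
    tokens, current = [], []
    for ch in text:
        if is_white_space(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

With this definition, U+3000 separates identifiers but U+200B ZERO WIDTH
SPACE does not, which is the conservative behaviour the question is after.
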
>
> Assigned characters that are not listed above can never be whitespace
> (according to Unicode). However, the set is not closed, so unassigned code
> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
> Pretty much never going to happen?) that additional code points will be
> assigned to whitespace.
>
> ****
>
> There are some other characters that Unicode does not consider whitespace,
> but deserve discussion:
> U+180E MONGOLIAN VOWEL SEPARATOR:
> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
> U+200B ZERO WIDTH SPACE
> U+200C ZERO WIDTH NON-JOINER
> U+200D ZERO WIDTH JOINER
> U+200E LEFT-TO-RIGHT MARK*
> U+200F RIGHT-TO-LEFT MARK*
> U+2060 WORD JOINER
> U+FEFF ZERO WIDTH NO-BREAK SPACE
>
> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
> U+2000-200A characters, which are obviously spaces. This is confusing and I
> would appreciate clarification *why* Pattern_White_Space is significantly
> disjoint from White_Space.
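
[Illustration: a minimal Python sketch comparing the two properties as plain
sets of code points. Both lists are hard-coded from PropList.txt as I
understand it and should be checked against the current data file; the
helper names expand and fmt are hypothetical.]

```python
def expand(ranges):
    """Turn inclusive (lo, hi) code point ranges into a frozenset of ints."""
    return frozenset(cp for lo, hi in ranges for cp in range(lo, hi + 1))


# Assumed contents of White_Space.
WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
])

# Assumed contents of Pattern_White_Space.
PATTERN_WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
])


def fmt(code_points):
    """Render a set of code points as a sorted 'U+XXXX ...' string."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))


both = WHITE_SPACE & PATTERN_WHITE_SPACE
only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

if __name__ == "__main__":
    print("in both:              ", fmt(both))
    print("Pattern_White_Space only:", fmt(only_pattern))
    print("White_Space only:     ", fmt(only_white))
```

Under these assumptions the two sets overlap rather than being disjoint:
they share the ASCII controls, SPACE, NEL, LS and PS, while only LRM and
RLM are in Pattern_White_Space without being in White_Space.
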
>
> ********
> The borderline characters above are not considered WSpace=Y, but sometimes
> might have space-like properties. ZWSP and ZWNBSP are obviously "space"
> characters, but they never generate whitespace. I suppose that conversely
> LRM and RLM are obviously "not space" characters, but they could generate
> whitespace under certain circumstances. Ditto for other formatting
> characters in general (for which the class is much larger).
>
> Therefore I guess a Unicode definition of "whitespace" (or "space
> characters") is: an assigned code point that *always* (is supposed to)
> generates white space (empty space between graphemes).
>
> ********
>
> Are there other standards that Unicode people recommend, that have
> addressed whether certain borderline characters are considered whitespace
> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
> component)?
>
> Regards,
>
> Sean
>
Received on Thu Aug 04 2016 - 15:17:21 CDT
