Re: Whitespace characters in Unicode

From: Leonardo Boiko <leoboiko_at_namakajiri.net>
Date: Thu, 4 Aug 2016 18:28:55 -0300

I'm sorry; I thought that, when you wanted to separate identifiers, it
might be interesting to follow existing regexps definitions; this way your
syntax would play along with already-existing tools (e.g. you'd be making
it easy for someone to pipe your language into grep -P "\p{Whitespace}").
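
For what it's worth, here is a rough Python sketch of what separating identifiers on White_Space could look like. The range table is my own transcription of the White_Space=Yes code points as of Unicode 9.0, and the names WHITE_SPACE_RANGES and split_identifiers are made up for the illustration. I build the character class by hand because Python's built-in re module has no \p{...} property syntax.

```python
import re

# White_Space=Yes code points, transcribed by hand from PropList.txt
# (Unicode 9.0). Assumption: re-check against the data file you target.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
    (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
    (0x3000, 0x3000),
]


def _class_item(lo, hi):
    # Render one range as a regex character-class item, using the
    # four-hex-digit escapes that the re module understands.
    if lo == hi:
        return "\\u{:04x}".format(lo)
    return "\\u{:04x}-\\u{:04x}".format(lo, hi)


# One or more consecutive White_Space characters act as a single separator.
WHITE_SPACE_RE = re.compile(
    "[" + "".join(_class_item(lo, hi) for lo, hi in WHITE_SPACE_RANGES) + "]+"
)


def split_identifiers(text):
    # Split on runs of White_Space and drop the empty strings that
    # re.split() produces at the edges of the input.
    return [tok for tok in WHITE_SPACE_RE.split(text) if tok]


if __name__ == "__main__":
    print(split_identifiers("foo\u00a0bar\u2003baz qux"))
    # ZERO WIDTH SPACE is not White_Space, so it does not separate tokens.
    print(split_identifiers("a\u200bb"))
```

Note that this deliberately does not use Python's \s, which for str patterns follows str.isspace() and so is not quite the same set as White_Space.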

But I was talking out of my depth; I've never worked with defining Unicode
identifiers, so I'm not really qualified to answer. I'm sure Davis and the
others can give better answers to your questions. Meanwhile, I see that
UAX #31 goes further into Unicode identifiers. It says that
Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended
for use in regexp-like "patterns" which mix literal characters, whitespace,
and syntax (special characters), where the latter two would e.g. require
quoting. For example, Perl has a "/x" flag which makes unquoted
Pattern_White_Space characters be ignored in regexps (so that you can make
them more legible).
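
To make the difference between the two properties concrete, here is a small Python sketch that compares them as sets. Both lists are transcribed by hand from PropList.txt as of Unicode 9.0, so please treat them as an assumption of the sketch and check them against the current data files. The names are mine.

```python
# Both lists are hand-transcribed from PropList.txt (Unicode 9.0); treat
# them as an assumption of this sketch, not as authoritative data.
PATTERN_WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)
WHITE_SPACE = frozenset(
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def fmt(code_points):
    # Format a set of code points as sorted "U+XXXX" labels.
    return " ".join("U+{:04X}".format(cp) for cp in sorted(code_points))


shared = PATTERN_WHITE_SPACE & WHITE_SPACE
only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

print("in both:                 ", fmt(shared))
print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:        ", fmt(only_white))
```

With these lists, the only Pattern_White_Space characters that are not White_Space are the two directional marks (U+200E and U+200F), while the no-break and typographic spaces fall on the White_Space-only side.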

However, UAX #31 also gives a Default Identifier Syntax, which bounds
identifiers not by Whitespace but by their start characters, identified by
ID_Start, defined like this:

> ID_Start characters are derived from the Unicode General_Category of
> uppercase letters, lowercase letters, titlecase letters, modifier letters,
> other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
> and Pattern_White_Space code points.

So it makes reference only to Pattern_White_Space and not Whitespace. On
the other hand, I guess the listing above will exclude Whitespace
characters, since they don't count as any of letters, numbers, or
Other_ID_Start?
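
A quick way to check that guess is to look at the General_Category of a few whitespace characters with Python's unicodedata module. This is only a sketch: approx_id_start is a hypothetical helper that covers just the General_Category part of the derivation, ignoring the "plus Other_ID_Start" and "minus Pattern_Syntax and Pattern_White_Space" adjustments, and the sample list is my own pick rather than the full White_Space set.

```python
import unicodedata

# General_Category values named in the UAX #31 derivation:
# Lu, Ll, Lt, Lm, Lo (letters) and Nl (letter numbers).
ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def approx_id_start(ch):
    # Approximation only: Other_ID_Start is not added, and Pattern_Syntax
    # and Pattern_White_Space are not subtracted.
    return unicodedata.category(ch) in ID_START_CATEGORIES


# A hand-picked sample of White_Space characters (not the full set).
SAMPLES = ["\t", "\n", " ", "\u0085", "\u00a0", "\u1680",
           "\u2003", "\u2028", "\u2029", "\u3000"]

for ch in SAMPLES:
    print("U+{:04X} gc={} id_start={}".format(
        ord(ch), unicodedata.category(ch), approx_id_start(ch)))
```

Every sample comes back with a separator or control category, none of which is in the ID_Start list, which is consistent with the guess above. The exact answers depend on the Unicode version bundled with your Python.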

None of that is guaranteed to be stable, though. UAX #31 includes a
separate definition for "Immutable Identifiers", which are stable, and
suggests various compromises between the two.

2016-08-04 17:44 GMT-03:00 Sean Leonard <lists+unicode_at_seantek.com>:

> I read through TR18...it mainly says that <space> == \s == \p{Whitespace}
> == property White_Space is true. Does it say anything else or more
> significant than that, that I'm missing?
>
> Sean
>
>
> On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>
> What Mark Davis said; also, depending on what you need, consider taking a
> look at the definitions used by Unicode regexps, at
> http://unicode.org/reports/tr18/ .
>
> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode_at_seantek.com>:
>
>> Hi Unicode Folks:
>>
>> I am trying to come up with a sensible set of characters that are
>> considered whitespace or newlines in Unicode, and to understand the
>> relative stability policy with respect to them. (This is for a formal
>> syntax where the definition of "whitespace" matters, e.g., to separate
>> identifiers, and I want to be as conservative as possible.) Please let me
>> know if the stuff below is correct, or needs work.
>>
>> The following characters / sequences are considered line breaking
>> characters, per UAX #14 and Section 5.8 of UNICODE:
>>
>> CRLF CR LF FF VT NEL LS PS
>>
>> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination
>> U+000D U+000A (treated as one line break). These characters / sequences are
>> called "newlines".
>>
>> There will not be any additional code points that are assigned to be line
>> breaks. (Correct?)
>>
>> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
>> These are distinguished from other codes (above) that also mean line
>> breaks, mainly because of historical and widespread use of them.
>>
>> There are several formatting characters that affect word wrapping and
>> line breaking, as discussed in those documents...but they are not line
>> breaking characters.
>>
>> ****
>>
>> The following characters are whitespaces: characters (code points) with
>> the property WSpace=Y (or White_Space). This is:
>>
>> newlines
>> U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>
>> Assigned characters that are not listed above, can never be whitespace
>> (according to Unicode). However, the set is not closed, so unassigned code
>> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
>> Pretty much never going to happen?) that additional code points will be
>> assigned to whitespace.
>>
>> ****
>>
>> There are some other characters that Unicode does not consider
>> whitespace, but deserve discussion:
>> U+180E MONGOLIAN VOWEL SEPARATOR:
>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>> U+200B ZERO WIDTH SPACE
>> U+200C ZERO WIDTH NON-JOINER
>> U+200D ZERO WIDTH JOINER
>> U+200E LEFT-TO-RIGHT MARK*
>> U+200F RIGHT-TO-LEFT MARK*
>> U+2060 WORD JOINER
>> U+FEFF ZERO WIDTH NO-BREAK SPACE
>>
>> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
>> U+2000-200A characters, which are obviously spaces. This is confusing and I
>> would appreciate clarification *why* Pattern_White_Space is
>> significantly disjoint from White_Space.
>>
>> ********
>> The borderline characters above are not considered WSpace=Y, but
>> sometimes might have space-like properties. ZWSP and ZWNBSP are obviously
>> "space" characters, but they never generate whitespace. I suppose that
>> conversely LRM and RLM are obviously "not space" characters, but they
>> could generate whitespace under certain circumstances. Ditto for other
>> formatting characters in general (for which the class is much larger).
>>
>> Therefore I guess a Unicode definition of "whitespace" (or "space
>> characters") is: an assigned code point that *always* (is supposed to)
>> generates white space (empty space between graphemes).
>>
>> ********
>>
>> Are there other standards that Unicode people recommend, that have
>> addressed whether certain borderline characters are considered whitespace
>> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
>> component)?
>>
>> Regards,
>>
>> Sean
>>
>
>
>
Received on Thu Aug 04 2016 - 16:29:23 CDT
