Re: Whitespace characters in Unicode from Andrea Giammarchi on 2016-08-04 (Unicode Mail List Archive)

From: Andrea Giammarchi <andrea.giammarchi_at_gmail.com>
Date: Thu, 4 Aug 2016 23:19:31 +0100

I'm not a Unicode expert, but I couldn't stop thinking about the following
comic after reading "I am trying to come up with a sensible set of
characters that are considered whitespace": https://xkcd.com/927/

Apologies for bringing pretty much nothing to this discussion but I'm
pretty sure there's much more to discuss in this ML than another whitespace
set on top of 25 characters already.

Thanks for your patience and your understanding.

Have a great weekend everyone!
Best Regards

On Thu, Aug 4, 2016 at 10:28 PM, Leonardo Boiko <leoboiko_at_namakajiri.net>
wrote:

> I'm sorry; I thought that, when you wanted to separate identifiers, it
> might be interesting to follow existing regexps definitions; this way your
> syntax would play along with already-existing tools (e.g. you'd be making
> it easy for someone to pipe your language into grep -P "\p{Whitespace}").
>
> But I was talking out of my depth; I've never worked with defining Unicode
> identifiers, so I'm not really qualified to answer. I'm sure Davis and the
> others can give better answers to your questions. Meanwhile, I see that
> UAX #31 goes further into Unicode identifiers. It says that
> Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended
> for use in regexp-like "patterns" which mix literal characters, whitespace,
> and syntax (special characters), where the latter two would e.g. require
> quoting. For example, Perl has a "/x" flag which makes unquoted
> Pattern_White_Space characters be ignored in regexps (so that you can make
> them less illegible).
>
> However, UAX #31 also gives a Default Identifier Syntax, which bounds
> identifiers not by Whitespace but by their start characters, identified by
> ID_Start, defined like this:
>
> > ID_Start characters are derived from the Unicode General_Category of
> > uppercase letters, lowercase letters, titlecase letters, modifier letters,
> > other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
> > and Pattern_White_Space code points.
>
> So it makes reference only to Pattern_White_Space and not Whitespace. On
> the other hand, I guess the listing above will exclude Whitespace
> characters, since they don't count as any of letters, numbers, or
> Other_ID_Start?
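
The guess above can be checked mechanically. The following is a minimal
Python sketch, not a reference implementation: it assumes the White_Space
and Other_ID_Start sets below were copied correctly by hand from
PropList.txt (Unicode 9.0), and it relies on the General_Category data in
whatever Unicode version the local Python ships. The names WHITE_SPACE,
OTHER_ID_START and could_be_id_start are made up for this illustration.

```python
import unicodedata

# White_Space=Yes code points (25 in total), written out by hand.
WHITE_SPACE = (
    set(range(0x0009, 0x000E)) | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B)) | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# General_Category values that feed ID_Start: Lu, Ll, Lt, Lm, Lo, Nl.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Other_ID_Start, copied by hand (assumption: Unicode 9.0 PropList.txt).
OTHER_ID_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}


def could_be_id_start(cp):
    """Rough test of the quoted derivation, before the Pattern_* subtraction."""
    in_category = unicodedata.category(chr(cp)) in ID_START_CATEGORIES
    return in_category or cp in OTHER_ID_START


# White_Space code points that would qualify as ID_Start: expected to be none.
whitespace_id_starts = sorted(cp for cp in WHITE_SPACE if could_be_id_start(cp))
print(whitespace_id_starts)
```

If the list comes out empty, then no White_Space character can begin a
default identifier, even though the definition only subtracts
Pattern_White_Space explicitly.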
>
> None of that is guaranteed to be stable, though. UAX #31 includes a
> separate definition for "Immutable identifiers", which are, and suggests
> various compromises between them.
>
>
> 2016-08-04 17:44 GMT-03:00 Sean Leonard <lists+unicode_at_seantek.com>:
>
>> I read through TR18...it mainly says that <space> == \s == \p{Whitespace}
>> == property White_Space is true. Does it say anything else or more
>> significant than that, that I'm missing?
>>
>> Sean
>>
>>
>> On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>>
>> What Mark Davis said; also, depending on what you need, consider taking a
>> look at the definitions used by Unicode regexpes, at
>> http://unicode.org/reports/tr18/ .
>>
>> 2016-08-04 16:37 GMT-03:00 Sean Leonard <lists+unicode_at_seantek.com>:
>>
>>> Hi Unicode Folks:
>>>
>>> I am trying to come up with a sensible set of characters that are
>>> considered whitespace or newlines in Unicode, and to understand the
>>> relative stability policy with respect to them. (This is for a formal
>>> syntax where the definition of "whitespace" matters, e.g., to separate
>>> identifiers, and I want to be as conservative as possible.) Please let me
>>> know if the stuff below is correct, or needs work.
>>>
>>> The following characters / sequences are considered line breaking
>>> characters, per UAX #14 and Section 5.8 of the Unicode Standard:
>>>
>>> CRLF CR LF FF VT NEL LS PS
>>>
>>> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
>>> combination U+000D U+000A (treated as one line break). These characters /
>>> sequences are called "newlines".
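
As a rough illustration of the "newlines" set just described, here is a
Python sketch that splits text on exactly those characters and treats
CR LF as a single break. The names NEWLINE and split_lines are
hypothetical. Python's built-in str.splitlines() is deliberately not used,
since it also breaks on U+001C-U+001E, which are not in the list above.

```python
import re

# CRLF is tried first so that U+000D U+000A counts as one line break.
# The character class covers LF, VT, FF, CR, NEL, LS and PS.
NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at every newline in the set above (breaks are dropped)."""
    return NEWLINE.split(text)


sample = "a\r\nb\rc\nd\x0be\x0cf\x85g\u2028h\u2029i"
print(split_lines(sample))
```

Note that a trailing newline yields a final empty string, which a real
parser might want to drop.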
>>>
>>> There will not be any additional code points that are assigned to be
>>> line breaks. (Correct?)
>>>
>>> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
>>> These are distinguished from other codes (above) that also mean line
>>> breaks, mainly because of historical and widespread use of them.
>>>
>>> There are several formatting characters that affect word wrapping and
>>> line breaking, as discussed in those documents...but they are not line
>>> breaking characters.
>>>
>>> ****
>>>
>>> The following characters are whitespaces: characters (code points) with
>>> the property WSpace=Y (or White_Space). This is:
>>>
>>> newlines
>>> U+0009 U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>>
>>> Assigned characters that are not listed above, can never be whitespace
>>> (according to Unicode). However, the set is not closed, so unassigned code
>>> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
>>> Pretty much never going to happen?) that additional code points will be
>>> assigned to whitespace.
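
For a formal syntax, one conservative option is to hard-code the set
rather than rely on a runtime's notion of "space". The sketch below
assumes the White_Space property as of Unicode 9.0, which is the newlines
above plus U+0009 (TAB) and the listed space characters, 25 code points
in all. WHITE_SPACE_RANGES and is_white_space are names invented for this
example.

```python
# White_Space=Yes ranges, copied by hand (assumption: Unicode 9.0).
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D),  # TAB, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]

WHITE_SPACE = frozenset(
    cp for lo, hi in WHITE_SPACE_RANGES for cp in range(lo, hi + 1)
)


def is_white_space(ch):
    """True if the single character ch has White_Space=Yes in the table above."""
    return ord(ch) in WHITE_SPACE


print(len(WHITE_SPACE))
# Python's str.isspace() is not the same test: it accepts U+001C, for example.
print("\x1c".isspace(), is_white_space("\x1c"))
```

A frozen table like this will not pick up a newly assigned whitespace
character, which is the trade-off of being conservative.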
>>>
>>> ****
>>>
>>> There are some other characters that Unicode does not consider
>>> whitespace, but deserve discussion:
>>> U+180E MONGOLIAN VOWEL SEPARATOR:
>>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>>> U+200B ZERO WIDTH SPACE
>>> U+200C ZERO WIDTH NON-JOINER
>>> U+200D ZERO WIDTH JOINER
>>> U+200E LEFT-TO-RIGHT MARK*
>>> U+200F RIGHT-TO-LEFT MARK*
>>> U+2060 WORD JOINER
>>> U+FEFF ZERO WIDTH NON-BREAKING SPACE
>>>
>>> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
>>> U+2000-200A characters, which are obviously spaces. This is confusing and I
>>> would appreciate clarification of *why* Pattern_White_Space differs so
>>> much from White_Space.
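
To make the relationship between the two properties concrete, here is a
small Python sketch comparing them as sets. Both tables are assumptions
copied by hand from PropList.txt (Unicode 9.0); expand and fmt are helper
names made up for the example.

```python
def expand(ranges):
    """Turn inclusive (lo, hi) ranges into a frozenset of code points."""
    return frozenset(cp for lo, hi in ranges for cp in range(lo, hi + 1))


WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
])

PATTERN_WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
])

shared = WHITE_SPACE & PATTERN_WHITE_SPACE
pattern_only = PATTERN_WHITE_SPACE - WHITE_SPACE
white_space_only = WHITE_SPACE - PATTERN_WHITE_SPACE


def fmt(cps):
    return " ".join("U+{:04X}".format(cp) for cp in sorted(cps))


print("shared:          ", fmt(shared))
print("pattern only:    ", fmt(pattern_only))
print("White_Space only:", fmt(white_space_only))
```

Under these tables the two sets overlap rather than being disjoint: they
share the ASCII controls, SPACE, NEL, LS and PS, and differ only in
LRM/RLM on one side and the non-ASCII space characters on the other.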
>>>
>>> ********
>>> The borderline characters above are not considered WSpace=Y, but
>>> sometimes might have space-like properties. ZWSP and ZWNBSP are obviously
>>> "space" characters, but they never generate whitespace. I suppose that
>>> conversely LRM and RLM are obviously "not space" characters, but they
>>> could generate whitespace under certain circumstances. Ditto for other
>>> formatting characters in general (for which the class is much larger).
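
One way to see how the borderline characters above are classified is to
ask a Unicode database for their General_Category. This Python sketch
uses the standard unicodedata module; the results depend on the Unicode
version bundled with the interpreter (the values assumed here hold for
Unicode 6.3 and later, where U+180E is a format character). BORDERLINE and
report are names chosen for illustration.

```python
import unicodedata

# The borderline code points listed in the message above.
BORDERLINE = [0x180E, 0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF]

# Map each code point to (General_Category, result of Python's str.isspace()).
report = {
    "U+{:04X}".format(cp): (unicodedata.category(chr(cp)), chr(cp).isspace())
    for cp in BORDERLINE
}

for label, (category, is_space) in report.items():
    print(label, category, is_space)
```

With a recent Unicode database, every entry should come back as Cf
(format) rather than Zs (space separator), which matches their exclusion
from White_Space.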
>>>
>>> Therefore I guess a Unicode definition of "whitespace" (or "space
>>> characters") is: an assigned code point that *always* (is supposed to)
>>> generates white space (empty space between graphemes).
>>>
>>> ********
>>>
>>> Are there other standards that Unicode people recommend, that have
>>> addressed whether certain borderline characters are considered whitespace
>>> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
>>> component)?
>>>
>>> Regards,
>>>
>>> Sean
>>>
>>
>>
>>
>
Received on Thu Aug 04 2016 - 17:20:47 CDT
