Re: Whitespace characters in Unicode

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Sun, 7 Aug 2016 16:46:27 -0700

On 8/5/2016 10:07 AM, Markus Scherer wrote:
> On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard
> <lists+unicode_at_seantek.com> wrote:
>
> > What makes a character a "whitespace" in Unicode, e.g., why are
> > ZWSP and ZWNBSP not "whitespace" even though they clearly say
> > "SPACE" in them?
> >
> > Any implementation experience from other standards
> > authors/implementers who have run into problems with shifty
> > whitespace definitions?
>
> Use properties, not character name patterns. If you have strong
> reasons not to use a property as-is, then still use it, just with
> inclusion & exclusion overrides.

Short answer: I cannot use character properties, and cannot use
exclusion overrides.

As I have posted publicly, I am proposing some experimental
Unicode-friendly extensions to IETF ABNF (currently in
https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05 ,
going to change that around a bit). There is (some) renewed interest in
that part of the work since RFCs will permit UTF-8 in certain places,
and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198.

Being a BNF, ABNF does not have exclusion, only incremental
alternatives. Matching on character properties would require a runtime
library, which goes significantly against the purpose of (A)BNF.
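
To illustrate the point about exclusion: a minimal sketch, in Python, of
how "this range except those characters" has to be unrolled into plain
alternatives before it can be written in ABNF. The helper name
abnf_without is hypothetical and not part of any draft or tool.

```python
def abnf_without(lo, hi, excluded):
    """Render the inclusive range lo..hi minus `excluded` as ABNF alternatives.

    ABNF has no exclusion operator, so the result is a list of
    sub-ranges joined with "/". (Hypothetical helper, for illustration.)
    """
    pieces = []
    start = lo
    for cp in sorted(set(excluded)):
        if cp < lo or cp > hi:
            continue  # excluded value lies outside the range; nothing to cut
        if start < cp:
            pieces.append((start, cp - 1))
        start = cp + 1
    if start <= hi:
        pieces.append((start, hi))
    return " / ".join(
        f"%x{a:02X}" if a == b else f"%x{a:02X}-{b:02X}" for a, b in pieces
    )


# VCHAR (%x21-7E) without DQUOTE (%x22) and backslash (%x5C)
print(abnf_without(0x21, 0x7E, [0x22, 0x5C]))
```

Every excluded character splits a range in two, which is why a rule with
many exclusions gets long quickly in ABNF.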

The current proposed core rules have <UNICODE> (Unicode scalar values,
i.e., U+0000-U+10FFFF with a doughnut hole for the surrogates
U+D800-U+DFFF) and <BEYONDASCII> (scalar values without the ASCII
range). While these are technically accurate, they will not be
particularly useful for protocol designers as they are over-inclusive.
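
As a rough sketch of what these two rules cover (my reading of the
description above, not the draft's actual rule text), the ranges can be
written out as tables in Python. The names UNICODE, BEYONDASCII, and
matches are used here for illustration only.

```python
# Inclusive code point ranges; assumed from the prose description,
# not copied from the draft.
UNICODE = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]      # scalar values
BEYONDASCII = [(0x0080, 0xD7FF), (0xE000, 0x10FFFF)]  # scalar values minus ASCII


def matches(cp, ranges):
    """Return True if code point `cp` falls inside any of the ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)


print(matches(0x0041, UNICODE))      # 'A' is a scalar value
print(matches(0xD800, UNICODE))      # a surrogate is not
print(matches(0x0041, BEYONDASCII))  # ASCII is excluded here
print(matches(0x00E9, BEYONDASCII))  # U+00E9 is beyond ASCII
```

Both tables accept control characters, unassigned code points, and
noncharacters alike, which is the over-inclusiveness in question.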

One of the rules I am working on is <UCHAR>, which is the Unicode
analogue of <CHAR>. It eliminates the noncharacter code points (which,
technically, are characters...that are defined as "not characters") as
well as NULL, which <CHAR> already eliminates.
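
A minimal sketch of that membership test in Python, assuming <UCHAR>
means "scalar values minus NULL minus the 66 noncharacters" (the draft
may define it differently). The function names are hypothetical.

```python
def is_noncharacter(cp):
    """True for the 66 Unicode noncharacters.

    These are U+FDD0-U+FDEF plus the last two code points of each of
    the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_uchar(cp):
    """Assumed <UCHAR>: scalar value, not NULL, not a noncharacter."""
    if cp <= 0x0000 or cp > 0x10FFFF:
        return False  # NULL or out of the code space
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate, so not a scalar value
    return not is_noncharacter(cp)


# Count how many code points survive the exclusions.
print(sum(1 for cp in range(0x110000) if is_uchar(cp)))
```

Written as ABNF alternatives, the same set needs a separate range for
each gap, since every plane ends in two noncharacters.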

I was going to avoid extending <VCHAR> (which is U+0021-U+007E, i.e., no
spaces and no control characters) because it's a bit too complicated.
However, there are actual protocols, including a protocol that I am
working on, that define parts of the repertoire as "graphic symbols and
spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces
and no control characters). So the space characters are relevant at a
level beneath requiring a full Unicode runtime to get at the character
properties.
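
For illustration only, here is what a hard-coded "spacing characters"
table could look like in Python. It assumes the set is the Space
Separator (Zs) general category as of Unicode 9.0; that choice is an
assumption for this sketch, not what the draft proposes. The
unicodedata import is used only as a development-time cross-check; the
table itself needs no runtime library.

```python
import unicodedata  # only for the optional cross-check below

# Assumed set: General_Category=Zs, enumerated by hand as inclusive ranges.
SPACE_SEPARATORS = [
    (0x0020, 0x0020),  # SPACE
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]


def is_space_separator(cp):
    """Table lookup only; no property database is consulted."""
    return any(lo <= cp <= hi for lo, hi in SPACE_SEPARATORS)


def cross_check():
    """Confirm every listed code point is Zs in this Python's Unicode data."""
    return all(
        unicodedata.category(chr(cp)) == "Zs"
        for lo, hi in SPACE_SEPARATORS
        for cp in range(lo, hi + 1)
    )


print(is_space_separator(0x200B))  # ZWSP (U+200B) is not in the table
print(is_space_separator(0xFEFF))  # ZWNBSP (U+FEFF) is not in the table
```

A frozen table like this trades currency for simplicity: it would need a
manual update if a later Unicode version changed the category.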

The newline issue is related but separate, and since IETF continues to
use CRLF as the standard for interchange, I don't see a reason to touch
it further.

Best regards,

Sean
Received on Sun Aug 07 2016 - 19:21:33 CDT
