Re: Canonical block names: spaces vs. underscores

From: Ken Whistler <>
Date: Thu, 26 May 2016 09:03:20 -0700

On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` lists blocks such as `Cyrillic Supplement`.
> However, `PropertyValueAliases.txt` refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> Which is it?
> If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?

Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings

See the matching rules in UAX #44:

and in particular, the matching rule for symbolic values, which applies
in this case:

For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly the same (in the
sense of exact binary string equality) as in another.
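The matching rule for symbolic values (UAX44-LM3) says to ignore case, whitespace, underscores, hyphens, and any initial "is" prefix when comparing values. A minimal Python sketch of that comparison follows; the function names `loose_key` and `loose_match` are made up for illustration and do not come from the standard or from any library, and the sketch assumes plain ASCII property values.

```python
import re


def loose_key(value: str) -> str:
    """Fold a symbolic property value for loose comparison.

    Illustrative helper (hypothetical name): drops whitespace,
    underscores, and hyphens, lowercases the result, then removes
    an initial "is" prefix if one is present.
    """
    key = re.sub(r"[\s_\-]", "", value).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """Return True when two values are equal under the folding above."""
    return loose_key(a) == loose_key(b)


# The Blocks.txt spelling and the PropertyValueAliases.txt spelling
# compare equal once both are folded.
print(loose_match("Cyrillic Supplement", "Cyrillic_Supplement"))
```

Under this folding, neither the spaced form nor the underscored form is privileged; both are just spellings that fold to the same key.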

For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an API. The property values used in
PropertyValueAliases.txt, on the other hand, are turned into forms that
are more identifier friendly, as the typical context of use for those
values is in regular expressions and the like.

There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in their namespace, given the application of that
matching rule.
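One way to picture that uniqueness guarantee is as a check that no two distinct values in a namespace fold to the same key. The Python sketch below is only an illustration under that reading: `find_collisions` and `loose_key` are hypothetical names, the sample list is hand-written rather than read from Blocks.txt, and this is not how the data files are actually validated.

```python
import re
from collections import defaultdict


def loose_key(value: str) -> str:
    """Illustrative folding: ignore case, whitespace, '_', '-', and an initial 'is'."""
    key = re.sub(r"[\s_\-]", "", value).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


def find_collisions(values):
    """Group values by folded key and return only the keys shared by
    two or more distinct spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[loose_key(value)].add(value)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hand-written sample for illustration; a real check would read every
# value of one property from the data files.
sample = ["Cyrillic", "Cyrillic Supplement", "Latin-1 Supplement"]
print(find_collisions(sample))

# Adding a second spelling of an existing value is reported as a collision,
# which is what the invariant rules out for genuinely distinct values.
print(find_collisions(sample + ["cyrillic_supplement"]))
```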

Received on Thu May 26 2016 - 11:03:46 CDT
