RE: Just an observation

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Mon, 5 Aug 2013 22:22:37 +0000

Steffen Daode Nurpmeso observed:

> Hello, in UAX #44 I read
>
> Simple_Titlecase_Mapping ...
> Note: If this field is null, then the Simple_Titlecase_Mapping
> is the same as the Simple_Uppercase_Mapping for this character.
>
> So a parser has to be aware of this, automatically falling back to
> the uppercase mapping (index 12) when there is no explicit
> titlecase mapping (index 14).
>
> Given this the following surprised me:
>
> ?0[steffen_at_sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
> {if (length($15) && $15 = $13) print}' |wc -l
> 1051
> ?0[steffen_at_sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
> {if (length($15) && $15 != $13) print}' |wc -l
> 12
>
> (I.e., 1051 times the redundant mapping is defined.)

Prior to Unicode 5.2, the relevant documentation (in UCD.html) used
to say:

The simple titlecase may be omitted in the data file if the titlecase is the
same as the uppercase.

Someone correctly pointed out that this statement was ambiguous.
It was replaced by the current note, which is both correct and states
the intention of the simple titlecase mapping: it is equivalent to the
simple uppercase mapping unless an explicit, different value appears
in the field (the 12 cases you noted).
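
As an illustration of the fallback the note describes, here is a minimal
Python sketch. The function name and the sample lines are assumptions
made for this example (the second sample line has its titlecase field
deliberately emptied to exercise the fallback); the field positions 12,
13 and 14 are the zero-based indexes mentioned in the quoted message.

```python
def simple_case_mappings(line):
    """Parse one UnicodeData.txt-style line (hypothetical helper).

    Returns (code point, upper, lower, title) as integers, with None
    for a null uppercase or lowercase field.
    """
    fields = line.rstrip("\n").split(";")

    def to_cp(field):
        # A null (empty) field means "no explicit mapping".
        return int(field, 16) if field else None

    code = int(fields[0], 16)
    upper = to_cp(fields[12])   # Simple_Uppercase_Mapping
    lower = to_cp(fields[13])   # Simple_Lowercase_Mapping
    title = to_cp(fields[14])   # Simple_Titlecase_Mapping
    if title is None:
        # Null titlecase field: same as the simple uppercase mapping.
        title = upper
    return code, upper, lower, title


# Sample lines in the UnicodeData.txt layout, for illustration only.
explicit = "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5"
emptied = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"

print(simple_case_mappings(explicit))
print(simple_case_mappings(emptied))
```

A parser written this way gives the same result whether or not the data
file spells out a titlecase value that equals the uppercase value.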

The redundant titlecase mapping values were not *removed* from
the data file, as there was a significant chance that doing so would
disrupt parsers that had long relied on conventions expecting
explicit values in the field.
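
To see how many explicit titlecase values are redundant in a given copy
of the data file, a count along the following lines could be used. This
is a hedged sketch, not a reference tool: the function name and the
three sample lines are assumptions for the example. Note that it
compares the fields with `==`; in awk, as in Python, a single `=` is an
assignment rather than a comparison.

```python
from collections import Counter


def titlecase_field_stats(lines):
    """Classify the titlecase field of each line (hypothetical helper)."""
    stats = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 15:
            continue  # skip anything that is not a full 15-field record
        upper, title = fields[12], fields[14]
        if not title:
            stats["null"] += 1        # falls back to the uppercase mapping
        elif title == upper:
            stats["redundant"] += 1   # explicit, but same as uppercase
        else:
            stats["distinct"] += 1    # explicit and different
    return stats


# Sample lines in the UnicodeData.txt layout, for illustration only.
sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5",
]
stats = titlecase_field_stats(sample)
print(dict(stats))
```

Assuming a local copy of the file, the same function can be fed
`open("UnicodeData.txt", encoding="utf-8")` instead of the sample list.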

--Ken
Received on Mon Aug 05 2013 - 17:25:58 CDT
