Re: Just an observation from Steffen on 2013-08-06 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Tue, 06 Aug 2013 12:31:36 +0200

"Whistler, Ken" <ken.whistler_at_sap.com> wrote:
|Steffen Daode Nurpmeso observed:
|> Hello, in UAX #44 i read
|>
|> Simple_Titlecase_Mapping ...
|> Note: If this field is null, then the Simple_Titlecase_Mapping
|> is the same as the Simple_Uppercase_Mapping for this character.
|>
|> So a parser has to be aware of this, automatically falling back to
|> the uppercase mapping (index 12) when there is no explicit
|> titlecase mapping (index 14).
|>
|> Given this the following surprised me:
|>
|> ?0[steffen_at_sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
|> {if (length($15) && $15 == $13) print}' |wc -l
|> 1051
|> ?0[steffen_at_sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
|> {if (length($15) && $15 != $13) print}' |wc -l
|> 12
|>
|> (I.e., 1051 times the redundant mapping is defined.)
|
|Prior to Unicode 5.2, the relevant documentation (in UCD.html) used
|to say:
|
|The simple titlecase may be omitted in the data file if the titlecase is the
|same as the uppercase.

This is interesting -- in [1] `Simple_Uppercase_Mapping' had
a note stating

Note: The simple uppercase is omitted in the data file if the
uppercase is the same as the code point itself.

[1] <http://www.unicode.org/Public/5.1.0/ucd/UCD.html>

Similar for `Simple_Lowercase_Mapping'.
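
To make the combined defaulting rules concrete, here is a minimal
sketch of how a parser might resolve the three simple case mappings
of one UnicodeData.txt line. The function name `simple_case_mappings'
is made up for illustration; the field indices (12, 13, 14) are the
ones discussed in this thread, and the sample lines are the entries
for U+0041 and U+01C6 as i know them from UnicodeData.txt.

```python
def simple_case_mappings(line):
    """Return (code point, upper, lower, title) for one UnicodeData.txt line.

    Hypothetical helper, not taken from any existing parser.
    """
    fields = line.rstrip("\n").split(";")
    cp = int(fields[0], 16)

    def mapped(index, default):
        # An empty (null) field means "use the default".
        return int(fields[index], 16) if fields[index] else default

    upper = mapped(12, cp)     # null -> the code point itself
    lower = mapped(13, cp)     # null -> the code point itself
    title = mapped(14, upper)  # null -> same as the simple uppercase
    return cp, upper, lower, title


# U+0041: uppercase and titlecase fields are empty.
CAPITAL_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
# U+01C6: one of the characters whose titlecase differs from its uppercase.
SMALL_DZ = ("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;"
            ";;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5")
```

With these rules a parser gets the same result whether or not the
redundant titlecase value is present in the data file.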

|Someone correctly pointed out that that statement was ambiguous.
|It was corrected to the current note, which is both correct and states
|the intention of the simple titlecase mapping: that it be equivalent
|to the simple uppercase mapping unless it isn't, in which case a different
|explicit value will be in the field (the 12 cases you noted).
|
|The redundant titlecase mapping values were not *removed* from
|the data file, as there was a significant chance that that would disrupt
|parsers which had long been using conventions which expected
|explicit values in the field.
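
For reference, the two awk counts quoted above can be sketched in
one pass as follows. This is only an illustration; the function name
`count_explicit_titlecase' is made up, and the comparison is a plain
string comparison of field 14 against field 12.

```python
def count_explicit_titlecase(lines):
    """Count explicit titlecase mappings in UnicodeData.txt lines.

    Returns (redundant, distinct): how often a non-empty field 14 equals
    field 12, and how often it differs. Hypothetical helper.
    """
    redundant = distinct = 0
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 15 or not fields[14]:
            continue  # no explicit titlecase mapping on this line
        if fields[14] == fields[12]:
            redundant += 1
        else:
            distinct += 1
    return redundant, distinct


SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;"
    ";;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5",
]
```

Passing an open UnicodeData.txt file object instead of SAMPLE should
give the two numbers for the whole file in one go.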

That is what i assumed to be the reason they are still there, without
knowing the history you have pointed out -- i became a bit curious.
Interestingly, for Unicode 3.2 ([2]) the titlecase is also defined as

  Note: This field is omitted if the titlecase is the same as field
  12. For full case mappings, see UAX #21 Case Mappings and
  SpecialCasing.txt.

[2] <http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html>

For 3.0 ([3]) no such constraint is defined at all, for any of
the three case mappings.

[3] <http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html>

Hmm. To me, this raises the question why these constraints were
introduced at all. Imho either one adds constraints due to solid
considerations, and enforces them after some period of backward
compatibility, or there simply should be no constraints.

There are parsers (i know of one) which use *only* UnicodeData.txt
for generating tables (using patterns like `SPACE' etc. to join
characters into sets; which seems to have been common practice in
the past -- as in [3], „Case Mappings“: „derivable from the
presence of the terms "CAPITAL" or "SMALL" in the character
name“).

If UnicodeData.txt content does not already come with such an
extensive backward compatibility guarantee today, then that should
be noted somewhere (i would not know where that is the case?).
Otherwise it cannot be that labour-intensive to drop these
constraints again, since nothing would have to be done at all?
I.e., are these parsers already broken today?
Just curious…

|--Ken

--steffen
Received on Tue Aug 06 2013 - 05:39:03 CDT