Re: UCD.html and simple titlecase

From: Markus Scherer (markus.icu@gmail.com)
Date: Tue Jan 20 2009 - 13:35:44 CST

    I have now read this email and the related documentation a couple of
    times, and I have to agree with Martin about the description of the field.
    The latest Proposed Update of UAX #44 says "Note: The simple titlecase may
    be omitted in the data file if the titlecase is the same as the uppercase."
    To me, as to Martin, this means that if UnicodeData.txt field 14 is empty,
    the Simple_Titlecase_Mapping value is the same as
    the Simple_Uppercase_Mapping value -- not necessarily the same as the code
    point itself. It is easy to read this note and overlook the shorthand for
    the default value in the first column, or to be confused by what looks like
    conflicting instructions.

    I think part of the problem is that this note in UAX #44 reads like an
    instruction to someone _constructing_ the data file. A note on how to
    _read_ the data file would be clearer to users of the UCD. In this
    case, the note could read "If the simple titlecase value is omitted, then
    the value is the same as the simple uppercase value." If this is not true,
    then the note should be removed.

    More generally, PU-UAX #44 4.2.8 Default Values already covers all of the
    other cases of omitted mapping values: "For string properties, including the
    definition of foldings, the default value is the code point of the character
    itself." For the purpose of _reading_ UnicodeData.txt, this makes the notes
    on fields 12 & 13 (uppercase & lowercase) redundant.
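
    A minimal sketch of that reading rule, assuming a reader that treats every
    empty simple case mapping field as "the code point itself". The function
    name simple_case_mappings is hypothetical, and the sample record is the
    U+01C5 line as quoted further down in this thread, with field 14 empty.

```python
# Hypothetical helper, not part of any Unicode tooling: read the simple case
# mapping fields of one UnicodeData.txt record, applying the default from
# PU-UAX #44 section 4.2.8 (an empty field means the code point itself).

UPPER, LOWER, TITLE = 12, 13, 14  # zero-based field numbers in UnicodeData.txt


def simple_case_mappings(record):
    """Return (code_point, upper, lower, title) as integers."""
    fields = record.strip().split(";")
    code_point = int(fields[0], 16)

    def value(index):
        # An empty field falls back to the code point, never to another field.
        text = fields[index]
        return int(text, 16) if text else code_point

    return code_point, value(UPPER), value(LOWER), value(TITLE)


# The U+01C5 record as quoted in this thread, with field 14 left empty.
SAMPLE = (
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
    "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;"
)

cp, upper, lower, title = simple_case_mappings(SAMPLE)
print(f"U+{cp:04X}: upper=U+{upper:04X} lower=U+{lower:04X} title=U+{title:04X}")
```

    Under this rule the empty titlecase field resolves to U+01C5, not to the
    uppercase value U+01C4.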

    For clarity, I think it would be best to remove the notes for the simple
    case mapping fields, and to make sure that the values are listed explicitly
    wherever an omitted value might be confusing. This leaves the documentation
    of the default values to the general comment in 4.2.8 and to the first
    column of the table about UnicodeData.txt.

    Where I disagree with Martin is on the contents of UnicodeData.txt. As far
    as I can tell, the titlecase value for U+01C5 was originally omitted, but it
    has been set explicitly to 01C5 since UnicodeData-4.0.0.txt. This means that
    the correct value has been listed explicitly for the last six years.

    Compare the entry for U+01C5 in the following files:
    http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
    http://www.unicode.org/Public/4.0-Update/UnicodeData-4.0.0.txt
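
    One possible way to make that comparison mechanical, sketched under the
    assumption that each file has been downloaded and read into a string. The
    helper name raw_titlecase_field is hypothetical, and the two sample strings
    are illustrative stand-ins built from the record quoted in this thread,
    not copies of the files above.

```python
# Hypothetical helper for comparing one record across UnicodeData.txt versions.
# It returns field 14 exactly as written, so an omitted value shows up as "".


def raw_titlecase_field(unicode_data_text, code_point):
    """Return the raw field 14 string for code_point, or None if not found."""
    wanted = f"{code_point:04X}"
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        if fields[0] == wanted and len(fields) > 14:
            return fields[14]
    return None


# Stand-in data (an assumption for illustration): the U+01C5 record with the
# titlecase field empty, and the same record with 01C5 listed explicitly.
OMITTED = (
    "01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;"
    "<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;"
)
EXPLICIT = OMITTED + "01C5"

print(repr(raw_titlecase_field(OMITTED, 0x01C5)))
print(repr(raw_titlecase_field(EXPLICIT, 0x01C5)))
```

    To check the real files, pass the full text of each download in place of
    the stand-in strings.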

    Best regards,
    markus

    On Mon, Jan 19, 2009 at 12:04 PM, Kenneth Whistler <kenw@sybase.com> wrote:

    > Martin v. Löwis noted:
    >
    > > Currently, UCD.html says about Simple_Titlecase_Mapping
    > >
    > > Note: The simple titlecase may be omitted in the data file if the
    > > titlecase is the same as the uppercase.
    > >
    > > I think this note disagrees with the current UnicodeData.txt.
    > >
    > > For example, UnicodeData has
    > >
    > > 01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;
    > >
    > > So we have:
    > > - upper case: U+01C4
    > > - lower case: U+01C6
    > > - title case: omitted, hence the same as uppercase, hence U+01C4
    >
    > That inference is incorrect. The Simple_Titlecase_Mapping of
    > U+01C5 is U+01C5.
    >
    > Please note the convention for default values: (<code point>),
    > listed at the property itself. That means that if a value is
    > not present, the code point itself is taken as the value of
    > the property for that entry.
    >
    > >
    > > I think this is surprising: U+01C5 is already a titlecase letter,
    > > so its simple titlecase should be U+01C5.
    >
    > It is.
    >
    > >
    > > To fix this, I think one would either have to
    > > a) change UCD.html, to adjust the Note to
    > > The simple titlecase is omitted in the data file if the titlecase is
    > > the same as the code point itself,
    > > or
    >
    > There was a subtle change in the documentation for
    > Simple_Uppercase_Mapping, Simple_Lowercase_Mapping, and
    > Simple_Titlecase_Mapping between Unicode 5.0 and Unicode 5.1.
    > The UCD.html documentation used to say "may be omitted" in the
    > note for all three properties. The problem was that it is
    > *always* omitted for the Simple_Uppercase_Mapping and
    > Simple_Lowercase_Mapping, but the same is not true of
    > Simple_Titlecase_Mapping, because of the existence of the
    > compatibility titlecase letters in the standard.
    > So for Simple_Uppercase_Mapping and Simple_Lowercase_Mapping,
    > the UCD.html for Unicode 5.1 was updated: "may be omitted" -->
    > "is omitted". The text in the note for Simple_Titlecase_Mapping
    > was left as it was.
    >
    > > b) change UnicodeData.txt to explicitly list the titlecase mapping
    > > for titlecase characters as the character itself.
    >
    > I don't think that would help, because the value is already
    > correct.
    >
    > What might help would be updating the text of the note
    > in the Proposed Update for UAX #44 (which is superseding
    > UCD.html) in the future.
    >
    > --Ken
