Re: UnicodeData.txt is invalid, flawed, broken, corrupt and wrong

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jun 11 2005 - 15:03:48 CDT

    ----- Original Message -----
    From: "Theodore H. Smith" <delete@elfdata.com>
    To: "ecartis" <unicode@unicode.org>
    Sent: Saturday, June 11, 2005 9:27 PM
    Subject: UnicodeData.txt is invalid, flawed, broken, corrupt and wrong

    >
    > No one from the official Unicode.org company replied to me last time, so
    > I'll try again.
    >
    > Why is it that the entry for Kelvin (a measurement of temperature), has a
    > decomposition, which is listed as a canonical decomposition, to the
    > standard ASCII "K"?
    >
    > This decomposition is actually a compatibility decomposition.
    >
    > How does this cause me problems? I've written a parser for
    > UnicodeData.txt. This parser will extract data for decomposition, and for
    > composition also.
    >
    > Because Kelvin canonically decomposes to K, it follows that K
    > cannonically composes to Kelvin! :o(
    >
    > So my composer will change a word like this: "Kitchen", into "(Kelvin)
    > itchen". Which is just totally wrong. All because UnicodeData.txt is
    > broken.

    Completely wrong!
    "Kitchen" will remain "Kitchen" in all normalization forms.
    Only "+265(Kelvin)" will become "+265K".
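
    This can be checked with any conformant normalizer. A minimal sketch,
    assuming Python's standard unicodedata module as the normalizer (the
    variable names are only illustrative):

```python
import unicodedata

KELVIN = "\u212A"  # U+212A KELVIN SIGN

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# An ordinary "K" is never turned into the Kelvin sign by any form.
kitchen = {form: unicodedata.normalize(form, "Kitchen") for form in FORMS}

# The Kelvin sign, on the other hand, is replaced by "K" in every form.
temperature = {form: unicodedata.normalize(form, "+265" + KELVIN) for form in FORMS}

for form in FORMS:
    print(form, kitchen[form], temperature[form])
```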

    The Kelvin symbol is a compatibility character (encoded only for round-trip
    compatibility with legacy encodings) and normalizes to the ordinary letter K.
    Because of its status as a compatibility character, its use is already
    discouraged.

    Needless to say, your subject line is extremely disrespectful. Repeat it
    and all you'll get is another notice from the Unicode list moderator,
    and maybe private insults; but you won't get more help from others. Your
    first introduction to this list is really a failure: ask for information,
    but please don't insult people just because you don't understand something.

    UnicodeData.txt is NOT invalid, NOT flawed, NOT broken, NOT corrupt, and NOT
    wrong, at least not for the Kelvin symbol you mention.

    There may be issues for some languages, in rare characters, but the case of
    the Kelvin symbol is well known and has long been understood: its canonical
    "decomposition mapping" is a "singleton", i.e. a mapping to a single
    character rather than to a sequence that could be composed again.
    Singleton decomposition mappings are NOT "recomposable".
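
    For a parser like yours, this means that singletons must be skipped when
    the composition table is built. A rough sketch, assuming Python's
    unicodedata module stands in for a parsed UnicodeData.txt; classify() and
    build_composition_table() are hypothetical helper names, and the sketch
    leaves out the other exclusions that the full algorithm requires
    (CompositionExclusions.txt and non-starter decompositions):

```python
import unicodedata

def classify(ch):
    """Classify the decomposition mapping field of one character."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none", ()
    parts = field.split()
    if parts[0].startswith("<"):
        # A tag such as <compat> marks a compatibility mapping.
        return "compatibility", tuple(chr(int(p, 16)) for p in parts[1:])
    targets = tuple(chr(int(p, 16)) for p in parts)
    # An untagged mapping is canonical: one code point (singleton) or two.
    kind = "singleton" if len(targets) == 1 else "canonical pair"
    return kind, targets

def build_composition_table(chars):
    """Map (first, second) -> composite, using canonical pairs only."""
    table = {}
    for ch in chars:
        kind, targets = classify(ch)
        if kind == "canonical pair":
            table[targets] = ch
    return table

# KELVIN SIGN, ANGSTROM SIGN, A WITH RING ABOVE, MICRO SIGN
sample = ["\u212A", "\u212B", "\u00C5", "\u00B5"]
table = build_composition_table(sample)

for ch in sample:
    print(f"U+{ord(ch):04X}", classify(ch)[0])
```

    With such a table, "K" has no entry at all, so a composer can never turn
    it into the Kelvin sign.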

    So please reread the specs, notably the Unicode Standard and its annex
    UAX #15 (Unicode Normalization Forms), which fully documents the
    normalization forms.


