Re: UnicodeData.txt problem

From: Kenneth Whistler
Date: Fri Dec 09 2005 - 14:12:15 CST

    This really isn't the forum to discuss all these issues (which really
    belong in the realm of the UTC per se), but...

    > What about asking the users (i.e. developers) whether they'd like to
    > see a redesign of the UCD data files. I find the current structure a
    > real PITA. Why not simply create one data file for each property
    > and in the header of each data file a description of that property.
    > I vote YES.

    Do *you* know how many properties there are? I sure don't. The UTC
    is arguing the edge cases all the time. And while it may seem
    obvious what counts as a property, in many cases it isn't. Even if
    the 80 or so properties extractable from PropertyAliases.txt could
    all be unambiguously identified, there are another 80 or so lurking
    in Unihan.txt, each with a different status.

    > You could even create a double set of data files: a new reorganized
    > set of data files, and a set for backwards compatibility (extracted
    > from the new set).

    Yep, but that creates another mountain of maintenance work for
    each release.

    > Suppose a new binary
    > property Pattern_Filename (whatever) is invented and data added to
    > PropList.txt. Now I'm not interested in using the new property in my
    > software, but I still need to adapt my PropList.txt parser in order
    > to cope with the added property. If on the other hand the data for
    > the new property had been put in a new property specific file, I
    > wouldn't have to change my code at all!

    You can't know that, actually, because a priori you can't predict
    what kinds of properties might be defined and whether they would
    fit into some neat category that your parser was already handling.
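
    To make the disagreement concrete, here is a minimal sketch of a
    PropList.txt-style parser that groups ranges by whatever property
    name appears on each line. The sample lines and the function name
    parse_proplist are assumptions for illustration, and
    Pattern_Filename is the made-up property from the quoted message.
    A parser written this way would not need changes for a new binary
    property in the same file, but only as long as every line keeps the
    two-field "range ; name" shape.

```python
from collections import defaultdict


def parse_proplist(lines):
    """Group code point ranges by property name.

    Assumes each data line has the form 'XXXX[..YYYY] ; Name # comment'.
    Any other shape is rejected rather than guessed at.
    """
    props = defaultdict(list)
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue  # blank or comment-only line
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 2:
            raise ValueError(f"not a binary-property line: {raw!r}")
        code_range, name = fields
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props[name].append((start, end))
    return dict(props)


# Hypothetical excerpt; the last line uses the invented property name.
SAMPLE = """\
# sample data in the PropList.txt layout
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
002D          ; Dash # Pd       HYPHEN-MINUS
E000..E001    ; Pattern_Filename # invented for illustration
"""

props = parse_proplist(SAMPLE.splitlines())
print(sorted(props))
print(props["White_Space"])
```

    The unknown name is simply carried along as another key. A property
    whose values are not plain yes/no would hit the ValueError branch,
    which is the case the reply above is warning about.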

    > Also parsers would be much
    > simpler and it would therefore be easier to add a new parser for
    > a new property. One could even create parsers mechanically.

    If you think every character property is simply another binary
    property, then sure.... but it ain't that easy, I assure you.
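
    As an illustration of why not everything is a binary property, the
    sketch below reads two UnicodeData.txt-style lines and converts a
    few fields to the kinds of values they carry: an enumerated
    category, an integer, a tagged sequence of code points, a rational
    number, a boolean, and a single code point. The function name and
    dictionary keys are assumptions for this example, and only some of
    the 15 fields are decoded.

```python
from fractions import Fraction


def parse_unicodedata_line(line):
    """Decode selected fields of one semicolon-separated UnicodeData.txt line."""
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")

    # Field 5: optional '<tag>' followed by hex code points.
    tag, mapping = None, ()
    if f[5]:
        parts = f[5].split()
        if parts[0].startswith("<"):
            tag, parts = parts[0].strip("<>"), parts[1:]
        mapping = tuple(int(p, 16) for p in parts)

    return {
        "code": int(f[0], 16),
        "name": f[1],                                # string
        "general_category": f[2],                    # enumerated
        "combining_class": int(f[3]),                # small integer
        "decomposition_tag": tag,                    # enumerated or None
        "decomposition": mapping,                    # code point sequence
        "numeric_value": Fraction(f[8]) if f[8] else None,  # rational
        "bidi_mirrored": f[9] == "Y",                # binary
        "simple_lowercase": int(f[13], 16) if f[13] else None,
    }


HALF = ("00BD;VULGAR FRACTION ONE HALF;No;0;ON;"
        "<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;")
CAP_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

half = parse_unicodedata_line(HALF)
cap_a = parse_unicodedata_line(CAP_A)
print(half["numeric_value"], half["decomposition_tag"], half["decomposition"])
print(hex(cap_a["simple_lowercase"]))
```

    Each of these value types would need its own file layout and its
    own parsing rule in a one-file-per-property scheme, so a purely
    mechanical parser generator only covers the simplest cases.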

    > If we look at the future, say in ten or twenty years time, do you or
    > the Unicode organization believe the UCD data files will still be
    > exactly structured/formatted as they are now?

    Actually, no. There is a substantial effort underway to create
    a complete XML description of the character properties. It is
    almost complete for the non-Unihan part of the problem, but the
    effort has turned up some of the problems I have been alluding
    to. And even when a candidate XML description is in place, it
    still needs to be verified by putting in place the scripts that
    will guarantee that the XML description is equivalent to the
    existing hodgepodge of definition files.
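
    A rough sketch of what such a verification script might look like
    for a single property follows. The XML shape used here (one <char>
    element per code point with "cp" and "gc" attributes) is an
    assumption for illustration, not a description of the actual
    candidate format, and the function names are hypothetical. Range
    entries in UnicodeData.txt are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def gc_from_text(lines):
    """Map code point -> General_Category from UnicodeData.txt-style lines."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        result[int(fields[0], 16)] = fields[2]
    return result


def gc_from_xml(xml_text):
    """Map code point -> General_Category from the assumed XML shape."""
    root = ET.fromstring(xml_text)
    return {int(e.get("cp"), 16): e.get("gc") for e in root.iter("char")}


def mismatches(a, b):
    """Code points missing on one side or carrying different values."""
    return sorted(cp for cp in a.keys() | b.keys() if a.get(cp) != b.get(cp))


TEXT = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

# Deliberately wrong value for U+0061 so the check has something to report.
XML = """\
<ucd>
  <char cp="0041" gc="Lu"/>
  <char cp="0061" gc="Lu"/>
</ucd>
"""

bad = mismatches(gc_from_text(TEXT.splitlines()), gc_from_xml(XML))
print([f"U+{cp:04X}" for cp in bad])
```

    A real check would have to repeat this for every property and every
    source file, which is where the differences in value types and file
    layouts start to matter.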


    > Best regards,
    > Theo
