Re: UnicodeData.txt problem

From: Theo Veenker
Date: Fri Dec 09 2005 - 16:10:32 CST

    Kenneth Whistler wrote:
    > This really isn't the forum to discuss all these issues (which really
    > belong in the realm of the UTC per se), but...
    >>What about asking the users (i.e. developers) whether they'd like to
    >>see a redesign of the UCD data files. I find the current structure a
    >>real PITA. Why not simply create one data file for each property
    >>and in the header of each data file a description of that property.
    >>I vote YES.
    > Do *you* know how many properties there are? I sure don't. The UTC
    > is arguing the edge cases all the time. And while it may seem
    > obvious, in many cases it isn't. Even if the approximately 80 some
    > properties extractable from PropertyAliases.txt could all be
    > unambiguously identified, there are another 80 or so lurking
    > in Unihan.txt, with all different statuses.

    I know how many there are (and I support them all, except Unihan for
    now), but I don't see why it matters how many there are.

    >>You could even create a double set of data files: a new reorganized
    >>set of data files, and a set for backwards compatibility (extracted
    >>from the new set).
    > Yep, but that creates another mountain of maintenance work for
    > each release.

    Yeah, you're right. You have to type the "extract-old-format-files"
    command on each release.
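
    A minimal sketch of what such an extraction step could look like,
    assuming the new per-property data has already been loaded as a
    mapping from property name to inclusive code point ranges. The
    function name extract_proplist and the input shape are hypothetical,
    and the column layout only approximates PropList.txt:

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (first, last) code points


def extract_proplist(per_property: Dict[str, List[Range]]) -> str:
    """Merge per-property range lists into PropList.txt-style lines.

    Hypothetical helper: the input shape is an assumption, not an
    actual UCD tool or format.
    """
    lines = []
    for name in sorted(per_property):
        for first, last in sorted(per_property[name]):
            # A single code point is written alone, a run as FIRST..LAST.
            if first == last:
                rng = f"{first:04X}"
            else:
                rng = f"{first:04X}..{last:04X}"
            # Pad the range field so the ';' separators line up.
            lines.append(f"{rng:<14}; {name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(extract_proplist({"White_Space": [(0x09, 0x0D), (0x20, 0x20)]}), end="")
```

    The real maintenance cost lies in keeping comments, ordering and
    headers identical to the old files, which this sketch ignores.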

    >>Suppose a new binary
    >>property Pattern_Filename (whatever) is invented and data added to
    >>PropList.txt. Now I'm not interested in using the new property in my
    >>software, but I still need to adapt my PropList.txt parser in order
    >>to cope with the added property. If on the other hand the data for
    >>the new property had been put in a new property specific file, I
    >>wouldn't have to change my code at all!
    > You can't know that, actually, because a priori you can't predict
    > what kinds of properties might be defined and whether they would
    > fit into some neat category that your parser was already handling.
    >>Also parsers would be much
    >>simpler and it would therefore be easier to add a new parser for
    >>a new property. One could even create parsers mechanically.
    > If you think every character property is simply another binary
    > property, then sure.... but it ain't that easy, I assure you.

    I beg to differ. It is very doable. You only need to handle four
    types of files: binary properties, integer/enumerated properties,
    simple mappings, and string mappings.
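
    To illustrate the four file types, here is a rough sketch with one
    shared line reader and one small parser per type. The per-property
    file layout assumed here (a code point or FIRST..LAST range,
    optionally followed by "; value", with "#" starting a comment) is
    modelled on the existing UCD files but is an assumption, as are all
    function names:

```python
from typing import Dict, Iterator, List, Set, Tuple


def parse_lines(text: str) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, remaining_fields) for every data line."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        yield lo, hi, fields[1:]


def parse_binary(text: str) -> Set[int]:
    """Binary property: the set of code points that have it."""
    return {cp for lo, hi, _ in parse_lines(text) for cp in range(lo, hi + 1)}


def parse_enumerated(text: str) -> Dict[int, str]:
    """Integer/enumerated property: code point -> value token."""
    return {cp: rest[0]
            for lo, hi, rest in parse_lines(text)
            for cp in range(lo, hi + 1)}


def parse_simple_mapping(text: str) -> Dict[int, int]:
    """Simple mapping: code point -> single code point."""
    return {cp: int(rest[0], 16)
            for lo, hi, rest in parse_lines(text)
            for cp in range(lo, hi + 1)}


def parse_string_mapping(text: str) -> Dict[int, str]:
    """String mapping: code point -> sequence of code points."""
    return {cp: "".join(chr(int(h, 16)) for h in rest[0].split())
            for lo, hi, rest in parse_lines(text)
            for cp in range(lo, hi + 1)}


if __name__ == "__main__":
    print(sorted(parse_binary("0009..000D\n0020 # SPACE\n")))
    print(parse_string_mapping("00DF ; 0053 0073\n"))
```

    Integer values are left as strings here; a caller that knows the
    property is numeric can convert them with int().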

    >>If we look at the future, say in ten or twenty years' time, do you or
    >>the Unicode organization believe the UCD data files will still be
    >>exactly structured/formatted as they are now?
    > Actually, no. There is a substantial effort underway to create
    > a complete XML description of the character properties. It is
    > almost complete for the non-Unihan part of the problem, but the
    > effort has turned up some of the problems I have been alluding
    > to. And even when a candidate XML description is in place, it
    > still needs to be verified by putting in place the scripts that
    > will guarantee that the XML description is equivalent to the
    > existing hodgepodge of definition files.

    That's good to know. It's funny, the UCD files can't be reformatted,
    restructured or whatever because it would break existing parsers,
    unless of course the new format is XML ;-)

    When do you expect this to be available? After 5.0, I assume?

