Re: UnicodeData.txt problem

From: Theo Veenker (
Date: Fri Dec 09 2005 - 02:32:48 CST

    Kenneth Whistler wrote:
    > Werner Lemberg asked:
    >>UnicodeData.txt is, as far as I know, the central file describing the
    >>properties of the Unicode characters. As such it is tightly bound to
    >>the corresponding Unicode version, and I wonder why one of the most
    >>important elements, namely a version tag, is missing from this file.
    >>I consider this as a serious problem. Similarly, a copyright notice
    >>together with a license should be included, even if it just points to
    >>a URL holding the complete text.
    > It is a legacy format issue. UnicodeData.txt was the very first
    > of the data files defined for the Unicode Standard -- many years
    > ago. And there are many existing processes that parse it exactly
    > as is. To minimize the problems of compatibility going forward,
    > its format has been frozen for a long time -- and that includes
    > not adapting the comment and version conventions that the other
    > data files have.

    What about asking the users (i.e. developers) whether they'd like to
    see a redesign of the UCD data files? I find the current structure a
    real PITA. Why not simply create one data file for each property,
    with a description of that property in the header of each file?
    I vote YES.

    You could even create a double set of data files: a new reorganized
    set of data files, and a set for backwards compatibility (extracted
    from the new set).

    You're trying to minimize the amount of work developers have to go
    through when they decide to upgrade their software to a new UCD version,
    and that is a good thing. But IMHO holding on to the legacy format
    actually creates more work rather than less. Suppose a new binary
    property Pattern_Filename (whatever) is invented and its data is added
    to PropList.txt. Now I'm not interested in using the new property in my
    software, but I still need to adapt my PropList.txt parser in order
    to cope with the added property. If, on the other hand, the data for
    the new property had been put in a new property-specific file, I
    wouldn't have to change my code at all! Parsers would also be much
    simpler, so it would be easier to add a new parser for a new
    property. One could even create parsers mechanically.
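
    To make the "simpler parsers" point concrete, here is a minimal
    sketch in Python. It assumes a hypothetical one-property-per-file
    layout that reuses the line syntax PropList.txt already has (code
    point or range, a semicolon, a value, an optional # comment). The
    function name and the sample data are illustrative only, not part
    of any existing UCD tooling.

```python
def parse_property_file(lines):
    """Yield (first, last, value) tuples for each data line.

    Assumed line syntax: "XXXX[..YYYY] ; value # comment".
    """
    for raw in lines:
        # Drop the trailing comment, then skip blank and comment-only lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        # A single code point has no ".." part, so last falls back to first.
        first, _, last = fields[0].partition("..")
        value = fields[1] if len(fields) > 1 else None
        yield int(first, 16), int(last or first, 16), value


# Hypothetical contents of a file holding one property only.
SAMPLE = """\
# Hypothetical single-property file (layout assumed, modelled on PropList.txt)
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
"""

ranges = list(parse_property_file(SAMPLE.splitlines()))
```

    With one property per file, the same function would serve every
    file unchanged, and a file for a property I don't use would simply
    never be opened.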

    If we look at the future, say in ten or twenty years' time, do you or
    the Unicode organization believe the UCD data files will still be
    structured and formatted exactly as they are now?

    Best regards,
