Re: UnicodeData.txt problem

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Dec 09 2005 - 14:12:15 CST

Next message: Richard Wordingham: "Re: UnicodeData.txt problem"

Previous message: Tom Emerson: "Re: UnicodeData.txt problem"
Maybe in reply to: Werner LEMBERG: "UnicodeData.txt problem"
Next in thread: Theo Veenker: "Re: UnicodeData.txt problem"
Reply: Theo Veenker: "Re: UnicodeData.txt problem"

This really isn't the forum to discuss all these issues (which really
belong in the realm of the UTC per se), but...

> What about asking the users (i.e. developers) whether they'd like to
> see a redesign of the UCD data files. I find the current structure a
> real PITA. Why not simply create one data file for each property
> and in the header of each data file a description of that property.
> I vote YES.

Do *you* know how many properties there are? I sure don't. The UTC
argues the edge cases all the time. And while what counts as a
property may seem obvious, in many cases it isn't. Even if the 80 or
so properties extractable from PropertyAliases.txt could all be
unambiguously identified, there are another 80 or so lurking
in Unihan.txt, with a variety of statuses.
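
For readers who want to check the PropertyAliases.txt count themselves,
here is a minimal sketch. It assumes the file's usual layout (one
property per line, aliases separated by ";", "#" starting a comment).
The function name and the three-line sample are illustrative only, and
the count says nothing about the Unihan.txt properties mentioned above.

```python
def parse_property_aliases(text):
    """Map each property's first alias to its remaining aliases.

    Assumes PropertyAliases.txt-style input: 'short ; long [; more]'
    with '#' introducing a comment. Hypothetical helper, not part of
    any Unicode tooling.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue  # blank or comment-only line
        fields = [f.strip() for f in line.split(";")]
        props[fields[0]] = fields[1:]
    return props


# Tiny illustrative sample, not the real file contents.
SAMPLE_ALIASES = """\
# Binary Properties
AHex      ; ASCII_Hex_Digit
Alpha     ; Alphabetic
# Catalog Properties
sc        ; Script
"""

props = parse_property_aliases(SAMPLE_ALIASES)
print(len(props), "properties listed")
```

Run against the real file, len(props) gives the number of listed
properties for that UCD version only; whether each entry is "really" a
distinct property is exactly the edge-case question raised above.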

> You could even create a double set of data files: a new reorganized
> set of data files, and a set for backwards compatibility (extracted
> from the new set).

Yep, but that creates another mountain of maintenance work for
each release.

> Suppose a new binary
> property Pattern_Filename (whatever) is invented and data added to
> PropList.txt. Now I'm not interested in using the new property in my
> software, but I still need to adapt my PropList.txt parser in order
> to cope with the added property. If on the other hand the data for
> the new property had been put in a new property specific file, I
> wouldn't have to change my code at all!

You can't know that, actually, because a priori you can't predict
what kinds of properties might be defined and whether they would
fit into some neat category that your parser was already handling.
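
To make the disagreement concrete, here is a minimal sketch of a
PropList.txt reader that skips properties it was not asked for, so a
newly added binary property needs no code change. It assumes the
"range ; Property_Name # comment" layout that PropList.txt uses. The
function name, the "wanted" parameter, and the sample data are
hypothetical, and Pattern_Filename is the made-up property from the
quoted message. It only works because every entry here is a binary
property in one fixed layout, which is the assumption questioned above.

```python
from collections import defaultdict


def parse_binary_props(text, wanted=None):
    """Collect (first, last) code point ranges per binary property.

    'wanted' is an optional set of property names; anything else is
    skipped. Hypothetical helper for illustration only.
    """
    ranges = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        cps, prop = fields[0], fields[1]
        if wanted is not None and prop not in wanted:
            continue  # unknown or unwanted property: ignore it
        first, _, last = cps.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point
        ranges[prop].append((lo, hi))
    return dict(ranges)


# Illustrative sample; the last line is the hypothetical new property.
SAMPLE_PROPLIST = """\
0009..000D    ; White_Space # Cc   [5] <control>..<control>
0020          ; White_Space # Zs       SPACE
002D          ; Dash # Pd       HYPHEN-MINUS
0041..0046    ; Pattern_Filename # hypothetical
"""

white_space = parse_binary_props(SAMPLE_PROPLIST, wanted={"White_Space"})
print(white_space)
```

A property with string or numeric values, or one stored in a different
field layout, would not fit this reader without changes.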

> Also parsers would be much
> simpler and it would therefore be easier to add a new parser for
> a new property. One could even create parsers mechanically.

If you think every character property is simply another binary
property, then sure.... but it ain't that easy, I assure you.

>
> If we look at the future, say in ten or twenty years time, do you or
> the Unicode organization believe the UCD data files will still be
> exactly structured/formatted as they are now?

Actually, no. There is a substantial effort underway to create
a complete XML description of the character properties. It is
almost complete for the non-Unihan part of the problem, but the
effort has turned up some of the problems I have been alluding
to. And even when a candidate XML description is in place, it
still needs to be verified by putting in place the scripts that
will guarantee that the XML description is equivalent to the
existing hodgepodge of definition files.

--Ken

> Best regards,
> Theo


This archive was generated by hypermail 2.1.5 : Fri Dec 09 2005 - 14:14:02 CST