Re: UCD 3.2.0

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Mon Apr 08 2002 - 04:20:03 EDT


Kenneth Whistler wrote:
>
> Theo Veenker wrote:
>
> > I'd like to make a few remarks about the UCD files.
>
> First of all, while I'd like to thank Theo for going to the
> trouble of checking the data files so carefully, and coming
> up with some genuine errors in the data, I have a couple of
> comments for people who are checking and reporting errors.
>
> 1. The preferred mechanism for reporting errors in data files
> or other errors in the standard is to make use of the
> reporting form on the Unicode website, rather than broadcasting
> email to the open list, in hope that someone will notice and
> take action. Please use:
>
> http://www.unicode.org/unicode/reporting.html
>
> (which you can also find by following the "Contact Us" link
> on the home page)

I apologize. Next time I will use the reporting form. Maybe somebody
should make a note of this in the readme that accompanies the data files.

[snip]

> > o UnicodeData-3.2.0.txt still uses this notation:
> > 1234;<Blah, First>;Lo;0;L;;;;;N;;;;;
> > 5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;
> > instead of
> > 1234..5678;<Blah, First>..<Blah, Last>;Lo;0;L;;;;;N;;;;;
> > Since all other UCD files use the latter notation why not change this
> > one too? IMHO backward compatibility with existing UCD file parsers
> > shouldn't be an issue in this particular case.
>
> It is an issue for some parsers. (And a burden on me, personally,
> to fix them, since some of them are used in utilities which maintain
> other parts of the Unicode Standard, or the Unicode Collation Algorithm.)
> And we don't know how many other old parsers would blow up if we
> just changed it. The UTC decided to leave it alone for now -- although
> it might modify it in the future.

I know it would break current parsers (although a new parser implementation
would actually be a tiny bit simpler), and I won't lose sleep if it is kept
the way it is, BUT:

In UnicodeCharacterDatabase.html where the UCD File Format is described it
says: "Files in the UCD use the following format, unless otherwise specified."
What is the point of giving a detailed description of the format if the
phrase "unless otherwise specified" is required? It makes the description
rather useless. Because of this clause, which as far as I can tell is only
required because of the historic notation used in the main UCD data file,
I cannot assume that the format of a particular UCD file will stay the same
across releases. Well... I can assume so, but not rely on it. Anyway, I hope
the UTC will one day decide to use exactly the same format for all UCD files
instead of more or less the same one.
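
To make the two notations discussed above concrete, here is a minimal Python
sketch. It is an illustration only, not code from any existing UCD tool: the
function names parse_range_field and parse_unicode_data are hypothetical, and
the sample lines reuse the placeholder "Blah" entries quoted earlier. The
first function handles the "XXXX..YYYY" notation used by the other UCD files;
the second folds the <Name, First> / <Name, Last> pairs of UnicodeData.txt
into a single range, which needs extra state and is the reason the
range-based parser would be slightly simpler.

```python
import re

# Matches the "<Name, First>" / "<Name, Last>" pseudo-names in UnicodeData.txt.
_RANGE_NAME = re.compile(r"^<(.+), (First|Last)>$")


def parse_range_field(field):
    """Parse 'XXXX' or 'XXXX..YYYY' (the notation of the other UCD files)."""
    first, sep, last = field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if sep else start
    return start, end


def parse_unicode_data(lines):
    """Yield (first, last, name, other_fields) for UnicodeData-style lines.

    A First/Last pair is folded into one range; ordinary lines yield a
    range of a single code point.
    """
    pending = None  # (start code point, base name) of an open "First" line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        match = _RANGE_NAME.match(fields[1])
        if match and match.group(2) == "First":
            if pending is not None:
                raise ValueError("nested First line at %04X" % code)
            pending = (code, match.group(1))
        elif match and match.group(2) == "Last":
            if pending is None or pending[1] != match.group(1):
                raise ValueError("Last without matching First at %04X" % code)
            yield pending[0], code, match.group(1), fields[2:]
            pending = None
        else:
            if pending is not None:
                raise ValueError("First without Last before %04X" % code)
            yield code, code, fields[1], fields[2:]
    if pending is not None:
        raise ValueError("First without Last at end of input")


sample = [
    "1234;<Blah, First>;Lo;0;L;;;;;N;;;;;",
    "5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;",
]
for first, last, name, _ in parse_unicode_data(sample):
    print("%04X..%04X %s" % (first, last, name))  # prints "1234..5678 Blah"
```

With the range notation, the whole pending/First/Last bookkeeping in
parse_unicode_data would reduce to one call to parse_range_field on the
first field of each line.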

Best regards,
Theo


