RE: Just an observation

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Tue, 6 Aug 2013 18:12:58 +0000

Steffen Daode Nurpmeso continued:

> Hmm. To me, this raises the question why these constraints were
> introduced at all. Imho either one adds constraints due to solid
> considerations, and enforces them after some period of backward
> compatibility, or there simply should be no constraints.

The notes about the case mapping fields in UnicodeData.txt that you
are talking about do not really constitute constraints; rather, they
are attempts to document clearly what the nature of the data is.
The Unicode Consortium does maintain true constraints on various
aspects of the data files: those are generally referred to as the
"stability guarantees" or the stability policy:

http://www.unicode.org/policies/stability_policy.html

See also:

http://www.unicode.org/policies/property_value_stability_table.html

There is no stability policy (yet) regarding the titlecase field in particular,
although there could be, I suppose, if the Unicode Technical Committee
(and the Unicode Consortium officers) decided there was a good enough
reason to add one.

In the meantime, the Unicode Technical Committee also runs various
tests on the UCD for each release checking what are termed
"invariants", to look for possible problems when adding new repertoire
or changing properties for existing characters. Some of those
invariants are the subject of stability policies and *must* be honored
when changing the UCD. Others are simply existing patterns (like
the relationship between the titlecase mapping and the uppercase
mapping) which are checked to look for inadvertent introduction
of bonehead errors in the data.
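
To make the titlecase/uppercase pattern concrete, here is a minimal sketch
of a check of that kind. It is not the test the UTC actually runs. It
assumes the usual semicolon-separated layout of UnicodeData.txt, with the
simple uppercase, lowercase and titlecase mappings in fields 12, 13 and 14,
and it treats an empty titlecase field as "same as uppercase". The sample
lines and the function names are illustrative only.

```python
# Three sample lines in UnicodeData.txt format (illustrative subset).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5
"""


def simple_case_fields(line):
    """Return (code point, upper, lower, title) as hex strings.

    Assumption: an empty upper/lower field means the character maps to
    itself, and an empty title field means "same as uppercase".
    """
    fields = line.split(";")
    cp = fields[0]
    upper = fields[12] or cp
    lower = fields[13] or cp
    title = fields[14] or upper
    return cp, upper, lower, title


def title_differs_from_upper(lines):
    """List code points whose titlecase mapping differs from uppercase."""
    flagged = []
    for line in lines:
        if not line.strip():
            continue
        cp, upper, _lower, title = simple_case_fields(line)
        if title != upper:
            flagged.append(cp)
    return flagged


print(title_differs_from_upper(SAMPLE.splitlines()))
```

A difference flagged this way is not necessarily an error (the DZ digraphs
differ on purpose); the point of such a check is only to surface entries
worth a second look.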

>
> There are parsers (i know of one) which use *only* UnicodeData.txt
> for generating tables (using patterns like `SPACE' etc. to join
> characters into sets; which seems to have been common practice in
> the past -- as in [3], „Case Mappings“: „derivable from the
> presence of the terms "CAPITAL" or "SMALL" in the character
> name“).

That is very bad practice, and should be avoided. The UCD documentation
warns against making assumptions about character properties based
only on character names. It leads to many bad results.
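
A short sketch of why name matching goes wrong, using Python's standard
unicodedata module (which is built from the UCD). The helper
guess_case_from_name is hypothetical and only mimics the "CAPITAL"/"SMALL"
heuristic described above; it is not taken from any real parser.

```python
import unicodedata


def guess_case_from_name(ch):
    """Hypothetical name-based heuristic: the practice to avoid."""
    name = unicodedata.name(ch, "")
    if "CAPITAL" in name:
        return "upper"
    if "SMALL" in name:
        return "lower"
    return None


# U+FE50 SMALL COMMA is punctuation; U+1D00 LATIN LETTER SMALL CAPITAL A
# is a lowercase letter whose name contains both keywords.
for ch in ("\ufe50", "\u1d00"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "| guessed from name:", guess_case_from_name(ch),
        "| General_Category:", unicodedata.category(ch),
    )
```

In both cases the guess from the name disagrees with the General_Category
recorded in the data.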

>
> If there is no such extensive guaranteed backward compatibility
> for UnicodeData.txt content already today then that should be
> noted (i wouldn't know where that is true?), but otherwise it
> cannot be that labour-intensive to drop these constraints again,
> since nothing had to be done at all?
> I.e., are these parsers already broken today?
> Just curious…

Parsers which deduce properties based on character names are
definitely broken -- and that would include any case mapping information.

As regards actual constraints, please refer to the stability policies to
see what the Unicode Consortium officially claims to be required
constraints on data changes.

And if the odd edge cases in parsing the legacy data files (and
UnicodeData.txt is the ur-data file with the most legacy status)
seem problematic, the ultimate fix is just to use the UCD in XML:

http://www.unicode.org/Public/UCD/latest/ucdxml/

which has a fully rationalized and regular structure, well documented
in UAX #42.
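
As a rough sketch of reading the simple case mappings from that format:
the fragment below is hand-written, not copied from a released file, and
the namespace URI, the attribute names (cp, suc, slc, stc) and the use of
"#" for "the code point itself" are given as I understand UAX #42, so
please check them against the specification before relying on them. The
function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the UCD in XML; verify against UAX #42.
NS = {"ucd": "http://www.unicode.org/ns/2003/ucd/1.0"}

# Hand-written fragment imitating the structure of the real files.
FRAGMENT = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"
          suc="#" slc="0061" stc="#"/>
    <char cp="01C6" na="LATIN SMALL LETTER DZ WITH CARON" gc="Ll"
          suc="01C4" slc="#" stc="01C5"/>
  </repertoire>
</ucd>"""


def simple_case_mappings(xml_text):
    """Map each code point to its (upper, lower, title) simple mappings."""
    root = ET.fromstring(xml_text)
    result = {}
    for char in root.iterfind("ucd:repertoire/ucd:char", NS):
        cp = char.get("cp")
        if cp is None:
            # Entries describing a range carry no single "cp" attribute.
            continue

        def resolve(attr):
            value = char.get(attr, "#")
            return cp if value == "#" else value  # "#" means "itself"

        result[cp] = (resolve("suc"), resolve("slc"), resolve("stc"))
    return result


print(simple_case_mappings(FRAGMENT))
```

Every property is an explicit attribute here, so there is no positional
field counting and no "empty means something else" rule to remember.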

--Ken
Received on Tue Aug 06 2013 - 13:17:26 CDT
