Re: Structuring the undefined region

From: Kenneth Whistler
Date: Mon Jul 22 1996 - 13:40:15 EDT


I have to agree with Asmus on this one. The encoding committees
(Unicode or ISO) simply will not operate on a basis of trying
to preallocate codepoints with presumptive properties. There are
too many factors involved in reaching standardization for the
encoding of any script or set of individual characters for such
presumptions to be safe for implementations to depend on into
the future. Even if the Unicode Technical Committee could agree
on, say, a (U+XXXX)+1 rule for future-encoded lowercase pairs,
there would be a near 100% certainty that such a rule would fail
in practice because of some oddity of particular characters proposed
for Latin, Greek, Cyrillic, or other cased scripts. Just one
potential example: encoding a new uppercase form of an already-encoded
lowercase letter (as, for example, in IPA). And that is before
the vagaries of national voting on the 10646 amendments come into
the picture.
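
To make the point concrete, here is a minimal sketch in Python. It assumes
the interpreter's built-in case mappings are a fair stand-in for the
standard's case pairs. The function name and the "+1" reading (lowercase
encoded immediately after its uppercase) are my own illustration, not
anything the committees have defined.

```python
def violates_plus_one_rule(lower: str) -> bool:
    """Return True if `lower` has a one-character uppercase form that is
    NOT encoded at the immediately preceding code point.

    Hypothetical helper: the "+1 rule" is only the presumptive rule
    discussed above, not part of any standard.
    """
    upper = lower.upper()
    if len(upper) != 1 or upper == lower:
        return False  # no simple one-to-one uppercase mapping to test
    return ord(lower) != ord(upper) + 1


def plus_one_violations(start: int, stop: int) -> list[str]:
    """List the code points in [start, stop) whose case pair breaks the rule."""
    return [
        f"U+{cp:04X}"
        for cp in range(start, stop)
        if violates_plus_one_rule(chr(cp))
    ]


if __name__ == "__main__":
    # U+0101 follows the pattern (uppercase at U+0100), but U+00FF and the
    # IPA-derived U+0253 have uppercase forms encoded elsewhere.
    for cp in (0x0101, 0x00FF, 0x0253):
        print(f"U+{cp:04X}", violates_plus_one_rule(chr(cp)))
```

Running plus_one_violations over the IPA Extensions block (0x0250 to 0x02B0)
lists the letters whose uppercase partners are encoded elsewhere rather than
at the adjacent code point.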

> Such a paper might be useful in many applications, but I do not believe
> the techniques you describe provide an acceptable solution for my
> application. Let me explain my dilemma.
> I have a multi-terabyte database. Someone inserts a new string into the
> database. The place I put this data is crucially dependent on various
> factors, including the case of each character within the string. If the
> character is unknown, I can either reject it (which I believe violates
> the Unicode standard), or place it as best I can not knowing whether case
> information is important.

I think there is a misconception here. You need to specify what version
of Unicode your database supports. If the database supports Unicode 1.1
and somebody sends Unicode 2.0 data, you already have a bigger problem --
indices based on Unicode 1.1 won't be valid for Unicode 2.0, and you
don't have the resources in place to handle the 2.0 data. That situation
requires upgrade and conversion -- or you just treat Unicode 2.0 data
as a new and separate encoding and don't mix it with Unicode 1.1.

If the database supports Unicode 2.0 and you get some data in a string purporting
to be Unicode but containing characters which are unassigned in Unicode 2.0,
then it is perfectly conformant to reject the data. You would be dealing
with "bad data", even if it came from some future extension of Unicode,
and it should be (part of) the job of the database to keep that "bad data"
out of the tables. As it stands right now, the nonconformant thing to do
would be to process any unassigned characters while making assumptions as
to their identities or properties (i.e. that they are uppercase or lowercase).
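
A sketch of that gatekeeping, assuming Python and its unicodedata module.
The character tables are whatever version the interpreter ships
(unicodedata.unidata_version), which stands in here for "the version your
database supports". The table and function names are hypothetical.

```python
import unicodedata


def unassigned_code_points(text: str) -> list[str]:
    """Return the code points in `text` that are unassigned (category Cn)
    in the Unicode version this interpreter was built with."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


def insert_string(table: list[str], text: str) -> None:
    """Hypothetical insert routine: reject strings containing unassigned
    code points instead of guessing at their properties."""
    bad = unassigned_code_points(text)
    if bad:
        raise ValueError(
            f"unassigned in Unicode {unicodedata.unidata_version}: {bad}"
        )
    table.append(text)


if __name__ == "__main__":
    table: list[str] = []
    insert_string(table, "Whistler")        # accepted
    try:
        insert_string(table, "abc\u0378")   # U+0378 is unassigned
    except ValueError as err:
        print("rejected:", err)
```

The database then never holds a string whose case or ordering it had to
guess, so an upgrade to a later version only adds newly valid data rather
than invalidating existing rows.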

> Assuming I do place the data within the database, and later on the
> standard defines the character in some other way than my default
> assumptions, then the string will have been placed wrongly. This means
> when I upgrade the database to the new Unicode standard, vast amounts
> of character data will have to be searched and possibly the rows
> associated with this data moved. All index tables of character data
> have to be verified, and rebuilt. This may be acceptable for a small
> database, but is quite unacceptable for the size of databases our customers
> require. I would, therefore, be interested in making wiser default
> assumptions, which is why I am interested in this added structuring of
> the undefined regions of Unicode.

There is another way to alleviate, though not eliminate, this problem. If you
normalize your data store to always use full decompositions for the scripts
with casing, you would insulate your case-dependent operations from most
Latin, Greek, or Cyrillic additions to the standard in the future, at the
cost of larger string fields. Almost all the baseform letters anybody ever
conceived of are already in the standard now. Most additions trickling in are
precomposed forms with one or more accents on the baseform. And case is
defined on the baseform, not on the accented combinations. Compression
techniques can resolve some of the string expansion issues. And you could
always look for hybrid techniques where only the indices depend on calculating
the normalized forms, while the data store uses precomposed characters.
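
A rough sketch of the hybrid technique, assuming Python's unicodedata module
and using canonical decomposition (NFD) as the "full decomposition". The
function names are made up for illustration, and a real index key would need
more than this.

```python
import unicodedata


def case_pattern(text: str) -> str:
    """Report the case of each base character after decomposition:
    'U' upper, 'L' lower, '-' caseless. Combining marks are skipped,
    since case is carried by the baseform, not by the accents."""
    pattern = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            continue  # combining mark: carries no case of its own
        if ch.isupper():
            pattern.append("U")
        elif ch.islower():
            pattern.append("L")
        else:
            pattern.append("-")
    return "".join(pattern)


def index_key(text: str) -> str:
    """Hypothetical case-insensitive index key: decompose first, then
    lowercase, so only baseform letters take part in the case mapping."""
    return unicodedata.normalize("NFD", text).lower()


if __name__ == "__main__":
    stored = "\u00c9t\u00e9"          # precomposed form kept in the data store
    print(case_pattern(stored))       # case decided on E, t, e
    print(ascii(index_key(stored)))   # decomposed, lowercased key for the index
```

With this split, a newly encoded precomposed letter still requires its
decomposition to be known, but the case logic itself only ever sees baseform
letters that are already in the standard.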

Note that the decompositional approach also helps with the collation
order problem, though again it won't save you from all upgrade issues.
The decomposed forms are closer to the sortkeys you have to build, and
with generative rules you are more likely to already have new encoded characters
accounted for than if you just wait to put new precomposed forms into their
place in a collation table. Note that for the collation issue, I can see
no conceivable way that the standardization committees could or would assign
presumptive properties to unassigned coding positions; that way lies madness.
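
As a toy illustration of why decomposed forms sit closer to sortkeys, here
is a two-level key in Python. This is only a sketch under my own
assumptions: it is not a real collation algorithm, the levels are
simplified, and sort_key is a hypothetical name.

```python
import unicodedata


def sort_key(text: str) -> tuple[str, str, str]:
    """Build a simplified sort key from the decomposed form:
    level 1 = lowercased baseform letters, level 2 = combining marks,
    level 3 = the original string as a final tie-breaker."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(
        ch.lower()
        for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch).startswith("M")
    )
    return (base, marks, text)


if __name__ == "__main__":
    words = ["zebra", "\u00e9cole", "eclair", "\u00c9tude"]
    for word in sorted(words, key=sort_key):
        print(word)
```

Because the key is generated from baseform plus accents, an accented
combination that gets a precomposed code point later sorts in the same place
without a new table entry, provided its decomposition is known.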

--Ken Whistler
> I am always interested in alternative approaches, but I do not see an
> acceptable one here.
