Re: Terminology verification

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Oct 30 2003 - 15:17:26 CST


Lars Marius Garshol asked:

> I'm working on a specification for a data model and would like to
> check that my definition of the string type makes sense.

Well, language designers and data modelers may want to chime in
with alternate opinions, but here is my two cents on this topic.

>
> The definition currently says:
>
> <dt>String</dt>
> <dd><p>Strings are sequences of Unicode code points
> conforming to Unicode Normalization Form C <xref to="unicode"/>.</p>

I really think this is asking for trouble. A string data type
should be specified in terms of specific code units, unless
you are dealing with a level of abstraction where you really
are talking about *characters* -- in which case any operations
you define on such abstract strings will also be rather
abstract, and difficult to tie to concrete implementations
(even for such simple things as specifying storage and
field size).
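
To make that concrete, here is a quick Python sketch (purely
illustrative, not part of the proposed model): even something as
basic as "length" depends on whether you count code points or
code units.

    s = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

    print(len(s))                           # 2 code points
    print(len(s.encode("utf-8")))           # 3 UTF-8 code units
    print(len(s.encode("utf-16-be")) // 2)  # 2 UTF-16 code units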

Also, it is asking for trouble to tie a string data type to
a particular normalization form. If you do so, you introduce a
distinction between legal and illegal data within the data type
itself, which puts you in the position of having to verify
legality in every operation involving your string data type.
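
Here is a Python sketch of the kind of boundary check this
forces on you (illustrative only; require_nfc is a made-up name):

    import unicodedata

    def require_nfc(s):
        # If the data type itself mandates NFC, every operation
        # that accepts a string has to police this boundary.
        if unicodedata.normalize("NFC", s) != s:
            raise ValueError("string is not in NFC")
        return s

    require_nfc("caf\u00e9")    # precomposed, already NFC: accepted
    require_nfc("cafe\u0301")   # decomposed: raises ValueError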

Contrast the official Unicode definition of a "Unicode string":

"D29a Unicode string: A code unit sequence containing code
   units of a particular Unicode encoding form."
   
That then lets you go on to define a "Unicode 8-bit string",
a "Unicode 16-bit string" or a "Unicode 32-bit string", depending
on which encoding form is appropriate for your purposes.
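
A quick Python sketch of the same text viewed through the three
encoding forms (illustrative only):

    s = "A\u00e9\U0001d11e"   # 'A', e-acute, MUSICAL SYMBOL G CLEF
                              # (the last is outside the BMP)

    print(len(s.encode("utf-8")))           # 7 -- 8-bit code units
    print(len(s.encode("utf-16-be")) // 2)  # 4 -- 16-bit code units
                                            #      (one surrogate pair)
    print(len(s.encode("utf-32-be")) // 4)  # 3 -- 32-bit code units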

Note that the definition of the string per se does not even
require the content of the "Unicode string" to be well-formed,
because to do so puts constraints on the efficiency of low-level
string processing. Even less so would the definition of the
string require the data to be in a particular normalization
form.
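
For example, in Python (illustrative only; Python's str happens
to tolerate lone surrogates, which makes the point visible):

    s = "\ud800"   # a lone high surrogate: allowed in a "Unicode
                   # 16-bit string" per D29a, not well-formed UTF-16

    try:
        s.encode("utf-16-be")          # the validating path rejects it
    except UnicodeEncodeError:
        pass
    raw = s.encode("utf-16-be", "surrogatepass")  # low-level pass-through
    assert raw == b"\xd8\x00"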

That said, you may still want to impose well-formedness conditions
in your data model for strings; I just don't see them as part
of the data type definition itself. If you want the data,
at some appropriate level of abstraction, to always be
nominally in NFC, that would be fine. It is comparable to
the way some commercial databases handle Unicode data:
normalizing on input, so that internal comparisons are always
done on normalized strings and ill-formed data never makes it
into the store.
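
Sketched in Python (illustrative only; store is a made-up name
for the ingestion boundary):

    import unicodedata

    def store(s):
        # Normalize once at the boundary; everything downstream can
        # then compare raw code unit sequences directly.
        return unicodedata.normalize("NFC", s)

    assert store("cafe\u0301") == store("caf\u00e9")  # both stored as NFC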

> <p>Strings are equal if they consist of the exact same sequence of
> abstract Unicode characters. This implies that all comparisons are
> case-sensitive.</p>

You can do this, of course. But you might as well be defining
a binary comparison on the code *unit* string, which is how
this is going to end up being implemented, anyway.
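
In Python terms (illustrative only):

    a = "caf\u00e9"    # precomposed
    b = "cafe\u0301"   # decomposed

    # Code point equality coincides with byte-wise equality of the
    # code unit sequences in any one encoding form -- effectively
    # a memcmp.
    assert (a == b) == (a.encode("utf-8") == b.encode("utf-8"))
    assert "Zurich" != "zurich"   # and it is inherently case-sensitive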

>
> Does this make sense? Is "code point" the right term, or should I say
> "scalar value"?

There is a subtle distinction: "code point" includes the
surrogate code points, which can never occur in well-formed
Unicode text. "Scalar value", by definition, excludes the
surrogate code points.
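
In other words (a Python sketch; is_scalar_value is a made-up
helper):

    def is_scalar_value(cp):
        # Code points span U+0000..U+10FFFF; scalar values are the
        # same range minus the surrogates U+D800..U+DFFF.
        return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

    assert is_scalar_value(0x0041)        # 'A'
    assert not is_scalar_value(0xD800)    # a code point, not a scalar value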

> And what about "abstract character"?

That's not what you want, since some abstract characters are
not encoded (yet), and some abstract characters have two
or more representations in Unicode. See Figure 2-8, and the
further discussion in Section 2.7, of the Unicode Standard,
Version 4.0.
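
The classic example, sketched in Python (illustrative only):

    import unicodedata

    angstrom = "\u212b"   # ANGSTROM SIGN
    a_ring   = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

    assert angstrom != a_ring                   # two code points...
    assert unicodedata.normalize("NFC", angstrom) == a_ring
                                                # ...one character after NFC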

> Are two equal
> sequences of code points in NFC necessarily composed of the same
> sequence of abstract characters?

Yes. Because the mapping of code points to abstract characters
is fixed (standardized) by the character encoding itself.

--Ken

>
> Thanks for any help!
>
> --
> Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
> GSM: +47 98 21 55 50 <URL: http://www.garshol.priv.no >


