Re: Terminology verification

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Thu Oct 30 2003 - 15:51:01 CST


* Lars Marius Garshol
|
| The definition currently says:
|
| <dt>String</dt>
| <dd><p>Strings are sequences of Unicode code points
| conforming to Unicode Normalization Form C <xref to="unicode"/>.</p>

* Kenneth Whistler
|
| I really think this is asking for trouble. A string data type should
| be specified in terms of specific code units, unless you are dealing
| with a level of abstraction where you really are talking about
| *characters* -- in which case any operations you define on such
| abstract strings will also be rather abstract and difficult to tie
| to specific implementations of operations (even such simple things
| as specification of storage and field size, etc.).

Well, this is the level we want to be at. We cannot require a
particular encoding, nor do we want to specify a particular field
size. This data model has to work for anyone, whether they use UTF-7,
-8, -16, or -32.
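
To illustrate the point, here is a minimal sketch in Python 3. The sample text and the helper name scalar_values are made up for this example and are not part of the data model. It shows that the same string round-trips through several encodings as the same sequence of scalar values, even though the byte representations differ:

```python
# Illustrative only: one abstract string, several encodings.
# "text" and "scalar_values" are hypothetical names for this sketch.
text = "na\u00efve \U0001D11E"  # includes U+00EF and a supplementary-plane character


def scalar_values(s):
    """Return the string as a list of integer scalar values."""
    return [ord(ch) for ch in s]


encodings = ["utf-7", "utf-8", "utf-16", "utf-32"]

# Encode to bytes in each encoding, then decode back to a string.
round_trips = {enc: text.encode(enc).decode(enc) for enc in encodings}

# The serialized sizes differ from encoding to encoding.
byte_lengths = {enc: len(text.encode(enc)) for enc in encodings}

for enc in encodings:
    same = scalar_values(round_trips[enc]) == scalar_values(text)
    print(enc, byte_lengths[enc], "bytes, same scalar values:", same)
```

A definition at the scalar value level therefore says nothing about storage size, which is the trade-off being discussed here.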
 
| Also, it is asking for trouble to tie a string data type to a
| particular normalization form. If you do so, that means that you
| would have distinctions between legal and illegal data in your data
| type which would then put you in the position of having to verify
| for legality for any operation involving your string data type.

Correct. We are aware of this, but still chose to do it, for several
reasons.
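
The cost being pointed out can be sketched as follows, assuming Python 3 and its standard unicodedata module. The function name is_valid_model_string is hypothetical. Note in particular that concatenating two strings that are each in NFC can yield a string that is not, so the check has to be repeated after such operations:

```python
import unicodedata


def is_valid_model_string(s):
    """True if s is already in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

print(is_valid_model_string(composed))    # precomposed form is in NFC
print(is_valid_model_string(decomposed))  # decomposed form is not

# Each part is in NFC on its own, but the concatenation is not.
left, right = "e", "\u0301"
print(is_valid_model_string(left), is_valid_model_string(right))
print(is_valid_model_string(left + right))
```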
 
| Contrast the official Unicode definition of a "Unicode string":
|
| "D29a Unicode string: A code unit sequence containing code
| units of a particular Unicode encoding form."

We can't use this, I'm afraid.
    
| That said, you may still want to impose well-formedness conditions
| in your data model for strings. I just don't see it as part of your
| data type definition itself.

That's a good point. Maybe we should make it a constraint outside the
definition of the data type.
 
* Lars Marius Garshol
|
| <p>Strings are equal if they consist of the exact same sequence of
| abstract Unicode characters. This implies that all comparisons are
| case-sensitive.</p>
 
* Kenneth Whistler
|
| You can do this, of course. But you might as well be defining a
| binary comparison on the code *unit* string, which is how this is
| going to end up being implemented, anyway.

Well, we can't really, because we can't tie this to a particular
encoding. But defining comparison as comparison of sequences of scalar
values would work. The question is whether it would be better. What do
you think?
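
For what it is worth, here is a small sketch of how the two definitions relate, assuming Python 3 (the helper names are invented for this example). For well-formed data, equality gives the same answer whether one compares scalar values or UTF-16 code units, but ordering does not:

```python
def scalars(s):
    """The string as a list of scalar values."""
    return [ord(ch) for ch in s]


def utf16_units(s):
    """The string as a list of 16-bit UTF-16 code units (big-endian, no BOM)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


a = "\uff5e"      # U+FF5E, a single code unit in UTF-16
b = "\U00010000"  # U+10000, the surrogate pair D800 DC00 in UTF-16

# Equality agrees between the two views.
print(scalars(a) == scalars(b), utf16_units(a) == utf16_units(b))

# Ordering does not: a sorts before b by scalar value,
# but after b by UTF-16 code unit, because 0xD800 < 0xFF5E.
print(scalars(a) < scalars(b), utf16_units(a) < utf16_units(b))
```

So if the model only ever needs equality, the choice matters little in practice; it starts to matter as soon as an ordering is defined.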
 
* Lars Marius Garshol
|
| Does this make sense? Is "code point" the right term, or should I
| say "scalar value"?
 
* Kenneth Whistler
|
| There is a subtle distinction, since code point includes the
| surrogate code points, which are always ill-formed. Scalar value, by
| definition, excludes the surrogate code points.

Ah, I didn't know that. Clearly we want scalar value, then.
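
The distinction can be written down in a few lines. This is a sketch assuming Python 3; is_scalar_value and can_encode_utf8 are names made up for the example. A scalar value is any code point in the range 0 to 10FFFF except the surrogate range D800 to DFFF:

```python
def is_scalar_value(cp):
    """True if the integer cp is a Unicode scalar value."""
    in_codespace = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_codespace and not is_surrogate


def can_encode_utf8(cp):
    """True if Python can serialize the code point cp as UTF-8."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print(hex(cp), is_scalar_value(cp), can_encode_utf8(cp))
```

Python lets a lone surrogate exist inside a str, but refuses to encode it as UTF-8, which matches the remark that surrogate code points on their own are ill-formed.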
 
* Lars Marius Garshol
|
| And what about "abstract character"?
 
* Kenneth Whistler
|
| That's not what you want, since some abstract characters are not
| encoded (yet), and some abstract characters have two or more
| representations in Unicode. See Figure 2-8 of the Unicode Standard,
| 4.0.

Right.
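
A commonly cited case of one abstract character with several representations is the letter A with ring above. The sketch below assumes Python 3 and its standard unicodedata module, and the variable names are invented for the example. It shows three distinct scalar value sequences that all normalize to the same NFC form:

```python
import unicodedata

# Three encoded representations of what a reader sees as the same letter.
representations = {
    "precomposed letter": "\u00c5",      # LATIN CAPITAL LETTER A WITH RING ABOVE
    "angstrom sign": "\u212b",           # ANGSTROM SIGN
    "base plus combining": "A\u030a",    # "A" followed by COMBINING RING ABOVE
}

# As raw scalar value sequences, all three differ.
raw = {name: [hex(ord(ch)) for ch in s] for name, s in representations.items()}

# After normalization to NFC, all three collapse to a single sequence.
nfc = {name: unicodedata.normalize("NFC", s) for name, s in representations.items()}

for name in representations:
    print(name, raw[name], "->", [hex(ord(ch)) for ch in nfc[name]])
```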
 
* Lars Marius Garshol
|
| Are two equal sequences of code points in NFC necessarily composed
| of the same sequence of abstract characters?
 
* Kenneth Whistler
|
| Yes. Because the mapping of code points to abstract characters is
| fixed (standardized) by the character encoding itself.

Right. So in that case we might as well drop the term "abstract
character". Instead of confusing people with two different terms
(scalar value and abstract character), we can limit ourselves to
confusing them with only one (scalar value).

Thanks a lot for your answers! This really did help, and saved us from
at least one embarrassing mistake.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50                  <URL: http://www.garshol.priv.no >

