Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sat, 5 Jan 2013 23:14:24 -0800

> If for example I sit on a committee that devises a new encoding form, I
> would need to be concerned with the question of which *sequences of Unicode
> code points* are sound. If this is the same as "sequences of Unicode
> scalar values", I would need to exclude surrogates, if I read the standard
> correctly (this wasn't obvious to me on first inspection btw). If for
> example I sit on a committee that designs an optimized compression
> algorithm for Unicode strings (yep, I do know about SCSU), I might want to
> first convert them to some canonical internal form (say, my array of
> non-negative integers). If U+<surrogate values> can be assumed to not
> exist, there are 2048 fewer values a code point can assume; that's good for
> compression, and I'll subtract 2048 from those large scalar values in a
> first step. Etc etc. So I do think there are a number of very general use
> cases where this question arises.
>
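
(As an aside, the offset trick from the last quoted paragraph amounts to
something like the following C++ sketch; the helper names are mine and
purely illustrative:)

    #include <cstdint>

    // Map each Unicode scalar value onto a dense index in [0, 0x10F800),
    // exploiting the fact that no scalar value falls in the surrogate gap
    // U+D800..U+DFFF. Hypothetical helpers, not from any standard API.
    constexpr uint32_t kSurrogateStart = 0xD800;  // first surrogate code point
    constexpr uint32_t kSurrogateCount = 0x800;   // the 2048 excluded values

    // Precondition: v is a Unicode scalar value (so never a surrogate).
    constexpr uint32_t scalarToDenseIndex(uint32_t v) {
        return v < kSurrogateStart ? v : v - kSurrogateCount;
    }

    constexpr uint32_t denseIndexToScalar(uint32_t i) {
        return i < kSurrogateStart ? i : i + kSurrogateCount;
    }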

In fact, these questions have arisen in the past and have been answered
before. A present-day use case is when I author a programming language and
need to decide which values for <val> I accept in a statement like this:
    someEncodingFormIndependentUnicodeStringType str =
        <val, specified in some PL-specific way>
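
For concreteness, the acceptance check I have in mind would be roughly the
following sketch (all names are made up; it simply rejects anything that is
not a sequence of Unicode scalar values):

    #include <cstdint>
    #include <vector>

    // True iff cp is a Unicode scalar value: a code point in
    // U+0000..U+10FFFF that is not a surrogate code point.
    bool isUnicodeScalarValue(uint32_t cp) {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    // Accept <val> only if every element is a scalar value, so the
    // resulting string can be serialized in any Unicode encoding form.
    bool isAcceptableStringLiteral(const std::vector<uint32_t>& codePoints) {
        for (uint32_t cp : codePoints) {
            if (!isUnicodeScalarValue(cp)) {
                return false;
            }
        }
        return true;
    }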

I've looked at the Standard, and I must admit I'm a bit perplexed. Because
of C1, which explicitly states:

    A process shall not interpret a high-surrogate code point or a
    low-surrogate code point as an abstract character.

I do not know why surrogate values are defined as "code points" in the
first place. It seems to me that surrogates are (or should be) an encoding
form–specific notion, whereas I have always thought of code points as
encoding form–independent. It turns out this was wrong. I had always
thought that "code point" conceptually meant "Unicode scalar value", which
is explicitly forbidden to have a surrogate value. Is this only
terminological confusion? I would like to ask: Why do we need the notion of
a "surrogate code point"; why isn't the notion of "surrogate code units [in
some specific encoding form]" enough? Conceptually, surrogate values are
code units used in UTF-16 (or, modulo endianness, byte sequences in the
corresponding encoding schemes). Why would one
define an expression ("Unicode code point") that conceptually lumps
"Unicode scalar value" (an encoding form–independent notion) and "surrogate
code point" (a notion that I wouldn't expect to exist outside of specific
encoding forms) together?

An encoding form maps only Unicode scalar values (that is, all Unicode code
points excluding the "surrogate code points"), by definition. D80 and what
follows ("Unicode string" and "Unicode X-bit string") exist, as I
understand it, *only* so that we have terminology for discussing ill-formed
code unit sequences in the various encoding forms; but all of this talk
seems to me to be encoding form–dependent.

I think the answer to the question I had in mind is that the legal
sequences of Unicode scalar values are (by definition)
    ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* .
But then there is the notion of "Unicode string", which is conceptually
different, by definition. Maybe this is a terminological issue only. But is
there an expression in the Standard that is defined as "sequence of Unicode
scalar values", a notion that seems to me to be conceptually important? I
can see that the Standard defines the various "well-formed <encoding form>
code unit sequence". Have I overlooked something?
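
To make the distinction drawn in D80 and what follows concrete, here is a
sketch (my own helper, with illustrative names) of the difference between
an arbitrary "Unicode 16-bit string" and a well-formed UTF-16 code unit
sequence:

    #include <cstddef>
    #include <string>

    // True iff s is a well-formed UTF-16 code unit sequence: every high
    // surrogate is immediately followed by a low surrogate, and no low
    // surrogate appears unpaired.
    bool isWellFormedUTF16(const std::u16string& s) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            char16_t u = s[i];
            if (u >= 0xD800 && u <= 0xDBFF) {              // high surrogate
                if (i + 1 >= s.size()) return false;       // nothing follows it
                char16_t next = s[i + 1];
                if (next < 0xDC00 || next > 0xDFFF) return false;
                ++i;                                        // skip the low half
            } else if (u >= 0xDC00 && u <= 0xDFFF) {        // stray low surrogate
                return false;
            }
        }
        return true;
    }

    // u"\xD800" is a Unicode 16-bit string in the Standard's sense, but not
    // well-formed UTF-16; u"\xD800\xDC00" (the pair encoding U+10000) is both.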

Why is it even possible to store a surrogate value in something like the
icu::UnicodeString datatype? In other words, why are we concerned with
storing Unicode *code points* in data structures instead of Unicode *scalar
values* (which can be serialized via encoding forms)?
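
(To illustrate the question, assuming I am reading the ICU4C API in
unicode/unistr.h correctly: the following stores an unpaired surrogate
without complaint, precisely because UnicodeString is a sequence of UTF-16
code units rather than of scalar values.)

    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
        icu::UnicodeString s;
        s.append(static_cast<UChar>(0xD800));  // a lone high surrogate code unit
        std::cout << s.length() << std::endl;  // prints 1; the unit is kept as-is
        // Converting s to well-formed UTF-8 or UTF-32 would have to replace
        // or reject this unpaired surrogate.
        return 0;
    }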

Stephan
Received on Sun Jan 06 2013 - 01:20:02 CST
