Re: What does it mean to "not be a valid string in Unicode"?

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Sun, 6 Jan 2013 12:34:02 -0800

Some of this is simply historical: had Unicode been designed from the start
with 8-bit and 16-bit forms in mind, some of this could be avoided. But that is
water long under the bridge. Here is a simple example of why we have both
UTFs and Unicode Strings.

Java uses Unicode 16-bit Strings. The following code copies all the code
units from string to buffer:

StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ++i) {
  buffer.append(string.charAt(i));  // copy one 16-bit code unit at a time
}

If Java always enforced well-formedness of strings, then

   1. The above code would break, since there is an intermediate step where
   buffer is ill-formed (when just the first of a surrogate pair has been
   copied); see the sketch after this list.
   2. It would involve extra checks in all of the low-level string code,
   with some impact on performance.
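
A minimal sketch of that intermediate state, using U+1F600 (an arbitrary
supplementary character) as the input:

String string = "\uD83D\uDE00"; // U+1F600: one code point, two code units
StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ++i) {
  buffer.append(string.charAt(i));
  // After the first iteration, buffer holds only \uD83D, an unpaired high
  // surrogate: not well-formed UTF-16, yet a perfectly legal Java String.
}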

Newer implementations of strings, such as Python's, can avoid these issues
because they use a Uniform Model, always dealing in code points. For more
information, see also
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html
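
In Java terms, the same copy done at the code point level would look roughly
like this (a sketch; appendCodePoint keeps a surrogate pair together, so the
buffer never holds an unpaired surrogate as long as the input itself is
well-formed):

StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ) {
  int cp = string.codePointAt(i);  // reads one or two code units
  buffer.appendCodePoint(cp);      // appends them together
  i += Character.charCount(cp);    // advances by 1 or 2
}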

(There are many, many discussions of this in the Unicode email archives if
you have more questions.)

Mark <https://plus.google.com/114199149796022210033>
— The best is the enemy of the good —

On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:

>
> If for example I sit on a committee that devises a new encoding form, I
>> would need to be concerned with the question of which *sequences of Unicode
>> code points* are sound. If this is the same as "sequences of Unicode
>> scalar values", I would need to exclude surrogates, if I read the standard
>> correctly (this wasn't obvious to me on first inspection btw). If for
>> example I sit on a committee that designs an optimized compression
>> algorithm for Unicode strings (yep, I do know about SCSU), I might want to
>> first convert them to some canonical internal form (say, my array of
>> non-negative integers). If U+<surrogate values> can be assumed to not
>> exist, there are 2048 fewer values a code point can assume; that's good for
>> compression, and I'll subtract 2048 from those large scalar values in a
>> first step. Etc etc. So I do think there are a number of very general use
>> cases where this question arises.
>>
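
The "subtract 2048" step above amounts to re-indexing scalar values so that
the surrogate gap disappears; a sketch (the method name is illustrative):

// Maps a Unicode scalar value to a dense index in 0..0x10F7FF by skipping
// the 2048-value surrogate gap U+D800..U+DFFF.
static int toDenseIndex(int scalarValue) {
  return scalarValue < 0xD800 ? scalarValue : scalarValue - 0x800;
}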
>
> In fact, these questions have arisen in the past and were answered then. A
> present-day use case is if I author a programming language and need
> to decide which values for <val> I accept in a statement like this:
> someEncodingFormIndependentUnicodeStringType str = <val, specified in
> some PL-specific way>
>
> I've looked at the Standard, and I must admit I'm a bit perplexed. Because
> of C1, which explicitly states
>
> A process shall not interpret a high-surrogate code point or a
> low-surrogate code point as an abstract character.
>
> I do not know why surrogate values are defined as "code points" in the
> first place. It seems to me that surrogates are (or should be) an encoding
> form–specific notion, whereas I have always thought of code points as
> encoding form–independent. It turns out this was wrong. I had always thought
> that "code point" conceptually meant "Unicode scalar value", which
> is explicitly forbidden to have a surrogate value. Is this only
> terminological confusion? I would like to ask: Why do we need the notion of
> a "surrogate code point"; why isn't the notion of "surrogate code units [in
> some specific encoding form]" enough? Conceptually surrogate values are
> byte sequences used in encoding forms (modulo endianness). Why would one
> define an expression ("Unicode code point") that conceptually lumps
> "Unicode scalar value" (an encoding form–independent notion) and "surrogate
> code point" (a notion that I wouldn't expect to exist outside of specific
> encoding forms) together?
>
> An encoding form maps only Unicode scalar values (that is, all Unicode code
> points excluding the "surrogate code points"), by definition. D80 and what
> follows ("Unicode string" and "Unicode X-bit string") exist, as I
> understand it, *only* in order for us to be able to have terminology for
> discussing ill-formed code unit sequences in the various encoding forms;
> but all of this talk seems to me to be encoding form–dependent.
>
> I think the answer to the question I had in mind is that the legal
> sequences of Unicode scalar values are (by definition)
> ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* .
> But then there is the notion of "Unicode string", which is conceptually
> different, by definition. Maybe this is a terminological issue only. But is
> there an expression in the Standard that is defined as "sequence of Unicode
> scalar values", a notion that seems to me to be conceptually important? I
> can see that the Standard defines the various "well-formed <encoding form>
> code unit sequence". Have I overlooked something?
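
Spelled out as a check over code points, that definition would be roughly
the following (a sketch; the method name is illustrative):

// True iff the code points form a sequence of Unicode scalar values, i.e.
// match ({U+0000..U+10FFFF} \ {U+D800..U+DFFF})*.
static boolean isScalarValueSequence(int[] codePoints) {
  for (int cp : codePoints) {
    if (cp < 0 || cp > 0x10FFFF) return false;       // outside the codespace
    if (0xD800 <= cp && cp <= 0xDFFF) return false;  // surrogate code point
  }
  return true;
}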
>
> Why is it even possible to store a surrogate value in something like the
> icu::UnicodeString datatype? In other words, why are we concerned with
> storing Unicode *code points* in data structures instead of Unicode *scalar
> values* (which can be serialized via encoding forms)?
>
> Stephan
>
>