RE: What does it mean to "not be a valid string in Unicode"?

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 4 Jan 2013 22:54:43 +0000

Yannis' use of the terminology "not ... a valid string in Unicode" is a little confusing there.

A Unicode string with the sequence, say, <U+0300, U+0061> (a combining grave accent, followed by "a") is "valid" Unicode in the sense that it simply consists of two Unicode characters in sequence. It is aberrant, certainly, but the way to describe that aberrancy is that the string starts with a defective combining character sequence (a combining mark with no base character to apply to). And it would be non-conformant to the standard to claim that that sequence actually represented (or was equivalent to) the Latin small letter a with grave ("à").
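
To make that concrete, here is a quick Python 3 sketch (the use of Python and its standard unicodedata module is just one convenient illustration, nothing from the standard itself):

    import unicodedata

    defective    = "\u0300a"   # COMBINING GRAVE ACCENT followed by "a" -- starts with a defective sequence
    well_ordered = "a\u0300"   # "a" followed by the combining grave accent

    # Both are perfectly legal Unicode strings of two code points.
    print(len(defective), len(well_ordered))                        # 2 2

    # NFC composes the well-ordered sequence into U+00E0 ("à") ...
    print(unicodedata.normalize("NFC", well_ordered) == "\u00e0")   # True

    # ... but the defective sequence is not equivalent to "à"; it is left as-is.
    print(unicodedata.normalize("NFC", defective) == "\u00e0")      # False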

There is a second potential issue, which is whether any particular Unicode string is "ill-formed" or not. That issue comes up when examining actual code units laid out in memory in a particular encoding form. A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. That is a separate issue from whether the string starts with a defective combining character sequence.
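
Again, a rough Python 3 sketch of that distinction (the specific byte values are chosen just for illustration):

    # The defective-but-valid string above is still perfectly well-formed UTF-8:
    "\u0300a".encode("utf-8")          # b'\xcc\x80a'

    # A well-formed UTF-8 encoding of U+00E0 ("à") decodes normally ...
    b"\xc3\xa0".decode("utf-8")        # 'à'

    # ... whereas a byte sequence that violates the UTF-8 specification is
    # ill-formed, and a conformant decoder must reject it.
    try:
        b"\xc3\x28".decode("utf-8")    # 0xC3 requires a continuation byte (0x80-0xBF); 0x28 is not one
    except UnicodeDecodeError as err:
        print(err)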

For "defective combining character sequence", see D57 in the standard. (p. 81)

For "ill-formed", see D84 in the standard. (p. 91)

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

--Ken

> In the book, Fonts & Encodings (p. 61, first paragraph) it says:
>
> ... we select a substring that begins
> with a combining character, this new
> string will not be a valid string in
> Unicode.
>
> What does it mean to not be a valid string in Unicode?
>
> /Roger
>
Received on Fri Jan 04 2013 - 16:56:32 CST
