Re: Stateful encoding mechanisms

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 19 2005 - 16:51:09 CDT

Next message: Dominikus Scherkl: "Re: AW: AW: ASCII and Unicode lifespan"

Previous message: Magda Danish (Unicode): "What is Unicode in Upper Sorbian?"
In reply to: Dean Snyder: "Re: Stateful encoding mechanisms"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Dean Snyder" <dean.snyder@jhu.edu>
> SURROGATES:
>
> The Unicode Standard 4.1, section 3.9
> "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as
> <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."
>
> How can you say that, for example, the surrogates in this very example
> in TUS are not used in text content?

A stream of code units is NOT text content. "Text" means a stream of
(abstract) characters, i.e. of *assigned* code points. Nothing in a UTF-16
code unit stream guarantees that its code units represent text, or even that
the code points they represent are characters: those code points may be
<unassigned>, i.e. <reserved> for future allocation, or <non-characters>.
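
To make the distinction between the levels concrete, here is a minimal
Python sketch. The helper names utf16_code_units and classify are
hypothetical, invented for this illustration. The first function maps code
points to UTF-16 code units, reproducing the TUS example quoted above; the
second gives a rough classification of a code point. The "unassigned" answer
is an assumption tied to whichever Unicode version the running Python's
unicodedata module ships with.

```python
import unicodedata

def utf16_code_units(code_points):
    """Map a sequence of code points to UTF-16 code units (encoding form)."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            # BMP code point: one code unit with the same value.
            units.append(cp)
        else:
            # Supplementary code point: a high and a low surrogate code unit.
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units

def classify(cp):
    """Rough status of a code point; not a full property lookup."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    if unicodedata.category(chr(cp)) == "Cn":
        # Depends on the Unicode version bundled with this Python.
        return "unassigned"
    return "assigned"

units = utf16_code_units([0x004D, 0x0430, 0x4E8C, 0x10302])
print(" ".join(f"{u:04X}" for u in units))
print(classify(0xD800), classify(0xFFFE), classify(0x10302))
```

The values D800 and DF02 appear only in the list of code units. The
corresponding code point sequence contains U+10302 and no surrogate code
points at all.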

> BOM:
> The Unicode Standard 4.1, section 15.8
> "Detection of U+FFFE at the start of an input stream should be taken as
> a strong indication that the input stream should be byte-swapped before
> interpretation."
>
> Note the use of the word "strong" here, signaling the BOM's ambiguity.
> U+FEFF can occur almost anywhere in a text stream but if it is a BOM it
> is used to interpret the text content, and is therefore, by definition,
> a stateful mechanism. Notice the troublesome possibility of a text
> fragment that happens to begin with U+FEFF used originally as a zero
> width no-break space but now "should be taken as a strong [yet wrong]
> indication that the input stream should be byte-swapped before
> interpretation".

The BOM is NOT a character. The BOM is NOT the code point U+FEFF. A BOM is
only a code unit that may be present within a stream of code units, and
which happens to have the same value as the code unit of the ZWNBSP
character assigned at code point U+FEFF (a use that is deprecated and not
recommended).

In a UTF-16 encoding *scheme* the leading BOM is fully ignorable. But in a
UTF-16 encoding form, there's simply NO BOM and the code point U+FEFF is
legal and represents ZWNBSP.
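
As a rough illustration of the two levels, here is a small Python sketch.
It assumes CPython's standard codecs: "utf-16" behaves like the UTF-16
encoding scheme (it reads a leading BOM to pick the byte order and then
drops it), while "utf-16-be" has a fixed byte order, does no BOM detection,
and is used here only as a stand-in for working directly with code units.
The variable names are made up for the example.

```python
# Bytes FE FF 00 4D: a big-endian BOM followed by the code unit 004D ("M").
data_be = b"\xfe\xff\x00\x4d"
# The same content serialized little-endian: FF FE 4D 00.
data_le = b"\xff\xfe\x4d\x00"

# Encoding scheme level: the BOM selects the byte order and is not text.
scheme_be = data_be.decode("utf-16")
scheme_le = data_le.decode("utf-16")

# Fixed byte order, no BOM detection: the initial FEFF stays in the
# decoded string as the character U+FEFF.
fixed_be = data_be.decode("utf-16-be")

print([f"U+{ord(c):04X}" for c in scheme_be])
print([f"U+{ord(c):04X}" for c in scheme_le])
print([f"U+{ord(c):04X}" for c in fixed_be])
```

Both scheme-level decodes yield the same one-character text, whereas the
fixed-order decode yields two code points, the first being U+FEFF.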

You are mixing several levels in the Unicode character model.

This archive was generated by hypermail 2.1.5 : Thu May 19 2005 - 16:52:02 CDT