Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

From: Mark Davis (markdavis34@home.com)
Date: Tue Jun 05 2001 - 17:16:03 EDT


1. By "strict", I meant "excludes irregular sequences".
2. To be precise, U+D800 and U+DC00 are code points and do have
interpretations. They are surrogates. They are *not* characters.
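
To make the two behaviours concrete, here is a minimal Python sketch. It is
only an illustration, not text from the Unicode or XML specs. The function
names decode_strict and decode_lenient are made up for this example, and the
lenient path assumes that Python's "surrogatepass" error handler is an
acceptable stand-in for a lenient parser that accepts the irregular sequence
and then pairs the surrogates.

```python
import unicodedata

# The irregular six-byte sequence discussed in this thread.
IRREGULAR = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])


def decode_strict(data: bytes) -> str:
    """Strict in the sense above: irregular sequences are rejected."""
    # Python's default UTF-8 decoder raises UnicodeDecodeError here.
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Hypothetical lenient decoder: accept surrogates, then pair them."""
    # Step 1: let the UTF-8 decoder pass surrogate code points through.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 joins a high/low surrogate pair
    # into the single supplementary code point it stands for.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le", "surrogatepass")


if __name__ == "__main__":
    try:
        decode_strict(IRREGULAR)
    except UnicodeDecodeError as exc:
        print("strict decoder rejected the sequence:", exc.reason)

    text = decode_lenient(IRREGULAR)
    print("lenient decoder produced:", [f"U+{ord(c):04X}" for c in text])

    # U+D800 is a code point with a defined category (Cs, surrogate),
    # but it is not a character.
    print("category of U+D800:", unicodedata.category("\ud800"))
```

Going through UTF-16 for the pairing step is just a convenience; computing
0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) by hand gives the same code
point.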

----- Original Message -----
From: <Peter_Constable@sil.org>
To: <unicode@unicode.org>
Sent: Tuesday, June 05, 2001 10:31
Subject: Re: UTF-8 syntax (RE: UTF-8S (was: Re: ISO vs Unicode UTF-8))

>
> >I am a little bit confused. I re-read conformance rules and the UTF-8
> >Corrigendum, and I could find these two things:
> >
> >1) The difference between "lenient" vs. "strict" parsers.
>
> That has to do with XML conformance, not Unicode. You were looking in the
> wrong spec.
>
>
> >2) The rule that a UTF-8 sequence like ED A0 80 ED B0 80 should be
> >interpreted (by a lenient parser) as <U+10000> rather than
> ><U+D800 U+DC00>.
>
> Note that U+D800 and U+DC00 are not interpretable code points. They only
> make sense as code units in the UTF-16 encoding form. Your question was
> relating to the coded character set, and on that level there is only one
> possibility: U+10000.
>
>
> >The fact that a "strict" UTF-8 parser rejects sequences (such as
> >ED A0 80 ED B0 80) explicitly mentioned as legal seems even against my
> >idea of conformance.
>
> In Unicode terms, that sequence is legal but irregular. In XML terms, that
> sequence is illegal. Again, two different specs.
>
>
> >Or, as a minimum, it seems to me a sort of higher-level
> >protocol that imposes private syntactical constraints to otherwise legal
> >Unicode text.
>
> That's what it is. Note that there's no reason at all why the XML spec
> can't be more restrictive. There may be some things that are reasonable in
> some contexts but not in others. XML requires (recommends?) data to be
> normalised in normal form C. That imposes private (well, open actually,
> but private in the sense of limited to that protocol) constraints against
> otherwise legal Unicode character sequences.
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>


