Re: UTF-16 inside UTF-8

From: YTang0648@aol.com
Date: Tue Nov 04 2003 - 12:45:24 EST

Next message: Sue and Maurice Bauhahn: "URL of excellent interview with the creator of an excellent Unicode font, Gentium"

Previous message: Peter Kirk: "Re: Collation contractions and reordering, was: Hebrew composition model, with cantillation marks"
Maybe in reply to: Jill Ramonsky: "UTF-16 inside UTF-8"
Next in thread: Doug Ewell: "Ill-formed sequences (was: Re: UTF-16 inside UTF-8)"

In a message dated 11/4/2003 6:44:05 AM Pacific Standard Time,
Jill.Ramonsky@Aculab.com writes:
Hi,

What is a conforming application supposed to do if, when decoding a UTF-8
stream (or indeed a UTF-32 stream, etc.), it encounters a sequence of bytes which
decodes to U+D800, U+DF00 ?

Of course, if such a sequence were encountered during UTF-16 processing it
would be pretty obvious, but I'm not talking UTF-16 any more. At least, not
directly. Nonetheless, such a sequence could arise if Application A encodes text
to a file using UTF-16, which is then read by Application B (a very old, legacy
application, unaware of the existence of codepoints above U+FFFF) and
re-saved in UTF-8.
It is clear that Application B does not conform to Unicode 3.2 or Unicode 4.0,
right?
It is clear that Application A does conform to Unicode 3.2 or Unicode 4.0,
right?

If you have an Application C that reads whatever Application B writes, then
it should not accept the illegal UTF-8 sequence that uses 3 bytes to encode
U+D800 and another 3 bytes to encode U+DF00. This is clear in both Unicode 3.2
and Unicode 4.0.
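
To make the point about Application C concrete, here is a minimal sketch in
Python. It is only an illustration, not a statement of how any particular
product does it. The helper names (is_well_formed_utf8, combine_surrogates)
and the two byte constants are made up for this example; the only library
behavior relied on is that Python 3's built-in strict "utf-8" decoder refuses
3-byte sequences that encode surrogate code points.

```python
# Hypothetical helper names; only Python's built-in codecs are used.

# U+D800 and U+DF00 each written as a 3-byte sequence (what Application B saved).
ILL_FORMED = bytes([0xED, 0xA0, 0x80, 0xED, 0xBC, 0x80])
# U+10300 written as a single 4-byte sequence (what B should have saved).
WELL_FORMED = bytes([0xF0, 0x90, 0x8C, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data decodes under the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # strict mode rejects encoded surrogates
    except UnicodeDecodeError:
        return False
    return True


def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(is_well_formed_utf8(ILL_FORMED))    # Application C should reject this
print(is_well_formed_utf8(WELL_FORMED))   # and accept this
code_point = combine_surrogates(0xD800, 0xDF00)
print(hex(code_point))
print(chr(code_point).encode("utf-8") == WELL_FORMED)
```

In this sketch Application C simply refuses the input. Whether a real
application should instead report an error, substitute U+FFFD, or do something
else with the rejected bytes is a separate design choice that the sketch does
not try to settle.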

This question generalises to ... should all encoding schemes treat surrogate
pairs as surrogate pairs, or just UTF-16 ?

This question generalises further still, to ... do the phrases "surrogate
character" and "surrogate pair" have any meaning whatsoever outside UTF-16?

==================================
Frank Yung-Fong Tang
System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son, that
whoever believes in him shall not perish but have eternal life."

Does your software display Thai language text correctly for Thailand users?
-> Basic Concept of Thai Language linked from Frank Tang's
Iñtërnâtiônàlizætiøn Secrets
Want to translate your English text to something Thailand users can
understand ?
-> Try English-to-Thai machine translation at
http://c3po.links.nectec.or.th/parsit/

This archive was generated by hypermail 2.1.5 : Tue Nov 04 2003 - 13:48:08 EST