RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Fri May 25 2001 - 13:21:13 EDT


Peter,

Another abomination was proposed. Rather than adding UTF-16 support, Oracle
proposed that non-plane-0 characters be converted to and from UTF-8 by
encoding each surrogate of a pair as a separate UTF-8 character.

This way they could convert UTF-16 with their existing UCS-2 logic, turning
each surrogate pair into two 3-byte UTF-8 characters. Proper UTF-16 to UTF-8
conversion requires that the UTF-16 first be converted to UTF-32 (decoding
the surrogate pair into a 32-bit integer) and then encoded into UTF-8, which
yields a single 4-byte sequence. This can be done on a character-by-character
basis, so there is no intermediate buffering requirement.
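
To make the difference concrete, here is a minimal sketch in Python. The
function names are made up for illustration; this is not Oracle's code or
any library's API, only the bit arithmetic the two approaches imply.

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair into one code point (the UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def three_byte_form(unit):
    """Apply the 3-byte UTF-8 bit pattern to a 16-bit value (0x0800..0xFFFF)."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def encode_utf8(cp):
    """Encode one code point as standard UTF-8 (surrogates are rejected)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return three_byte_form(cp)
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def utf16_pair_to_utf8(high, low):
    """Proper route: surrogate pair -> UTF-32 value -> one 4-byte sequence."""
    return encode_utf8(decode_surrogate_pair(high, low))


def pair_encoded_separately(high, low):
    """The proposed route: each surrogate gets its own 3-byte sequence."""
    return three_byte_form(high) + three_byte_form(low)
```

For U+10400 (surrogates D801 DC00), utf16_pair_to_utf8 gives the 4 bytes
F0 90 90 80, while pair_encoded_separately gives the 6 bytes
ED A0 81 ED B0 80. Only one pair is held at a time, so no intermediate
buffer is needed.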

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Peter_Constable@sil.org
Sent: Friday, May 25, 2001 8:29 AM
To: unicode@unicode.org
Subject: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

On 05/25/2001 02:13:36 AM Bill Kurmey wrote:

>Are there not 2 versions of UTF-8, the Unicode Standard (maximum of 4
>octets) and the ISO/IEC Annex/Amendment to 10646 (maximum of 6 octets)?

The distinction between the Unicode and ISO versions of UTF-8 is pretty
irrelevant. ISO UTF-8 allows a maximum of 6 octets because it is designed
to accommodate a larger codespace than Unicode, but the portion of the
codespace beyond U+10FFFF is now permanently reserved. For all practical
purposes, the usable ISO codespace is the same as that for Unicode, and
thus the usable ISO UTF-8 sequences are at most 4 octets long.
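
As a rough illustration, the sequence length depends only on the size of
the value being encoded. The sketch below (Python, with a hypothetical
function name) uses the value ranges of the original 6-octet ISO form;
everything up to U+10FFFF fits in 4 octets.

```python
def utf8_sequence_length(value):
    """Octets needed for a value under the original 6-octet ISO form of UTF-8."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:
        return 1
    if value < 0x800:
        return 2
    if value < 0x10000:
        return 3
    if value < 0x200000:
        return 4  # covers all of Unicode, whose last code point is U+10FFFF
    if value < 0x4000000:
        return 5  # beyond U+10FFFF: reserved, not usable in practice
    if value <= 0x7FFFFFFF:
        return 6  # beyond U+10FFFF: reserved, not usable in practice
    raise ValueError("value exceeds 31 bits")
```

Here utf8_sequence_length(0x10FFFF) is 4; the 5- and 6-octet forms are
reached only by values in the reserved part of the codespace.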

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


