From: Doug Ewell (email@example.com)
Date: Fri Dec 03 2004 - 00:21:22 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> All UTF encodings (including the SCSU compressed encoding, or BOCU-8
> which is a variant of UTF-8, or also now the GB18030 Chinese standard
> which is now a valid representation of Unicode) have their pros and
UTFs by definition are stateless and have exactly one valid
representation for each code point. So SCSU, much as I like it, is not
a UTF.
BOCU-1 is also not a UTF, and in particular there is no conceivable way
it can be regarded as "a variant of UTF-8." I have no idea what
"BOCU-8" is. Maybe that one really is a variant of UTF-8.
Though not promulgated by Unicode, GB18030 can be considered a UTF,
since it is really just a mapping from Unicode code points to sequences
of 1, 2, or 4 bytes.
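
A minimal sketch of the 1-, 2-, or 4-byte mapping, assuming Python's standard "gb18030" codec is available. The sample characters and the helper name gb18030_length are chosen here purely for illustration and are not from the original message.

```python
# Illustrative only: measure how many bytes GB18030 uses per code point.
def gb18030_length(ch: str) -> int:
    """Return the number of bytes GB18030 uses for a single code point."""
    return len(ch.encode("gb18030"))

# Hypothetical sample set: an ASCII letter, a common Han character,
# and a supplementary-plane code point.
samples = ["A", "\u4e2d", "\U00010000"]

for ch in samples:
    encoded = ch.encode("gb18030")
    # The mapping is reversible: decoding gives back the same code point.
    assert encoded.decode("gb18030") == ch
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({gb18030_length(ch)} bytes)")
```

Every code point round-trips, which is the property that lets GB18030 be treated as a full representation of Unicode.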
> SCSU is excellent for immutable strings, and is a *very* tiny overhead
> above ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is
> extremely trivial, may be even simpler than to UTF-8!)
An ISO 8859-1 string that contains no controls except NUL, CR, LF, and
Tab is *already* in SCSU. No conversion needed.
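
The check implied here can be sketched as follows. This is a hedged illustration, not a full SCSU validator: it assumes the decoder starts in the default SCSU state (single-byte mode, with the active dynamic window covering U+0080..U+00FF), and the function name latin1_is_already_scsu is hypothetical.

```python
# C0 control bytes that SCSU passes through unchanged in single-byte mode.
# Every other byte below 0x20 is an SCSU tag byte and would be misread.
PASS_THROUGH_CONTROLS = frozenset({0x00, 0x09, 0x0A, 0x0D})  # NUL, Tab, LF, CR

def latin1_is_already_scsu(data: bytes) -> bool:
    """Return True if ISO 8859-1 bytes can be read as SCSU for the same text.

    Assumes the default initial SCSU state, so bytes 0x20..0x7F stand for
    themselves and bytes 0x80..0xFF map to U+0080..U+00FF.
    """
    return all(b >= 0x20 or b in PASS_THROUGH_CONTROLS for b in data)

text = "na\u00efve caf\u00e9\r\n"
print(latin1_is_already_scsu(text.encode("latin-1")))
print(latin1_is_already_scsu(b"bell\x07"))
```

When the function returns True, the Latin-1 bytes can be handed to an SCSU decoder as they are. When it returns False, the offending control characters would first need SCSU's quoting mechanism.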
I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format. The effort to encode
and decode it, while by no means Herculean as often perceived, is not
trivial once you step outside Latin-1.