Re: newbie: unicode (when used as a coding) = UTF16LE?

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Feb 13 2003 - 02:18:02 EST

Next message: Manoj Jain: "Unicode 4.0 Beta - Glyphs of proposed addition of Characters in Gujarati"

Previous message: starner@okstate.edu: "Re: newbie: unicode (when used as a coding) = UTF16LE?"
In reply to: Jungshik Shin: "Re: newbie: unicode (when used as a coding) = UTF16LE?"
Next in thread: Jungshik Shin: "Re: newbie: unicode (when used as a coding) = UTF16LE?"
Reply: Jungshik Shin: "Re: newbie: unicode (when used as a coding) = UTF16LE?"

Zhang Weiwu <weiwuzhang at hotmail dot com> asked:

> Is it that, when people say "unicode" without UTF, they mean UTF16LE?

and Jungshik Shin <jshin at mailaps dot org> responded:

> No, UTF-16LE is just one of many Unicode transformation form(at)s.
> Each UTF has its own pros and cons and you have to choose
> whatever is appropriate for your own need.

but I'm not sure that answered the question Weiwu was really asking.

It is true that when Windows and other Microsoft products refer to
"Unicode," without qualification, they usually mean UTF-16
little-endian. (Note that "UTF-16 little-endian" is not technically the
same as "UTF-16LE"; the former implies the presence of a BOM while the
latter implies that none is present.)
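
To illustrate the distinction, here is a minimal Python sketch. It
assumes Python's standard codec names ("utf-16-le", "utf-16") and the
constant codecs.BOM_UTF16_LE; the sample text "A" is arbitrary.

```python
import codecs

text = "A"  # U+0041, chosen only as a sample

# The "utf-16-le" codec writes little-endian code units and no BOM.
without_bom = text.encode("utf-16-le")

# Little-endian UTF-16 with a BOM: the same bytes, preceded by FF FE.
with_bom = codecs.BOM_UTF16_LE + without_bom

print(without_bom.hex())  # 4100
print(with_bom.hex())     # fffe4100

# A generic "utf-16" decoder consumes the BOM and uses it to pick the byte order.
print(repr(with_bom.decode("utf-16")))     # 'A'

# A "utf-16-le" decoder treats the same two bytes as the character U+FEFF.
print(repr(with_bom.decode("utf-16-le")))  # '\ufeffA'
```

The last line shows why the two labels are not interchangeable: data
declared as UTF-16LE is not expected to start with a BOM, so a leading
FF FE is decoded as part of the text.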

Despite this Microsoft convention, however, it is not true that
"Unicode" automatically means UTF-16, of any type. This was once the
case -- as late as TUS 3.0 (The Unicode Standard, Version 3.0), we were
told that "Plain Unicode text consists of sequences of 16-bit character
codes" (p. 12) -- but it is no longer true. UTF-8 and UTF-32 are now on
equal footing with UTF-16.

If you do include a BOM, I don't see any reason you can't send
little-endian UTF-16 down the line. The "preference" of big-endian
UTF-16 over little-endian has to do with the assumption to be made when
no BOM is present. When there is a BOM, no assumptions are necessary;
software should interpret text as BE or LE depending on the byte
orientation of the BOM.
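
The rule described above can be sketched in Python roughly as follows.
The function name decode_utf16 is hypothetical, and the sketch assumes
the input is a complete UTF-16 byte sequence; it is an illustration of
the BOM logic, not a full decoder with error handling.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes: honor a BOM if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    # No BOM: this is the only case where an assumption is needed.
    return data.decode("utf-16-be")


print(decode_utf16(b"\xff\xfeA\x00"))  # A  (little-endian, BOM present)
print(decode_utf16(b"\xfe\xff\x00A"))  # A  (big-endian, BOM present)
print(decode_utf16(b"\x00A"))          # A  (no BOM, big-endian assumed)
```

The fallback is written out explicitly because Python's built-in
"utf-16" codec, when it finds no BOM, uses the platform's native byte
order rather than big-endian.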

(BTW, I thought Weiwu's so-called "newbie question" was much better
expressed and demonstrated better understanding of Unicode than many
non-newbie questions I have seen on this list.)

-Doug Ewell
Fullerton, California

This archive was generated by hypermail 2.1.5 : Thu Feb 13 2003 - 03:06:11 EST