[unicode] Re: UCS-2 Files

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Mar 22 2001 - 10:03:00 EST


Tomas McGuinness wrote:
> I have a question relating to UCS-2. I am currently developing a product
> that will support UCS-2 and I have been sent several documents encoded in
> UCS-2. I have no reader or writer for UCS-2 but I have performed Hexdumps in
> UNIX. At the beginning of the UCS-2 characters there are two rogue
> characters 0xFF and 0xFE. Have these characters any importance?

They are quite important, yes. See
http://www.unicode.org/unicode/faq/utf_bom.html#24 for details.

But beware that they are NOT characters: they are OCTETS (also known as
"bytes")!

The first thing that I'd suggest you do when you start working with
Unicode and other character sets is to carefully separate the terms "byte"
and "character". Better still if you also keep the distinction between
"octet" (a series of 8 bits) and "byte" (a series of n bits, where n is
often but NOT always 8).
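
As a small illustration of the difference, here is a Python sketch. It
assumes Python 3 and its built-in "utf-16-le" and "utf-32-le" codecs; the
sample string is made up. One character becomes a different number of
octets depending on the encoding form:

```python
text = "A"                            # one character: U+0041

as_utf16 = text.encode("utf-16-le")   # one 16-bit unit, little endian, no BOM
as_utf32 = text.encode("utf-32-le")   # one 32-bit unit, little endian, no BOM

print(len(text))       # number of characters
print(len(as_utf16))   # number of octets in the 16-bit form
print(len(as_utf32))   # number of octets in the 32-bit form
print(as_utf16.hex(), as_utf32.hex())
```

The character count stays at one, while the octet count depends entirely
on how the character is serialized.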

In brief, those two octets tell you that:

1. It is a Unicode text file.

2. It is in UCS-2, UTF-16, or UTF-32 format. To determine whether it is
UTF-32, read the next two octets: if they are 0x00 0x00, then it is UTF-32.
Otherwise it is either UCS-2 or UTF-16, which you basically don't need to
distinguish.

3. The 16-bit units are little endian, so you have to interpret these
two octets as (0xFF + 0xFE * 256), which yields 0xFEFF, the code of the
"BOM" (byte order mark).

4. All subsequent pairs of octets a,b are interpreted the same way: (a
+ b * 256).
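
The four points above can be sketched in Python roughly as follows. This is
only a minimal illustration, assuming the whole file is already in memory
as a bytes object; the function names sniff_bom and decode_units_le and the
sample data are made up for this example. Note that a UCS-2 file whose
first character after the BOM is U+0000 would be mistaken for UTF-32 by
this check.

```python
def sniff_bom(data: bytes) -> str:
    """Guess the encoding form from the leading octets (little endian only)."""
    if data[:2] == b"\xff\xfe":
        # FF FE followed by 00 00 is the little-endian UTF-32 signature.
        if data[2:4] == b"\x00\x00":
            return "UTF-32, little endian"
        return "UCS-2 or UTF-16, little endian"
    return "no FF FE signature"


def decode_units_le(data: bytes) -> list:
    """Combine each pair of octets a, b into one 16-bit unit: a + b * 256."""
    # A trailing odd octet, if any, is ignored.
    return [data[i] + data[i + 1] * 256 for i in range(0, len(data) - 1, 2)]


sample = b"\xff\xfe" + b"H\x00i\x00"   # hypothetical file: BOM, then "Hi"
units = decode_units_le(sample)

print(sniff_bom(sample))
print([hex(u) for u in units])         # the first unit is the BOM, 0xfeff
```

The first unit returned is the BOM itself, so a real reader would drop it
before treating the remaining units as text.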

Regards.
_ Marco


