Re: Unicode in source

From: John Cowan (cowan@locke.ccil.org)
Date: Thu Jul 22 1999 - 17:15:46 EDT


G. Adam Stanislav wrote:
 
> Internally a program will presumably decode UTF-8 into whatever format it
> uses. As for being stored on disk, what if the disk is on a LAN consisting
> of PC's and Macs? Should it be stored in little-endian or big endian order?

Either way, but with an appropriate BOM, and good software will be
able to cope.
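
As a rough illustration of what "coping" with the BOM can look like, here
is a minimal sketch in Python. The function name decode_utf16_with_bom is
made up for this example, and rejecting BOM-less input is just one possible
policy, not something the Unicode standard mandates.

```python
import codecs

def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using the leading BOM to pick the byte order.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    # Assumption: without a BOM we refuse to guess the byte order.
    raise ValueError("no UTF-16 BOM found; byte order is ambiguous")

text = "source"
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # as a PC might write it
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # as a Mac might write it

# Both byte orders decode to the same text once the BOM is honored.
assert decode_utf16_with_bom(little) == decode_utf16_with_bom(big) == text
```

Python's built-in "utf-16" codec already performs this BOM check on decode;
the explicit version is only meant to show the mechanism.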
 
> Besides, UTF-16 can only contain the first plane.

No, that's UCS-2 (which is moribund). UTF-16 handles planes 0-0x10,
which is rather more than all the planes there will ever be.
Current plans are 1 for obscure and archaic scripts, 2 for
obscure and archaic Han characters, 0xE for special magic,
and 0xF and 0x10 for private use.
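
The reach of UTF-16 beyond plane 0 comes from surrogate pairs. The sketch
below, in Python, shows the arithmetic; the helper names to_surrogates and
from_surrogates are invented for this example.

```python
def to_surrogates(cp: int):
    """Split a code point in planes 1-0x10 into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1-0x10")
    v = cp - 0x10000                 # 20-bit offset into the 16 upper planes
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# First code point of plane 1 and last code point of plane 0x10.
assert to_surrogates(0x10000) == (0xD800, 0xDC00)
assert to_surrogates(0x10FFFF) == (0xDBFF, 0xDFFF)
assert from_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF
```

1024 high surrogates times 1024 low surrogates gives 2**20 code points,
that is, 16 planes on top of plane 0.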

> Even though, strictly
> speaking, Unicode is 16-bit, the ISO standard (is it 10646?) is 32-bit.

31-bit. But the codes above 0010FFFF will never be assigned.

> > o There is less text expansion for non-Latin languages.
>
> Yes, but with a well written expansion library (that I have been proposing)
> it happens fast and is completely transparent to the compiler writer.

I think the issue is speed, not space. UTF-16 can replace double-byte
character sets fairly easily, but UTF-8 makes for 50% expansion
(three bytes per character instead of two).
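
The expansion figure can be checked with a short Python sketch. The sample
string and the choice of Shift_JIS as the double-byte character set are
assumptions made for this example.

```python
# Sample text chosen for illustration: eight Japanese characters, all in plane 0.
sample = "日本語のテキスト"

dbcs_size = len(sample.encode("shift_jis"))    # legacy double-byte encoding
utf16_size = len(sample.encode("utf-16-le"))   # no BOM, 2 bytes per character
utf8_size = len(sample.encode("utf-8"))        # 3 bytes per character here

# UTF-16 matches the double-byte size; UTF-8 is half again as large.
assert dbcs_size == utf16_size == 16
assert utf8_size == 24
assert utf8_size / utf16_size == 1.5
```

Text that mixes in ASCII changes the ratio, since ASCII takes one byte in
UTF-8 and two in UTF-16.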
 
> Again, that can be completely transparent. More importantly, TCHAR is of
> different sizes in different OS's. For example, under Windows 95+/NT, TCHAR
> is 16 bits wide. Under FreeBSD (and probably other Unices) it is 32 bits
> wide.

I think you are confusing wchar_t (a C standard) with TCHAR (a Microsoft
idea). TCHAR is 16 bits in Unicode mode and 8 bits in "ANSI" (8-bit
code page) mode.
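
For comparison, the width of wchar_t on a given platform can be inspected
from Python through ctypes. This is only a sketch: it reports the C
library's wchar_t, and says nothing about TCHAR, which is a Windows header
macro with no Python equivalent.

```python
import ctypes

# Size in bytes of the platform's C wchar_t as seen by ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

# Typically 2 on Windows and 4 on most Unix-like systems.
print("wchar_t is", wchar_bytes * 8, "bits wide on this platform")
```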
 
> But editors on both system can handle this minor quirk.

Some editors. Try Notepad (the standard Windows plaintext editor),
which can cope with UTF-16 fine but is baffled by bare-LF.

-- 
	John Cowan	http://www.ccil.org/~cowan	cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


