RE: (M) best way to handle Unicode data internally?

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Mon Aug 31 1998 - 11:52:47 EDT


> -----Original Message-----
> From: Marco Mussini [mailto:marco.mussini@vim.tlt.alcatel.it]

> In an application that runs mostly in
> Chinese/Japanese, this means wasting space.

Not as much as you'd think for modern applications that aren't focused on
text processing. A typical localized application in Japanese has a great deal
of ASCII-based strings in it. You'd only start to pay a real penalty for
storing Japanese in UTF-8 when you're processing large amounts of pure plain
text (word processing a book, etc.).
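
To put rough numbers on that, here is a quick Python sketch. The two sample
strings are made up for illustration (they aren't from any real application),
and the helper name encoded_sizes is just something I picked:

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical UI/markup string: mostly ASCII with a short Japanese label.
ascii_heavy = '<a href="index.html">ホーム</a>'

# Hypothetical run of pure Japanese prose.
pure_japanese = "吾輩は猫である。名前はまだ無い。"

print(encoded_sizes(ascii_heavy))    # (34, 56): UTF-8 is smaller
print(encoded_sizes(pure_japanese))  # (48, 32): UTF-8 is 1.5x larger
```

Each kana or kanji here costs 3 bytes in UTF-8 against 2 in a 16-bit
encoding, while each ASCII character costs 1 against 2, so the balance
depends entirely on the mix.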

> Unfortunately most OSs that do offer today support for some flavor of
> Unicode in their API offer it in UTF-8 and not UCS16.

Amongst the Un*xes, yes. But the Win32 API uses 16-bit values, not UTF-8.
The CE version is Unicode only. The NT version is Unicode or ANSI [sic].
95/98 is stripped down, but the functions that do support Unicode do it in
16-bit.

COM, which is the foundation of ActiveX, also makes its calls (except DAO)
with 16-bit wide Unicode.

Most older OSes (and RDBMSes) offer Unicode through UTF-8 because they don't
have an easy way to convert their legacy 8-bit character interfaces to 16-bit
without breaking compatibility.

BeOS, a newer OS, uses UTF-8 internally, though; the developer's guide
claims that this is for compatibility (with ASCII) and to save space.

> Java internally works in UCS16 (ideally) but it is likely to communicate
> to the external world in UTF-8.

Class files store strings in a quasi-UTF-8 format. Interestingly, NTFS stores
file names in 16-bit format. Long-file-name FAT and FAT32 are designed to
handle 16-bit Unicode, but that was never implemented in 95.
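
For the curious, the "quasi" part of the class file format comes down to two
differences from standard UTF-8: U+0000 is written as the two bytes C0 80
(so no zero byte ever appears inside a string), and anything outside the
16-bit range is written as a surrogate pair with each half encoded
separately in 3 bytes. A rough Python sketch of an encoder along those lines
follows. The function name is my own, and this is an illustration of the
format, not code from any JDK:

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text in the modified UTF-8 variant used by Java class files."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL becomes an overlong two-byte sequence instead of 0x00.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each
            # 16-bit unit as its own three-byte sequence.
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                out += bytes((
                    0xE0 | (unit >> 12),
                    0x80 | ((unit >> 6) & 0x3F),
                    0x80 | (unit & 0x3F),
                ))
        else:
            # Everything else matches standard UTF-8; "surrogatepass"
            # lets a lone surrogate through as a three-byte sequence.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

print(encode_modified_utf8("A\x00"))  # b'A\xc0\x80'
```

Plain ASCII and ordinary kana/kanji come out byte-for-byte identical to
standard UTF-8, which is why the difference is easy to overlook.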

> Is it reasonable to use UCS16 for external communications?

So long as you know the medium will pass all 8-bit values untouched and both
ends know what byte-order to expect, sure.
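
A small Python sketch of what goes wrong when the two ends disagree on byte
order (the sample text is arbitrary):

```python
text = "日本"  # U+65E5 U+672C

big_endian = text.encode("utf-16-be")     # bytes 65 E5 67 2C
little_endian = text.encode("utf-16-le")  # bytes E5 65 2C 67

# The receiver wrongly assumes little-endian for big-endian data.
# No error is raised; the text is silently read as different characters.
misread = big_endian.decode("utf-16-le")

print(misread == text)                    # False
print([hex(ord(c)) for c in misread])     # ['0xe565', '0x2c67']
```

Nothing fails loudly here, which is why the byte order has to be agreed on
(or signalled) out of band.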

If the U+FEFF BOM is at the beginning of the file, both Communicator and
Explorer will autodetect it. XML processors also understand this and will
auto-switch to UCS-2.
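
The detection itself is simple enough to sketch in Python. This is only an
illustration under my own assumptions: the function name and the big-endian
fallback are mine, it handles 16-bit data only, and it is not how any
particular browser or XML parser is actually implemented:

```python
import codecs

def decode_utf16_with_bom(data: bytes, default_order: str = "big") -> str:
    """Decode 16-bit Unicode bytes, using a leading U+FEFF BOM if present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to whatever the two ends agreed on beforehand.
    codec = "utf-16-be" if default_order == "big" else "utf-16-le"
    return data.decode(codec)

sample = codecs.BOM_UTF16_LE + "日本語".encode("utf-16-le")
print(decode_utf16_with_bom(sample))  # 日本語
```

The trick works because U+FFFE, which is what a byte-swapped BOM would read
as, is guaranteed never to be a character.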


