UCS-4, UCS-2, UTF-16, UTF-8

From: ohmson ohmson (ohmson@netscape.net)
Date: Wed Feb 16 2000 - 19:16:42 EST


Hi Folks,

  I have been lurking on the mailing list for a little
while and have learnt a great deal from it. I have visited
the unicode.org site and read most of the material there
(we are waiting for the Unicode 3.0 book to arrive any day
now). I also followed Markus Kuhn's FAQ on writing Unicode-enabled
applications on UNIX (hence the UTF-8 bias).

  Our team is getting ready to write a client/server
prototype that is going to be I18N. One of the big
debates we keep having is which of the encodings
shown in the subject we should use for the data in the
database. I started by listing some obvious
pros and cons and would very much appreciate hearing what
you folks with the necessary development experience think
of them. To give more perspective, we are using C++
as the programming language.

UCS-4
  pros:
    - no conversion from UNICODE code points to representation,
easiest for programming
  cons:
    - major storage wastage (4 bytes per character), as Unicode only
spans about ~1.1 million code points (U+0000 to U+10FFFF) and,
furthermore, only ~65k (the BMP) are of significant interest.

UCS-2
  pros:
    - no conversion from UNICODE code points to representation,
easiest for programming
    - native to Win NT
  cons:
    - cannot represent code points beyond the BMP

UTF-16
  pros:
    - all code points are encoded
    - native to Win2000
    - 2 bytes per character for most natural languages
  cons:
    - needs a conversion algorithm (surrogate pairs beyond the BMP)
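
To make the UTF-16 "conversion algorithm" con concrete, here is a minimal sketch of the surrogate-pair arithmetic. The names encode_utf16 and decode_surrogates are made up for illustration and are not from any library; the sketch assumes code points are limited to U+0000..U+10FFFF.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Returns an empty vector for values that cannot be encoded.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // out of range, or a value reserved for surrogates
    }
    if (cp < 0x10000) {
        // BMP: the code point is stored as-is in one 16-bit unit.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Beyond the BMP: split the remaining 20 bits across two units.
        std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Hypothetical helper: combine a high/low surrogate pair into a code point.
// Assumes hi is in D800..DBFF and lo is in DC00..DFFF (not checked here).
std::uint32_t decode_surrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000
         + ((static_cast<std::uint32_t>(hi) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00);
}
```

For text that stays inside the BMP this is the same as UCS-2; the extra branch only matters for code points at U+10000 and above.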

UTF-8
  pros:
    - all code points are encoded
    - native to UNIX
    - friendly to sockets programming (byte-oriented, no byte-order issues)
  cons:
    - needs a conversion algorithm
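
For comparison, the UTF-8 conversion can be sketched roughly as below. The function name encode_utf8 is made up for illustration, and the sketch assumes only code points up to U+10FFFF need to be handled, so it emits at most 4 bytes and skips the longer forms that the 31-bit UCS-4 definition allows.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as a UTF-8 byte sequence.
// Returns an empty string for values that cannot be encoded.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // out of range, or a value reserved for surrogates
    }
    if (cp < 0x80) {
        // ASCII: one byte, unchanged.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (rest of the BMP, incl. CJK)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (beyond the BMP)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The output is a plain byte string with no embedded zero bytes for non-zero code points, which is why it passes through char*-based code and byte streams without special handling.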

I won't go into the storage cost of UTF-16 vs. UTF-8 because I think it
depends on the language (a CJK character requires 2 bytes in the former but
3 bytes in the latter).
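
The per-character sizes behind that remark can be written down as a rough sketch. The function names utf8_bytes and utf16_bytes are made up for illustration, and both assume a valid code point no larger than U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed to store one code point in UTF-8.
std::size_t utf8_bytes(std::uint32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // e.g. Latin-1 supplement, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the BMP, including CJK
    return 4;                    // beyond the BMP
}

// Hypothetical helper: bytes needed to store one code point in UTF-16.
std::size_t utf16_bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one unit in the BMP, a surrogate pair beyond
}
```

So ASCII-heavy data is half the size in UTF-8, CJK-heavy data is about one and a half times the size in UTF-8, and anything beyond the BMP costs 4 bytes either way.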

Thx much, ohmson
  




This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT