RE: UTF-8 and UTF-16 issues

From: Jones, Bob
Date: Wed Jun 28 2000 - 17:48:30 EDT

Has anyone out there taken a cross-platform, non-Unicode-enabled legacy
application and converted it to run UTF-8 instead of UTF-16? I've read
Markus Kuhn's UTF-8/Unicode FAQ, and while it was helpful, it
only addresses Unix. I also have to consider Windows and the AS/400. With
Windows, I assume you would have to build everything with _UNICODE defined,
but leave your strings as char * or char arrays and that at every input
point convert from UTF-16 to UTF-8, or is there some other way within
Windows to tell the OS to give you data in Unicode, especially in a chosen
encoding scheme?
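
For illustration, the "convert at every input point" idea can be sketched as a single transcoding step at the boundary. This is only a sketch in Python, not a recommendation for any particular platform; the function name `utf16_to_utf8` and the sample string are made up for the example. In a C application on Windows, the equivalent step is typically a call to `WideCharToMultiByte` with the `CP_UTF8` code page.

```python
def utf16_to_utf8(raw: bytes, byteorder: str = "little") -> bytes:
    """Transcode BOM-less UTF-16 input to UTF-8 at an input boundary.

    Hypothetical helper: the name and the byteorder parameter are
    assumptions made for this sketch.
    """
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    # Strict decoding raises UnicodeDecodeError on malformed input,
    # such as an unpaired surrogate.
    return raw.decode(codec).encode("utf-8")


# Pretend this arrived from a wide-character API.
wide = "Zürich €5".encode("utf-16-le")
narrow = utf16_to_utf8(wide)
print(len(wide), len(narrow))  # byte counts before and after
```

The rest of the application then sees only byte strings, which is the property the question is relying on.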

If you have done this with a legacy app, what problems did you run into?
Would you do it that way again or would you go ahead and bite the bullet and
modify all your code to be able to handle UTF-16? It seems to me that the
big advantage of processing in UTF-16 instead of UTF-8 is sizes are
consistent for a given number of characters, but the big advantage of UTF-8
is that normal string manipulation still works and less code needs to be
modified. What other pros and cons are there for the different encoding schemes?
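
To make the size trade-off concrete, here is a small sketch that counts UTF-8 bytes and UTF-16 code units for a few sample characters, and shows a byte-oriented split on an ASCII delimiter in UTF-8 data. The sample characters, the `sizes` helper, and the record layout are illustrative assumptions only.

```python
# One sample character from each size class.
samples = {
    "ASCII": "A",
    "Latin": "\u00e9",               # e with acute accent
    "CJK": "\u6f22",                 # a Han ideograph
    "supplementary": "\U00010400",   # outside the 16-bit range
}


def sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code unit count) for ch."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


for name, ch in samples.items():
    print(name, sizes(ch))

# Every byte of a multi-byte UTF-8 sequence has its high bit set, so an
# ASCII delimiter byte never occurs inside another character and a plain
# byte-oriented split stays correct.
record = "name=\u6f22\u5b57;city=Z\u00fcrich".encode("utf-8")
fields = record.split(b";")
print(fields)
```

UTF-16 is one unit per character for everything in the 16-bit range and two units beyond it, while UTF-8 varies from one to four bytes.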

Also, what kind of tricks are used to deal with database column sizes?
Currently, our applications run on SQL Server, Oracle, and DB2/400 and they
all handle Unicode differently. Right now we have the same database column
layout no matter which database is used. I suspect that may have to change,
i.e. CHAR(10) becomes NCHAR(10) on SQL Server, CHAR(30) on Oracle, and
CHAR(20) on DB2/400.
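
The arithmetic behind those column widths can be sketched as a worst-case bytes-per-character multiplication. The table and the `column_bytes` helper below are assumptions for illustration; whether a given database measures a column in bytes or in characters, and which encoding it stores, has to be checked against that product's documentation.

```python
# Worst-case storage per character, in bytes, for each encoding form.
WORST_CASE_BYTES = {
    "ucs-2": 2,      # 16-bit code units only, no surrogate pairs
    "utf-16": 4,     # a surrogate pair is two 16-bit units
    "utf-8-bmp": 3,  # UTF-8 limited to characters in the 16-bit range
    "utf-8": 4,      # UTF-8 including supplementary characters
}


def column_bytes(n_chars: int, encoding: str) -> int:
    """Bytes needed to hold n_chars characters in the worst case."""
    return n_chars * WORST_CASE_BYTES[encoding]


# A 10-character column under the two byte-measured assumptions above.
print(column_bytes(10, "utf-8-bmp"))  # 30
print(column_bytes(10, "ucs-2"))      # 20
```

Under these assumptions the 30 and 20 figures fall out directly, while a column type declared in characters, as with NCHAR(10), keeps the number 10.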


Bob Jones

-----Original Message-----
From: Edward Cherlin
Sent: Sunday, June 25, 2000 7:01 PM
To: Unicode List
Subject: Re: UTF-8 and UTF-16 issues

At 2:48 PM -0800 6/19/00, Markus Scherer wrote:
>"OLeary, Sean (NJ)" wrote:
> > UTF-16 is the 16-bit encoding of Unicode that includes the use of
> > surrogates. This is essentially a fixed width encoding.
>certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit
>units per character. certainly the iuc discussion did not spread
>this under "utf-16" but possibly as "ucs-2".

The essential distinction that Sean refers to is not that all
characters are encoded in the same length, but that all coding
elements are of the same length. This is in contrast not with ISO
10646, but with "double-byte" encodings of CJK text, where escape
sequences are used to switch between runs of 8-bit and 16-bit codes.

The point of the distinction is that in double-byte encodings the
only way to tell the length of the current character is by parsing
from the beginning of the file. In Unicode, the current 16-bit value
is explicitly a 16-bit character code (assigned, unassigned, or
Private Use), an upper surrogate code, a lower surrogate code, or not
a character code, without reference to what has gone before in the text.
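
The self-identifying property described here can be sketched as a classifier that looks at one 16-bit value at a time. This is an illustrative sketch using the code point ranges of the current Unicode Standard, where "high" and "low" surrogate correspond to "upper" and "lower" above; the names `classify_unit` and `char_start` are invented for the example.

```python
def classify_unit(u: int) -> str:
    """Classify a single 16-bit code unit without any other context."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    if 0xD800 <= u <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= u <= 0xDFFF:
        return "low surrogate"
    if 0xE000 <= u <= 0xF8FF:
        return "private use"
    if u in (0xFFFE, 0xFFFF) or 0xFDD0 <= u <= 0xFDEF:
        return "noncharacter"
    return "character code"  # assigned or unassigned


def char_start(units: list, i: int) -> int:
    """Index where the character containing units[i] begins."""
    if (i > 0 and classify_unit(units[i]) == "low surrogate"
            and classify_unit(units[i - 1]) == "high surrogate"):
        return i - 1
    return i


# "a", U+10400 (a surrogate pair), "b" as UTF-16 code units.
raw = "a\U00010400b".encode("utf-16-le")
units = [int.from_bytes(raw[k:k + 2], "little") for k in range(0, len(raw), 2)]
print([classify_unit(u) for u in units])
print(char_start(units, 2))  # lands on the high surrogate at index 1
```

Finding a character boundary needs at most one step backward, never a scan from the start of the data.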

Edward Cherlin
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland
