Re: UTF-16 inside UTF-8

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 14:48:44 EST

    In a message dated 11/5/2003 11:15:44 AM Pacific Standard Time,
    dewell@adelphia.net writes:
    Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

    >> At the risk of upsetting the open-source faithful, that is just plain
    >> lazy.
    >
    > I don't think you should call it "lazy". It is just "under
    > construction" if such software is still in "alpha". How much
    > software has such support at the "alpha" stage in your company?

    My company is not the best example here; we're well behind the curve
    when it comes to Unicode, and i18n/L10n generally.

    That said, I think it would be much faster and less error-prone, for a
    company adding Unicode support to a product for the first time, to
    support the entire Unicode range from the outset, rather than supporting
    just the BMP in the alpha stage and then "adding" support for
    supplementary characters. For UTF-8 in particular, I can't imagine why
    one would choose to implement the 1-, 2-, and 3-byte forms in one stage
    and add the 4-byte forms in a later stage.

    If you ever move a software implementation from supporting only a
    single-byte charset to supporting full Unicode 4.0, then you will be
    able to imagine it, especially if the project has 20-100 people
    working on it who don't care much about Unicode or international
    support. I have worked on such projects for more than 10 years, and
    to me such a staging approach is very reasonable.

    The reason is very simple. Usually what happens is that the software
    needs to use something other than UTF-8 for internal processing. For
    example, Mozilla takes UTF-8 as input and converts it to UTF-16 for
    internal storage. One reason UTF-8 is not ideal for some internal
    processing is that for operations like "ToUpper" or "ToLower" (or
    collation, etc.) it is much easier to build a UCS-2 based
    case-mapping table than a UTF-8 based one.
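
    For example, with a fixed-width code unit the whole case table can be
    a simple two-level array lookup. Here is a minimal sketch in C (the
    names and table contents are just for illustration, not Mozilla's
    actual code):

    #include <stdint.h>

    typedef uint16_t UniChar;   /* one fixed-width UCS-2 code unit */

    /* Two-level lookup: 256 pages of 256 deltas. Each delta is added to
       the code unit to get its uppercase form; 0 means "no change". Only
       the ASCII page is filled in here; a real table is generated from
       the Unicode character database and covers every populated page. */
    static int16_t ascii_page[256];
    static const int16_t *toupper_pages[256] = { ascii_page };

    static void init_case_table(void)
    {
        for (int c = 'a'; c <= 'z'; c++)
            ascii_page[c] = 'A' - 'a';          /* delta of -32 */
    }

    UniChar uni_toupper(UniChar c)
    {
        const int16_t *page = toupper_pages[c >> 8];
        return page ? (UniChar)(c + page[c & 0xFF]) : c;
    }

    Nothing like that is possible with UTF-8 without decoding first,
    because the table index would be a variable number of bytes.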

    Because of this, software that processes text probably doesn't want
    to use UTF-8 internally. It is fine for software that just stores the
    data or passes it through to use UTF-8 as its internal format, but
    UTF-8 is not ideal as the internal format for software that processes
    the data.

    Then the next reason is that the software may have APIs which take or
    return a character index into a string. For example, if your software
    has APIs like the following:

    int TheFirstCharacterInTheString(String, Character), which returns
    the index of the first occurrence of the character in the string,
    or
    String TheLeftSubString(String, Length), which returns the leftmost
    "Length" characters,

    then UCS-2 or UCS-4 is easy to deal with, and UTF-8 or UTF-16 is much
    harder, because in UCS-2 or UCS-4 you can compute the memory
    requirement / byte offset from the character index, and vice versa,
    while in UTF-8 or UTF-16 you cannot. To return an index or a length,
    you basically need two sets of APIs: one that returns the number of
    "characters" and one that returns the "memory requirement", in case
    the caller needs to allocate memory first.

    Because of this, it is much easier to use UCS-2 or UCS-4 in the API,
    or probably I should say in the private interfaces inside the
    software. However, using UCS-4 doubles the memory requirement
    compared to UCS-2, which already doubles the memory requirement
    compared to single-byte-only support (which, for some software, means
    the last shipped version). Therefore, it is easier to move from
    supporting only single-byte encodings to UTF-8 support that covers
    only the 1- to 3-byte forms in the first version that moves to
    Unicode.

    I am not saying this is the ideal case or that they should do it this
    way. I am just telling you what people will face and think when they
    move from an ISO-8859-1-only implementation to a pure Unicode
    implementation. A lot of the time, they need to deal with one thing
    per step.

    Usually the staging approach is:
    1. Change the internal data type from char to some other data type,
    probably a typedef uniChar. If you ask for uniChar to be 4 bytes, you
    will hit a hard wall, die, and stop there. If you ask for it to be 2
    bytes, you will hit a wall, break both your head and the wall, and
    continue.
    2. Add converters to convert ISO-8859-1 and UTF-8 from/to that
    uniChar (see the sketch after this list).
    3. Migrate all the code.
    4. Talk to people about supporting UTF-16, or about changing uniChar
    to 4 bytes, after you have proved that steps 1-3 bring a lot of value
    and do not cause too many performance/footprint issues.
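
    To make steps 1 and 2 concrete, here is a minimal sketch in C (the
    names are illustrative, and the validation a real converter needs is
    left out):

    #include <stddef.h>
    #include <stdint.h>

    /* Step 1: the new internal code unit. 2 bytes, per the discussion
       above; widening it to 4 bytes is step 4. */
    typedef uint16_t uniChar;

    /* Step 2a: ISO-8859-1 is the easy converter. Every byte maps
       directly to the same code point, so it is a widening copy. */
    size_t latin1_to_unichar(const unsigned char *in, size_t len,
                             uniChar *out)
    {
        for (size_t i = 0; i < len; i++)
            out[i] = in[i];
        return len;                 /* one uniChar per input byte */
    }

    /* Step 2b: UTF-8, deliberately limited to the 1- to 3-byte forms,
       matching the staged approach above. Assumes well-formed input. */
    size_t utf8_to_unichar(const unsigned char *in, size_t len,
                           uniChar *out)
    {
        size_t n = 0, i = 0;
        while (i < len) {
            unsigned char b = in[i];
            if (b < 0x80) {                          /* 1-byte form */
                out[n++] = b;
                i += 1;
            } else if ((b & 0xE0) == 0xC0) {         /* 2-byte form */
                out[n++] = (uniChar)(((b & 0x1F) << 6)
                                     | (in[i+1] & 0x3F));
                i += 2;
            } else if ((b & 0xF0) == 0xE0) {         /* 3-byte form */
                out[n++] = (uniChar)(((b & 0x0F) << 12)
                                     | ((in[i+1] & 0x3F) << 6)
                                     | (in[i+2] & 0x3F));
                i += 3;
            } else {
                i += 1;    /* 4-byte form: not handled at this stage */
            }
        }
        return n;
    }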

    -Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/

    ==================================
    Frank Yung-Fong Tang
    System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only
    Son, that whoever believes in him shall not perish but have eternal
    life."

    Does your software display Thai language text correctly for Thailand users?
    -> Basic Concept of Thai Language, linked from Frank Tang's
    Iñtërnâtiônàlizætiøn Secrets
    Want to translate your English text to something Thailand users can
    understand?
    -> Try English-to-Thai machine translation at
    http://c3po.links.nectec.or.th/parsit/
