Re: UTF8 vs. Unicode (UTF16) in code

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Thu Mar 08 2001 - 19:38:39 EST


Hello,

Actually, as you also implied in your email, since a UTF-16 character can be
either one two-byte entity or two two-byte entities (a surrogate pair), there
will be no significant difference between UTF-8 and UTF-16 in terms of how to
count characters and/or decide character boundaries in a string. I.e., a UTF-8
character can be a 1, 2, 3, or 4 byte entity and a UTF-16 character can be a
2 or 4 byte entity, since starting from Unicode 3.1 we will have characters
defined outside of the BMP, specifically in Planes 01, 02, and 0E.
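
To illustrate the point above, here is a minimal sketch in C of how a
character boundary can be found from the first code unit in either encoding.
The function names (utf8_seq_len, utf16_seq_len, utf8_count, utf16_count) are
hypothetical and chosen only for this example. The sketch assumes the 1 to 4
byte form of UTF-8 with lead bytes limited to C2..F4, and it does not fully
validate continuation bytes, so treat it as an outline rather than a complete
decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b.
   Returns 0 for a continuation byte or a byte that cannot start a sequence. */
static size_t utf8_seq_len(uint8_t b)
{
    if (b < 0x80) return 1;                /* ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* characters outside the BMP */
    return 0;
}

/* Length in 16-bit units of the UTF-16 sequence introduced by unit u.
   Returns 0 for a low surrogate, which cannot start a sequence. */
static size_t utf16_seq_len(uint16_t u)
{
    if (u >= 0xD800 && u <= 0xDBFF) return 2;  /* high surrogate: pair */
    if (u >= 0xDC00 && u <= 0xDFFF) return 0;  /* low surrogate alone */
    return 1;                                  /* BMP character */
}

/* Count characters in n bytes of UTF-8.
   Returns (size_t)-1 on a bad lead byte or a truncated sequence. */
static size_t utf8_count(const uint8_t *s, size_t n)
{
    size_t i = 0, count = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s[i]);
        if (len == 0 || len > n - i) return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}

/* Count characters in n 16-bit units of UTF-16.
   Returns (size_t)-1 on an unpaired surrogate. */
static size_t utf16_count(const uint16_t *s, size_t n)
{
    size_t i = 0, count = 0;
    while (i < n) {
        size_t len = utf16_seq_len(s[i]);
        if (len == 0 || len > n - i) return (size_t)-1;
        /* A high surrogate must be followed by a low surrogate. */
        if (len == 2 && (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF))
            return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}
```

Both counting loops have the same shape: read the first unit, derive the
sequence length, and skip ahead. That is the sense in which neither encoding
is simpler than the other once characters outside the BMP are in play.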

I do not know, however, whether MSFT fully supports UTF-8 as a multibyte
form in their interfaces; that can probably be answered by the MSFT folks.

Yes, you are correct that while MSFT advocates that wchar_t == Unicode,
Unix systems generally conform to the ISO/ANSI C and POSIX standards, under
which wchar_t is an opaque data type. There are a couple of exceptions to
this, though: glibc on Linux assumes that wchar_t == UCS-4, and most of the
commercial Unix systems, including Solaris, guarantee that a form of Unicode,
whether that is UTF-32 or UTF-16, will be used in wchar_t in the Unicode/UTF-8
locales.
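
As a rough sketch of how portable code might probe what a given platform
promises about wchar_t, the following C fragment checks the standard
__STDC_ISO_10646__ macro (defined by C99 implementations whose wchar_t values
are ISO 10646 code points) and the range of wchar_t. The function names
wchar_encoding_note and wchar_holds_any_code_point are hypothetical, and the
_WIN32 branch reflects only the assumption that wchar_t is a 16-bit UTF-16
code unit on Windows.

```c
#include <stdint.h>
#include <wchar.h>

/* Describe what this C implementation promises about wchar_t contents. */
static const char *wchar_encoding_note(void)
{
#if defined(__STDC_ISO_10646__)
    return "wchar_t values are ISO 10646 (Unicode) code points";
#elif defined(_WIN32)
    return "wchar_t is assumed to hold 16-bit UTF-16 code units";
#else
    return "wchar_t encoding is implementation-defined (opaque)";
#endif
}

/* Nonzero if a single wchar_t is wide enough for any code point up to
   U+10FFFF, i.e. no surrogate pairs would be needed. */
static int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

Even when the macro is absent, code that treats wchar_t as opaque and goes
through the standard conversion functions (mbstowcs, wcstombs, and so on)
avoids depending on either answer.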

With regards,

Ienup

] Date: Wed, 07 Mar 2001 16:21:27 -0800 (GMT-0800)
] From: Allan Chau <achau@rsasecurity.com>
] Subject: UTF8 vs. Unicode (UTF16) in code
] To: Unicode List <unicode@unicode.org>
] MIME-version: 1.0
]
] We've got an English-language only product which makes use of
] single-byte character strings throughout the code. For our next
] release, we'd like to internationalize it (Unicode) & be able to store
] data in UTF8 format (a requirement for data exchange).
]
] We're considering between using UTF8 within the code vs. changing our
] code to use wide characters. I'm wondering what experiences others have
] had that can help with our decision. I'm thinking that using UTF8
] internally may mean less rewriting initially, but we'd have to check
] carefully for code that makes assumptions about character boundaries.
] Because of this, I think that it'd be more complicated for developers to
] have to work with UTF8 in code. Unicode (UTF16) internally would be
] easier to manage since most characters will essentially be fixed width,
] but there'd be a lot of code to rewrite. Also, I've heard of problems
] with the wide character type (wchar_t) having different definitions
] depending on platform (we're running on NT & Sun Solaris). Many of our
] product APIs would also be affected.
]
] Can others offer their insights, suggestions?
]
] Thanks,
] -allan
]
]


