Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

From: Joel Rees (rees@server.mediafusion.co.jp)
Date: Wed Feb 21 2001 - 23:03:50 EST


Hi, William,

I have to admit that I really haven't looked carefully at your
transformation techniques and their intended purpose. But it strikes me that
you might be re-inventing the wheel. A number of schemes exist for squeezing
wide bit patterns into narrow bit streams. UTF-8 has been adopted by UNICODE
for squeezing UNICODE into eight-bit streams. UTF-7 is a proposal for squeezing
UNICODE into 7-bit streams. I strongly urge you to examine both before you
finalize your code.
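
To give a rough idea of what UTF-8 does, here is a minimal sketch in Python of how one code point is spread across one to four 8-bit units. The function name utf8_encode is made up for illustration, and the sketch assumes the code point range ends at U+10FFFF with the surrogate values excluded. It is not taken from any reference implementation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative sketch only)."""
    # Assumption: valid input is 0..0x10FFFF, excluding the surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:
        # 7 bits fit in one byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Every continuation byte starts with the bits 10, so a decoder can find character boundaries again without keeping any state between characters.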

Explanations of UTF-8 are on the UNICODE site (somewhere), but you may need
to look up UTF-7 via google.com or another search site. I assume that you
have already examined the "quoted printable" and "base 64" techniques, since
the state machine you describe seems to bear their influence.
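
For a quick comparison of these 7-bit-safe forms, the following sketch runs one string through UTF-7, base 64, and quoted printable using the codecs in the Python standard library. The helper name seven_bit_forms is hypothetical, and applying base 64 and quoted printable to UTF-8 bytes is just one choice made for the example.

```python
import base64
import quopri


def seven_bit_forms(text: str) -> dict:
    """Return several 7-bit-safe encodings of text (illustrative sketch)."""
    raw = text.encode("utf-8")  # assumption: UTF-8 is the underlying byte form
    return {
        "utf-7": text.encode("utf-7"),
        "base64": base64.b64encode(raw),
        "quoted-printable": quopri.encodestring(raw),
    }


def decode_forms(forms: dict) -> dict:
    """Reverse each encoding produced by seven_bit_forms."""
    return {
        "utf-7": forms["utf-7"].decode("utf-7"),
        "base64": base64.b64decode(forms["base64"]).decode("utf-8"),
        "quoted-printable":
            quopri.decodestring(forms["quoted-printable"]).decode("utf-8"),
    }
```

Calling decode_forms(seven_bit_forms(s)) should give back the original string under each key, which makes it a handy check against a hand-written state machine.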

I'm glad my quick description helped. You may also want to check your code
against the example Java (I think) source for handling surrogate pairs
available either on the UNICODE site or the ISO site for ISO/IEC 10646. I
should have mentioned that in the earlier post, and I apologize.
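
In case it is useful as a cross-check, here is a minimal sketch of the surrogate pair arithmetic in Python. This is not the example source mentioned above, and the function names to_surrogates and from_surrogates are invented for illustration.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                 # 20 bits remain after the offset
    high = 0xD800 | (v >> 10)        # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogates(0x10400) gives (0xD801, 0xDC00), and from_surrogates turns that pair back into 0x10400.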

Joel Rees, Media Fusion KK
Amagasaki, Japan


