Many of the explanations of UTF-8 discuss encoding of code points on Code
Planes 1-16 using the intermediate concept of surrogates as in UTF-16. I
believe that this is both unnecessary and misleading: UTF-8 is
fundamentally a direct 21-bit encoding scheme, as may be seen in the
attached document, so the concept of surrogates is not relevant to
UTF-8 encoding of code points on Code Planes above the BMP.
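To make the "direct 21-bit" point concrete, here is a minimal sketch in Python (not taken from the attached document; the function name utf8_encode is my own, purely for illustration). It packs the bits of a scalar value straight into one to four bytes, and the four-byte branch handles Code Planes 1-16 without ever forming a surrogate pair.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    # Surrogate code points (D800..DFFF) are not scalar values and are
    # rejected; they exist only as UTF-16 code units.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # up to 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+1D11E (MUSICAL SYMBOL G CLEF) lies on Plane 1.
print(utf8_encode(0x1D11E).hex(" "))  # f0 9d 84 9e
```

The same code point in UTF-16 would be the surrogate pair D834 DD1E, but none of those values appear anywhere in the UTF-8 computation.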
This is a slightly different explanation of how UTF-8 works, written by me
for the Ultracode(r) bar code spec (Ultracode encodes all of Unicode 3
directly). If any Unicodotti find any errors in it... please let me know!
Clive P Hohberger, PhD
VP, Technology Development
& Director of Patent Affairs
Zebra Technologies Corporation
333 Corporate Woods Parkway
Vernon Hills IL 60061-3109 USA
Voice: +1 847 793 2740
FAX: +1 847 793 5573
Cellular: +1 847 910 8794
From: Theodore H. Smith [mailto:firstname.lastname@example.org]
Sent: Wednesday, May 29, 2002 7:12 AM
Subject: How is UTF8, UTF16 and UTF32 encoded?
I need to know exactly how UTF8, UTF16 and UTF32 are encoded. I heard
that UTF32 can have surrogates, so I can't just expect its values
to be scalar values.
Having a nice, detailed and clear explanation would help, with
plenty of examples, the effects of the encoding, and all kinds of
things to make it easier to understand.
Or perhaps I'm just reacting to the confusion of the Unicode
website, and it's not that hard to understand and a simple definition
would do? But the first idea certainly wouldn't hurt.
-- Theodore H. Smith - Macintosh Consultant / Contractor. My website: <www.elfdata.com/>