RE: any unicode conversion tools?

From: Peter Constable (petercon@microsoft.com)
Date: Fri May 07 2004 - 12:36:57 CDT

Next message: E. Keown: "RE: Phoenician [ English ]"
Previous message: Peter Constable: "MSKLC (was RE: Philippe's Management of Microsoft (was: Re: Yoruba Keyboard)"
Maybe in reply to: Chan Fook Sheng: "any unicode conversion tools?"
Next in thread: Kent Karlsson: "RE: any unicode conversion tools?"
Reply: Kent Karlsson: "RE: any unicode conversion tools?"

> > UTF-8 encoded sequences can be up to 5 bytes long...
>
> How is that possible? I was under the impression that a UTF-8 sequence
> could never be more than 4 bytes (i.e. U+10FFFF becomes F4 8F BF BF).

Philippe chastised Chan for mentioning illegal sequences, but then went
on to refer to other illegal sequences himself.

UTF-8 sequences, as originally defined, could be longer than four bytes
(up to six), in order to address code points in the vast expanse of UCS-4
at U+110000..U+7FFFFFFF. Since the accepted code space has been constrained
to U+0000..U+10FFFF, only four bytes are needed. There are non-UTF-8s,
beasts that kind of look like UTF-8 but aren't, in which sequences of
varying length represent the same character and sequences of more than
four bytes appear. They are not UTF-8; those byte sequences are
considered illegal in UTF-8.
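
To make the length rule concrete, here is a minimal Python sketch. The
helper names utf8_len and is_valid_utf8 are made up for illustration, and
the "legacy" mode assumes the original definition's 31-bit ceiling of
U+7FFFFFFF (six bytes at most). The validity check simply leans on
Python's built-in strict UTF-8 decoder rather than implementing the rules
by hand.

```python
def utf8_len(cp: int, legacy: bool = False) -> int:
    """Return how many bytes a code point needs in UTF-8.

    legacy=False: current definition, limited to U+0000..U+10FFFF.
    legacy=True:  original definition (assumed ceiling U+7FFFFFFF).
    Surrogates are not special-cased here; this only counts bytes.
    """
    # Largest code point representable in 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    top = 0x7FFFFFFF if legacy else 0x10FFFF
    if not 0 <= cp <= top:
        raise ValueError(f"code point out of range: {cp:#x}")
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    raise AssertionError("unreachable")


def is_valid_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 per Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_len(0x10FFFF))                # 4
print(utf8_len(0x200000, legacy=True))   # 5
print(is_valid_utf8(b"\xF4\x8F\xBF\xBF"))      # True  (U+10FFFF)
print(is_valid_utf8(b"\xC0\x80"))              # False (two-byte form of U+0000)
print(is_valid_utf8(b"\xF8\x88\x80\x80\x80"))  # False (five-byte sequence)
```

The C0 80 case is an example of a second, longer spelling of a character
that already has a one-byte form; a strict decoder rejects it just as it
rejects anything longer than four bytes.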

Peter Constable

This archive was generated by hypermail 2.1.5 : Fri May 07 2004 - 18:45:26 CDT