Re: UTF8 and COntrol Characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 06:32:57 EST

Next message: Peter Kirk: "Re: UTF-16 inside UTF-8"

Previous message: Philippe Verdy: "Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)"
In reply to: Abdij Bhat: "UTF8 and COntrol Characters"
Next in thread: YTang0648@aol.com: "Re: UTF8 and COntrol Characters"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
> If a UNICODE string is converted to UTF8, will the UTF8 encoded string
> contain any control characters or escape sequences? If so, is it possible
> to eliminate the same?

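The multi-byte UTF-8 sequences that encode non-ASCII characters never
contain C0 control bytes (0x00 to 0x1F), but in many cases they do contain
bytes in the C1 control range (0x80 to 0x9F), used as trail bytes. C0
controls that are already in the Unicode string stay as single bytes.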

UTF-8 keeps all 7-bit ASCII characters unchanged as single bytes, and never
uses a 7-bit ASCII byte in the encoding of a non-ASCII character (every
multi-byte UTF-8 sequence is made only of bytes >= 0x80). So UTF-8 encoding
will never create an escape sequence by itself.
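
As a minimal sketch of the byte-range property described above, the
following Python snippet prints the UTF-8 bytes of each character in a
sample string and checks that non-ASCII characters only produce bytes
>= 0x80. Python, the helper names and the sample string are assumptions
chosen for illustration; they are not part of the original question.

```python
def utf8_bytes_by_char(text):
    """Return (character, UTF-8 bytes) pairs, one per character of text."""
    return [(ch, ch.encode("utf-8")) for ch in text]


def non_ascii_bytes_are_high(text):
    """True if every byte that encodes a non-ASCII character is >= 0x80."""
    for ch, encoded in utf8_bytes_by_char(text):
        if ord(ch) >= 0x80 and any(b < 0x80 for b in encoded):
            return False
    return True


# Sample (hypothetical): ASCII letter, C0 control (TAB), Latin-1 letter,
# euro sign, and a character outside the BMP.
sample = "A\tcaf\u00e9 \u20ac \U0001F600"

for ch, encoded in utf8_bytes_by_char(sample):
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes}")

print(non_ascii_bytes_are_high(sample))
```

ASCII characters, including the TAB control, come out as the same single
byte, while every other character becomes two to four bytes that are all
outside the 7-bit range.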

But be warned that you should not create escape sequences that contain
bytes >= 0x80 after the leading escape (such bytes may conflict with a
UTF-8 decoder). If your escape sequences are made only of 7-bit ASCII
bytes, then this is safe, and you can mix plain-text ASCII, C0 controls,
escape sequences and UTF-8 sequences for non-ASCII characters.
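
To illustrate why ASCII-only escape sequences mix safely with UTF-8, here
is a small Python sketch. It assumes ANSI-style CSI sequences (ESC, "[",
parameters, one final letter) purely as an example of a 7-bit escape
syntax; the pattern and the function name `strip_escapes` are hypothetical
and your own escape format may differ.

```python
import re

# Assumption for illustration: ANSI-style CSI sequences such as ESC [ 1 m.
# Every byte of such a sequence is 7-bit ASCII.
CSI = re.compile(rb"\x1b\[[0-9;]*[A-Za-z]")


def strip_escapes(stream):
    """Remove ASCII-only escape sequences, then decode the rest as UTF-8."""
    return CSI.sub(b"", stream).decode("utf-8")


bold_on, bold_off = b"\x1b[1m", b"\x1b[0m"

# Plain ASCII, a C0 control (LF), escape sequences and a multi-byte
# UTF-8 sequence (the euro sign) mixed in one byte stream.
stream = b"prix: " + bold_on + "12 \u20ac".encode("utf-8") + bold_off + b"\n"

print(CSI.findall(stream))
print(repr(strip_escapes(stream)))
```

Because the multi-byte sequences contain no byte below 0x80, a byte-level
search for the escape sequences cannot match inside an encoded character,
and removing them leaves valid UTF-8 behind.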

Note that the C1 controls of Unicode and ISO-8859-* (U+0080 to U+009F) are
converted to a pair of bytes in UTF-8, the first byte being 0xC2 and the
second varying between 0x80 and 0x9F. So a C1 control appears in UTF-8 as a
0xC2 "prefix" followed by the same byte that encodes it in ISO-8859-*.
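
The C1 mapping can be checked with a short Python sketch. The function
name `c1_utf8_table` is made up for this example, and Python's "latin-1"
codec stands in here for the ISO-8859-* family, since those charsets share
the same C1 range.

```python
def c1_utf8_table():
    """Map each C1 control code point (U+0080..U+009F) to its UTF-8 bytes."""
    return {cp: chr(cp).encode("utf-8") for cp in range(0x80, 0xA0)}


table = c1_utf8_table()

for cp, encoded in table.items():
    # In ISO-8859-1 a C1 control is the single byte equal to its code point.
    latin1 = chr(cp).encode("latin-1")
    # In UTF-8 it is that same byte preceded by 0xC2.
    assert encoded == b"\xc2" + latin1

# Example: NEL (U+0085), shown as hexadecimal bytes.
print(table[0x85].hex())
```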


This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 07:19:36 EST