Re: UTF8 and COntrol Characters

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 13:53:57 EST

Next message: Doug Ewell: "Re: UTF-16 inside UTF-8"

Previous message: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"
Maybe in reply to: Abdij Bhat: "UTF8 and COntrol Characters"
Next in thread: Doug Ewell: "Re: UTF8 and COntrol Characters"
Reply: Doug Ewell: "Re: UTF8 and COntrol Characters"

I think verdy_p's message below is not very clear. I think I know what he
means, but the message itself needs some clarification.

In a message dated 11/5/2003 3:39:09 AM Pacific Standard Time,
verdy_p@wanadoo.fr writes:
From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
> If a UNICODE strings is converted to UTF8, will the UTF8 encoded string
> contain and control character or escape sequences? If so, is it possible to
> eliminate the same?

UTF-8 sequences will not contain any C0 control bytes,
I think you should say
"UTF-8 sequences will not use bytes in the C0 control code area (0x00-0x1F) to
represent other characters." instead of "UTF-8 sequences will not contain any C0
control bytes," because it is legal to have C0 control codes inside UTF-8. For
example, TAB, CR, and LF are all in the C0 area and are perfectly legal in UTF-8.
but it will in many
cases contain C1 control bytes (between 0x80 and 0x9F).
I think the right way to say it is "UTF-8 may use bytes 0x80 to 0x9F as part
of a multi-byte UTF-8 sequence for a single Unicode character, and those
bytes are defined as the C1 control area. Therefore, control code sequences
that use bytes 0x80 to 0x9F should not be inserted into a UTF-8 stream, but
could be inserted into a UTF-16 stream (as the two-byte code units
0x0080 - 0x009F)."
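
To make the two corrections above concrete, here is a minimal Python sketch. It relies only on the built-in str.encode; the helper name utf8_hex and the sample strings are illustrative assumptions, not something from the thread.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

# C0 controls such as TAB, CR and LF are legal in UTF-8 and stay single bytes.
print(utf8_hex("a\tb\r\n"))   # 61 09 62 0D 0A

# Bytes in the 0x80-0x9F range appear as trailing bytes of multi-byte
# sequences for ordinary (non-control) characters.
print(utf8_hex("\u00C0"))     # C3 80     (LATIN CAPITAL LETTER A WITH GRAVE)
print(utf8_hex("\u20AC"))     # E2 82 AC  (EURO SIGN)
```

So a byte such as 0x80 or 0x82 in a UTF-8 stream is normally part of a character, not a C1 control on its own.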

UTF-8 keeps all 7-bit ASCII characters unchanged and does not create any
sequence of bytes containing them for non 7-bit ASCII characters (all
sequences of UTF-8 bytes are made of bytes>=0x80). UTF-8 will then never
create any escape sequence.
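
This property can be checked exhaustively. The following is a rough Python sketch, assuming the standard "utf-8" codec; the function names are made up for illustration.

```python
def ascii_is_unchanged() -> bool:
    """Every 7-bit ASCII code point encodes to the same single byte."""
    return all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))

def non_ascii_bytes_are_high(start: int = 0x80, stop: int = 0x110000) -> bool:
    """Every byte of every non-ASCII character's encoding is >= 0x80."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded in UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True

print(ascii_is_unchanged())         # True
print(non_ascii_bytes_are_high())   # True

# Consequently the ESC byte (0x1B) never appears inside the encoding
# of non-ASCII text.
print(b"\x1b" in "Iñtërnâtiônàl".encode("utf-8"))   # False
```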

But be warned that you should not create escape sequences containing bytes
>= 0x80 after the leading escape (in this case, they may conflict with a
UTF-8 decoder). If your escape sequences are made only of 7-bit ASCII
bytes, then this is safe, and you can mix plain-text ASCII, C0 controls,
escape sequences and UTF-8 sequences for non ASCII characters.
Not only "You should not create escape sequences containing bytes
>= 0x80 after the leading escape" but also "You should not create escape
sequences containing bytes >= 0x80 as the leading escape".
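
A small Python sketch of why this matters, under the assumption that the receiver runs a strict UTF-8 decoder over the whole stream. The escape sequences shown (ESC [ 1 m style, and a lone 0x9B byte used as a single-byte introducer) are only examples, and decodes_as_utf8 is a hypothetical helper.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is accepted by a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ESC = b"\x1b"
text = "Thaï".encode("utf-8")   # contains the two-byte sequence C3 AF

# Escape sequences made only of 7-bit ASCII bytes mix safely with UTF-8.
safe_stream = ESC + b"[1m" + text + ESC + b"[0m"
print(decodes_as_utf8(safe_stream))        # True

# A byte >= 0x80 after the leading escape is not valid UTF-8 here.
bad_after_escape = ESC + b"\x9b" + text
print(decodes_as_utf8(bad_after_escape))   # False

# A byte >= 0x80 used as the leading escape itself fails the same way.
bad_leading_escape = b"\x9b1m" + text
print(decodes_as_utf8(bad_leading_escape)) # False
```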
Note that C1 controls of Unicode and ISO-8859-* will be converted to a pair
of bytes in UTF-8, with the first byte being 0xC2, and the second byte
varying between 0x80 and 0x9F (so C1 controls will appear in UTF-8 with a
0xC2 "prefix" before the same byte when encoding them with ISO-8859-*)
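
The 0xC2 prefix can be seen directly in Python. This sketch assumes Python's "latin-1" codec as a stand-in for ISO-8859-1; the variable names are illustrative.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)

# A few C1 controls: one byte in ISO-8859-1, two bytes in UTF-8.
for cp in (0x80, 0x85, 0x9B, 0x9F):
    latin1 = chr(cp).encode("latin-1")
    utf8 = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: ISO-8859-1 {hex_bytes(latin1)} -> UTF-8 {hex_bytes(utf8)}")

# The whole C1 range follows the same pattern: 0xC2 followed by the same byte.
all_prefixed = all(
    chr(cp).encode("utf-8") == b"\xc2" + bytes([cp])
    for cp in range(0x80, 0xA0)
)
print(all_prefixed)   # True
```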

==================================
Frank Yung-Fong Tang
System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
Yahoo! Msg: frankyungfongtan



This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 15:13:24 EST