Re: UTF8 and COntrol Characters

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 16:39:32 EST

Next message: John Delacour: "Re: [OT] Voiced velar fricative"

Previous message: YTang0648@aol.com: "Re: [OT] HTML charset declarations (was: GSM and Unicode)"
In reply to: YTang0648@aol.com: "Re: UTF8 and COntrol Characters"

Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

> I think verdy_p's message is not very clear below. I think I know what
> he means, but the message itself needs some clarification.

And I, in turn, will try to clarify the clarification. :-)

>>> If a UNICODE strings is converted to UTF8, will the UTF8 encoded
>>> string contain and control character or escape sequences? If so, is
>>> it possible to eliminate the same?
>>
>> UTF-8 sequences will not contain any C0 control bytes,
>
> I think you should say
> "The UTF-8 sequences will not use the C0 control code area (0x00-0x1F) to
> represent characters. " instead of "UTF-8 sequences will not contain
> any C0 control bytes, " because it is legal to have C0 control code
> inside UTF-8, for example, TAB, CR, LF are all in c0 area and
> perfectly legal in UTF-8.

Neither is exactly right. UTF-8 does use the C0 control area, to
represent C0 control characters. What I think everyone is trying to say
is that UTF-8 does not use that area for *any other* characters, which
of course was a basic design goal of UTF-8.
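
As a minimal sketch of that design property, assuming Python's built-in
UTF-8 codec (the helper names c0_bytes and c0_chars are hypothetical,
chosen only for illustration): the C0 bytes in an encoded string are
exactly the C0 control characters that were in the text, and nothing else.

```python
def c0_bytes(text: str) -> list[int]:
    """Bytes in the range 0x00-0x1F found in the UTF-8 encoding of text."""
    return [b for b in text.encode("utf-8") if b <= 0x1F]


def c0_chars(text: str) -> list[int]:
    """Code points in the range U+0000-U+001F found in text itself."""
    return [ord(ch) for ch in text if ord(ch) <= 0x1F]


# TAB, CR and LF are C0 controls; the Greek and Han characters and the
# C1 control U+0085 are not, and none of them produces a C0 byte.
sample = "tab\there\r\n\u03c0 \u6f22\u5b57 \u0085"

assert c0_bytes(sample) == c0_chars(sample)
assert c0_bytes("\u03c0\u6f22\u5b57\u0085") == []
```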

>> but it will in many cases contain C1 control bytes (between 0x80
>> and 0x9F).
>
> I think the right way to say it is "UTF-8 may use bytes 0x80 to 0x9F
> as part of a multi-byte UTF-8 sequence for a single Unicode
> character. Those bytes are defined as the C1 control area. Therefore,
> a code sequence with 0x80 - 0x9F should not be inserted into a UTF-8
> STREAM, but could be inserted into a UTF-16 STREAM (by using two bytes,
> 0x0080 - 0x009F)."

No code sequence that is not valid UTF-8 should ever be inserted into a
UTF-8 stream anyway. I don't see the point of this wording.
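
To make the byte-level facts concrete, here is a rough sketch, again
assuming Python's built-in UTF-8 codec (c1_range_bytes and is_valid_utf8
are hypothetical helper names). Bytes 0x80-0x9F turn up as trail bytes of
ordinary characters, a real C1 control character is itself a two-byte
sequence, and a lone byte in that range is simply not valid UTF-8.

```python
def c1_range_bytes(text: str) -> list[int]:
    """Bytes in the range 0x80-0x9F found in the UTF-8 encoding of text."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+20AC EURO SIGN is not a control character, yet its encoding
# contains the byte 0x82 as a trail byte.
euro = "\u20ac"
assert euro.encode("utf-8") == b"\xe2\x82\xac"
assert c1_range_bytes(euro) == [0x82]

# The C1 control U+0085 (NEL) is encoded as two bytes, C2 85.
nel = "\u0085"
assert nel.encode("utf-8") == b"\xc2\x85"

# A bare 0x85 byte dropped into the stream is malformed UTF-8.
assert not is_valid_utf8(b"abc\x85def")
assert is_valid_utf8(b"abc\xc2\x85def")
```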

>> UTF-8 keeps all 7-bit ASCII characters unchanged and does not create
>> any sequence of bytes containing them for non 7-bit ASCII characters
>> (all sequences of UTF-8 bytes are made of bytes>=0x80). UTF-8 will
>> then never create any escape sequence.

But it will retain any existing escape sequence consisting entirely of
[7-bit] ASCII characters.
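
A small sketch of what "retain" means here, assuming Python's built-in
UTF-8 codec and using ESC [ 1 m and ESC [ 0 m purely as example escape
sequences (they are my illustration, not anything from the original
question): the ASCII bytes of the sequences pass through untouched, and
the only bytes below 0x80 in the output are the ones that were ASCII in
the input.

```python
ESC = "\x1b"

# Two all-ASCII escape sequences wrapped around a word that contains
# one non-ASCII character (U+00EF).
text = f"{ESC}[1mna\u00efve{ESC}[0m"
encoded = text.encode("utf-8")

# The escape sequences appear byte for byte in the encoded stream.
assert encoded == b"\x1b[1mna\xc3\xafve\x1b[0m"
assert encoded.startswith(b"\x1b[1m")
assert encoded.endswith(b"\x1b[0m")

# Every byte below 0x80 in the output came from an ASCII character in
# the input, in the same order; the encoder added none of its own.
ascii_out = bytes(b for b in encoded if b < 0x80)
ascii_in = "".join(ch for ch in text if ord(ch) < 0x80).encode("ascii")
assert ascii_out == ascii_in
```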

>> But be warned that you should not create escape sequences containing
>> bytes >= 0x80 after the leading escape (in this case, they may
>> conflict with a UTF-8 decoder). If your escape sequences are made
>> only of 7-bit ASCII bytes, then this is safe, and you can mix plain-
>> text ASCII, C0 controls, escape sequences and UTF-8 sequences for non
>> ASCII characters.
>
> Not only "You should not create escape sequences containing bytes >=
> 0x80 after the leading escape " but also "You should not create escape
> sequences containing bytes >= 0x80 as the leading escape "

We aren't in a position to tell Abdij not to use the C1 control area.
It's his app. But for the record, he has stated that he isn't:

> Yes, the control characters are entirely below 0x20 ASCII.

so there is neither a problem nor a topic for discussion.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

