From: Doug Ewell (dewell@adelphia.net)
Date: Wed Nov 05 2003 - 16:39:32 EST
Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:
> I think verdy_p's message is not very clear below. I think I know what
> he means, but the message itself needs some clarification.
And I, in turn, will try to clarify the clarification. :-)
>>> If a Unicode string is converted to UTF-8, will the UTF-8 encoded
>>> string contain any control characters or escape sequences? If so, is
>>> it possible to eliminate them?
>>
>> UTF-8 sequences will not contain any C0 control bytes,
>
> I think you should say
> "UTF-8 sequences will not use the C0 control code area (0x00-0x1F) to
> represent characters" instead of "UTF-8 sequences will not contain
> any C0 control bytes," because it is legal to have C0 control codes
> inside UTF-8. For example, TAB, CR and LF are all in the C0 area and
> perfectly legal in UTF-8.
Neither is exactly right. UTF-8 does use the C0 control area, to
represent C0 control characters. What I think everyone is trying to say
is that UTF-8 does not use that area for *any other* characters, which
of course was a basic design goal of UTF-8.
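A quick Python sketch of that property, using only the standard library. The helper name c0_bytes is made up for illustration and is not part of any API:

```python
def c0_bytes(text: str) -> list:
    """Return the C0 control bytes (0x00-0x1F) in the UTF-8 encoding of text."""
    return [b for b in text.encode("utf-8") if b < 0x20]

# TAB, CR and LF are C0 control characters and encode as themselves.
assert c0_bytes("a\tb\r\n") == [0x09, 0x0D, 0x0A]

# Non-ASCII characters (U+00E9, U+20AC, U+1D11E) never produce a C0 byte.
assert c0_bytes("\u00e9\u20ac\U0001d11e") == []
```

So a C0 byte shows up in the output only where a C0 control character was in the input.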
>> but it will in many cases contain C1 control bytes (between 0x80
>> and 0x9F).
>
> I think the right way to say it is "UTF-8 may use bytes 0x80 to 0x9F
> as part of a multi-byte UTF-8 sequence for a single Unicode
> character, and those bytes are defined as the C1 control area.
> Therefore, a code sequence with bytes 0x80 to 0x9F should not be
> inserted into a UTF-8 STREAM, but could be inserted into a UTF-16
> STREAM (by using the two-byte values 0x0080 - 0x009F)."
No code sequence that is not valid UTF-8 should ever be inserted into a
UTF-8 stream anyway. I don't see the point of this wording.
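To make the byte-level picture concrete, here is a small Python sketch. The helper is_valid_utf8 is a hypothetical name for illustration, and it relies on Python's strict "utf-8" codec rejecting ill-formed input:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+00C0 encodes as C3 80: the trail byte 0x80 lies in the C1 byte range.
assert "\u00c0".encode("utf-8") == b"\xc3\x80"

# The C1 control character U+0085 (NEL) is itself encoded as two bytes.
assert "\u0085".encode("utf-8") == b"\xc2\x85"

# A lone 0x85 byte is not valid UTF-8; the two-byte form is.
assert not is_valid_utf8(b"abc\x85def")
assert is_valid_utf8(b"abc\xc2\x85def")
```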
>> UTF-8 keeps all 7-bit ASCII characters unchanged and does not create
>> any sequence of bytes containing them for non 7-bit ASCII characters
>> (all sequences of UTF-8 bytes are made of bytes>=0x80). UTF-8 will
>> then never create any escape sequence.
But it will retain any existing escape sequence consisting entirely of
[7-bit] ASCII characters.
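For instance, in the Python sketch below an ANSI-style escape sequence (ESC [ 1 m, chosen purely as an example; the original poster's sequences are not known) passes through UTF-8 encoding byte for byte, while the non-ASCII character next to it becomes a two-byte sequence. The function name encode_with_escapes is hypothetical:

```python
ESC = "\x1b"

def encode_with_escapes(body: str) -> bytes:
    """Wrap body in example ASCII-only escape sequences and encode as UTF-8."""
    return (ESC + "[1m" + body + ESC + "[0m").encode("utf-8")

encoded = encode_with_escapes("caf\u00e9")

# The escape sequences are unchanged; only U+00E9 became C3 A9.
assert encoded == b"\x1b[1mcaf\xc3\xa9\x1b[0m"

# Encoding neither added nor removed ESC bytes.
assert encoded.count(b"\x1b") == 2
```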
>> But be warned that you should not create escape sequences containing
>> bytes >= 0x80 after the leading escape (in this case, they may
>> conflict with a UTF-8 decoder). If your escape sequences are made
>> only of 7-bit ASCII bytes, then this is safe, and you can mix
>> plain-text ASCII, C0 controls, escape sequences and UTF-8 sequences
>> for non-ASCII characters.
>
> Not only "You should not create escape sequences containing bytes >=
> 0x80 after the leading escape " but also "You should not create escape
> sequences containing bytes >= 0x80 as the leading escape "
We aren't in a position to tell Abdij not to use the C1 control area.
It's his app. But for the record, he has stated that he isn't:
> Yes, the control characters are entirely below 0x20 ASCII.
so there is neither a problem nor a topic for discussion.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/