RE: Unicode control characters

From: François Yergeau (yergeau@alis.com)
Date: Wed Nov 10 1999 - 13:59:37 EST


> From: Markus Kuhn
> Date: Wednesday, November 10, 1999 10:25
>
> MIME text/plain body parts are clearly preformatted CR LF
> separated lines of printable characters, and UTF-8 really should not
> change anything here.

Yes. In MIME text a line end is CRLF, period.
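
As a rough illustration of the CRLF rule, a sender could canonicalize line
ends before encoding the body. This is only a sketch in Python; the helper
name to_mime_crlf is made up for this example and does not come from any
MIME library.

```python
import re

# Matches an existing CRLF, a bare CR, or a bare LF (CRLF is tried first so
# that it is not treated as two separate line ends).
_LINE_END = re.compile(r"\r\n|\r|\n")

def to_mime_crlf(text: str) -> bytes:
    """Hypothetical helper: canonicalize every line end to CRLF, then encode as UTF-8."""
    return _LINE_END.sub("\r\n", text).encode("utf-8")

# Mixed line ends in, uniform CRLF out; non-ASCII text is left untouched.
body = to_mime_crlf("caf\u00e9\nna\u00efve\r\nend\r")
```

Only the line ends are rewritten; the UTF-8 encoding step does not change
what counts as a line break.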

> Maybe it would indeed be a wise idea to supplement RFC 2279 with an
> additional spec that clarifies which Unicode characters are allowed to
> be used in a "text/plain; charset=UTF-8" MIME part. It seems like a
> good idea to explicitly exclude everything in the Cf category. This
> includes the following entries from the Unicode database:

I think this would be very unwise.

> 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
> 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
> 200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
> 200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
> 202A;LEFT-TO-RIGHT EMBEDDING;Cf;0;LRE;;;;;N;;;;;
> 202B;RIGHT-TO-LEFT EMBEDDING;Cf;0;RLE;;;;;N;;;;;
> 202C;POP DIRECTIONAL FORMATTING;Cf;0;PDF;;;;;N;;;;;
> 202D;LEFT-TO-RIGHT OVERRIDE;Cf;0;LRO;;;;;N;;;;;
> 202E;RIGHT-TO-LEFT OVERRIDE;Cf;0;RLO;;;;;N;;;;;

These are required for minimal, understandable encoding of various languages
(mainly bidi scripts, but the ZWJ and ZWNJ are also necessary for the Indic
scripts, at least). These are NOT gadgets introduced by the high-flying
Unicode folks solely for fancy GUI settings; they are there to meet plain
text requirements. The legacy encodings for those languages that
10646/Unicode integrated (ASMO, ISCII, etc.) had similar controls, by
necessity.
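
To make the cost of a blanket Cf exclusion concrete, here is a small Python
sketch. The function strip_cf is a hypothetical stand-in for the proposed
rule, and the sample word (a Persian word spelled with ZWNJ) is my own
choice of illustration, not something taken from the thread. Note that
Python's unicodedata module reflects a much later Unicode version than 3.0.

```python
import unicodedata

def strip_cf(text: str) -> str:
    """Hypothetical filter: drop every character whose General Category is Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# A Persian word spelled with ZWNJ (U+200C) between two letters that would
# otherwise be joined by the shaping rules.
word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
stripped = strip_cf(word)

# Report what the blanket rule would throw away.
for ch in word:
    if unicodedata.category(ch) == "Cf":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} would be removed")
```

After filtering, the two letters around the removed ZWNJ become adjacent and
are free to join, so the text no longer says what the author wrote.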

> 206A;INHIBIT SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
> 206B;ACTIVATE SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;;
> 206C;INHIBIT ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
> 206D;ACTIVATE ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;;
> 206E;NATIONAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;
> 206F;NOMINAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;;

These are crap. The Unicode Standard 2.0 says that their use is
"*strongly* discouraged". Note, however, that at least the NATIONAL
DIGIT SHAPES stuff does come, AFAIK, from old encodings used in terminal
applications, not fancy GUI stuff.
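
If one wanted to single out just these six characters instead of all of Cf,
a narrow check on the range U+206A..U+206F would do. The following Python
sketch assumes that range is the whole set of interest; the function name
find_discouraged is invented for this example.

```python
import unicodedata

# U+206A..U+206F: the six format characters quoted above.
DISCOURAGED = range(0x206A, 0x2070)

def find_discouraged(text: str) -> list[tuple[int, str]]:
    """Hypothetical check: return (index, character name) for each such character."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in DISCOURAGED
    ]

hits = find_discouraged("a\u206eb")
```

A check like this leaves the bidi marks and the joiners alone, which is the
distinction being argued for here.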

> FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;

Ah, the infamous BOM! It has a valid use in plain text: to indicate a
place where a word break should not occur. A bit fancy for email, but plain
text has other uses than email.
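
The two roles of U+FEFF can be seen side by side in a short Python sketch.
It relies only on Python's standard "utf-8" and "utf-8-sig" codecs; the
sample string is arbitrary and chosen just for illustration.

```python
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark

# In UTF-8 the character is the three-byte sequence EF BB BF.
encoded = ZWNBSP.encode("utf-8")

# One U+FEFF at the very start (signature position), one in the middle of the text.
data = (ZWNBSP + "foo" + ZWNBSP + "bar").encode("utf-8")

as_text = data.decode("utf-8")      # plain decoder keeps both occurrences
as_sig = data.decode("utf-8-sig")   # signature-aware decoder drops only the leading one
```

The occurrence in the middle of the text survives both decoders, so a rule
that bans U+FEFF everywhere would also remove the no-break use.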

> FFF9;INTERLINEAR ANNOTATION ANCHOR;Cf;0;BN;;;;;N;;;;;
> FFFA;INTERLINEAR ANNOTATION SEPARATOR;Cf;0;BN;;;;;N;;;;;
> FFFB;INTERLINEAR ANNOTATION TERMINATOR;Cf;0;BN;;;;;N;;;;;

These are specifically defined not to be used in plain text. They are meant
to be used, for instance, in a word processor that keeps text and "markup"
(formatting info) in separate memory structures. The markup can use those
beasties as place holders in the text, to indicate where the markup applies.

> 070F;SYRIAC ABBREVIATION MARK;Cf;0;BN;;;;;N;;;;;
> 180B;MONGOLIAN FREE VARIATION SELECTOR ONE;Cf;0;BN;;;;;N;;;;;
> 180C;MONGOLIAN FREE VARIATION SELECTOR TWO;Cf;0;BN;;;;;N;;;;;
> 180D;MONGOLIAN FREE VARIATION SELECTOR THREE;Cf;0;BN;;;;;N;;;;;
> 180E;MONGOLIAN VOWEL SEPARATOR;Cf;0;BN;;;;;N;;;;;
>
> which I am not sure about what they are good for (seems to be new in
> 3.0).

The scripts are new to 3.0. I'm quite sure those guys were not introduced
lightly and have a real requirement in plain text for the relevant scripts.


