C1 Control Pictures Proposal

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Sat, 13 Aug 2011 10:48:24 -0700

Greetings--hi all, I'm a new poster. I read on the unicode.org website that a good way to gauge interest and get a proposal through the process is to gather feedback and comments here before investing the time in a formal proposal, so, here goes...

This posting is to propose the addition of C1 Control Pictures to Unicode. It is being proposed by me, Sean Leonard, with the advice and +1 of Frank da Cruz.

Many years ago (in 1998), Frank da Cruz proposed a large number of additional characters for terminal emulation and the like. The proposals can be found on the web:
 ADDITIONAL CONTROL PICTURES FOR UNICODE ftp://kermit.columbia.edu/kermit/ucsterminal/control.txt
 TERMINAL GRAPHICS FOR UNICODE ftp://kermit.columbia.edu/kermit/ucsterminal/ucsterminal.txt
 HEX BYTE PICTURES FOR UNICODE ftp://kermit.columbia.edu/kermit/ucsterminal/hex.txt
The discussion is in the mailing list archive under these subject lines (1998):
 Terminal Graphics Proposal
 Terminal Graphics Draft 2

The proposal I would like to make here is much more modest: this proposal is only for the inclusion of C1 Control Pictures into the Unicode Standard. Frank explained to me that his original mega-proposals were rejected. However, I looked through the "Archive of Notices of Non-Approval" and was unable to find an explicit rejection of his proposals. In any event, if one reads through the old e-mail threads from 1998, one will find that the C1 Control Pictures subset of the proposals received a (luke)warm welcome.

RATIONALE

The Unicode code points U+0000 through U+00FF share their values with the corresponding codes in the ASCII Standard, ISO 646, ISO 6429, and ISO 8859-1. In many contexts, it is desirable to display all of these code points/characters uniquely and unambiguously. C0 Control Pictures are currently encoded in the Unicode Standard in the block starting at U+2400; that block currently covers the undisplayable code points at U+0000-U+0020 and U+007F (plus a few extra alternatives/additions). However, the undisplayable characters in U+0080-U+00FF are left out.
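To make the gap concrete, here is a minimal Python sketch of the existing mapping. The function name `control_picture` is made up for illustration; the only facts it relies on are that the pictures for U+0000-U+0020 sit at U+2400 plus the code value, and that the picture for DELETE (U+007F) is U+2421.

```python
from typing import Optional

def control_picture(cp: int) -> Optional[str]:
    """Return the existing Control Picture for a code point, or None.

    Hypothetical helper for illustration only.
    """
    if 0x00 <= cp <= 0x20:
        # NUL..US and SPACE map to U+2400..U+2420 by a fixed offset.
        return chr(0x2400 + cp)
    if cp == 0x7F:
        # DELETE has its picture at U+2421.
        return "\u2421"
    # Everything else, including all of C1, has no picture today.
    return None

# Count the C1 controls (U+0080-U+009F) that have no picture.
missing_c1 = [cp for cp in range(0x80, 0xA0) if control_picture(cp) is None]
```

With this sketch, `control_picture(0x1B)` gives U+241B (SYMBOL FOR ESCAPE), while `control_picture(0x9B)` (CSI) gives `None`, as do all 32 entries in `missing_c1`.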

There are several business cases in which C1 Control Pictures are useful:
1. Terminal emulators need them for debugging.
2. Data analyzers need a unique character for each control code, one that the graphics subsystem/text renderer treats as intended for display rather than for control effects.
3. Engineers can distinguish, when communicating, between data without side-effects (i.e., control characters as pictures) and data that invokes side-effects (i.e., control characters used as control characters).
4. There are use cases for historic or scholarly purposes, to encode and discuss these characters in text, as distinct from invoking their side-effects (and displaying nothing).
5. All values in U+0000 - U+00FF can be displayed as distinct _characters_, in the same width and font as the rest of the graphic characters, rather than in hexadecimal representation (which makes deciphering the meaning of the codes for graphic characters in the ASCII (G0) & ISO 8859-1 (G1) range very difficult).
6. In support of 1-5, font designers can design fonts that support C1 Control Pictures and that map glyphs to Unicode code points uniformly and interchangeably (two key architectural goals of the Unicode Standard). Without C1 Control Pictures, it is infeasible to provide graphical representations of the C1 Control Characters. This is an asymmetry compared to the C0 Control Pictures block in Unicode, and thus should be remedied.

Quoting from the Unicode Standard 6.0.0, sec. 16.1:
There are 65 code points [C0, C1, delete] set aside...for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework.
The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. ... [i]n the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

In accordance with this and other text in the Standard, it is not really possible to assign glyphs uniformly and interchangeably to the code points in U+0000-U+001F and U+0080-U+009F. Variation selectors (sec. 16.4), for example, "provide a mechanism for specifying a restriction on the set of glyphs that are used to represent a particular character [examples given of CJK ideographs and Mongolian letters]." Variation selectors and other Unicode-defined control code points are ill-suited to causing C1 values to be displayed, because C1 values have no "display representation" in and of themselves.

PROPOSED CHARACTERS WITH NOTES

C1 Control Pictures
Hex Name Symbol for...

80 PAD PADDING CHARACTER
Allegedly not in ISO 6429. (Need to check historical versions; other sources.)

81 HOP HIGH OCTET PRESET
Allegedly not in ISO 6429. (Need to check historical versions; other sources.)

82 BPH BREAK PERMITTED HERE

83 NBH NO BREAK HERE

84 IND INDEX
"Move the active position one line down, to eliminate ambiguity about the meaning of LF. Deprecated in 1988 and withdrawn in 1992 from ISO/IEC 6429 (1986 and 1991 respectively for ECMA-48)." (from Wikipedia)

85 NEL NEXT LINE

86 SSA START OF SELECTED AREA

87 ESA END OF SELECTED AREA

88 HTS CHARACTER TABULATION SET

89 HTJ CHARACTER TABULATION WITH JUSTIFICATION

8A VTS LINE TABULATION SET

8B PLD PARTIAL LINE DOWN

8C PLU PARTIAL LINE UP

8D RI REVERSE LINE FEED

8E SS2 SINGLE SHIFT TWO

8F SS3 SINGLE SHIFT THREE

90 DCS DEVICE CONTROL STRING

91 PU1 PRIVATE USE ONE

92 PU2 PRIVATE USE TWO

93 STS SET TRANSMIT STATE

94 CCH CANCEL CHARACTER

95 MW MESSAGE WAITING

96 SPA START OF GUARDED AREA

97 EPA END OF GUARDED AREA

98 SOS START OF STRING

99 SGCI SINGLE GRAPHIC CHARACTER INTRODUCER
Allegedly not in ISO 6429. (Need to check historical versions; other sources.)

9A SCI SINGLE CHARACTER INTRODUCER

9B CSI CONTROL SEQUENCE INTRODUCER

9C ST STRING TERMINATOR

9D OSC OPERATING SYSTEM COMMAND

9E PM PRIVACY MESSAGE

9F APC APPLICATION PROGRAM COMMAND

A0 NBSP NO-BREAK SPACE
Purpose is to show NBSP as visually distinct from SP (SPACE).

AD SHY SOFT HYPHEN
Show a hyphen (-) with SHY above or around it, similar to the glyph shown for U+00AD in the Unicode Standard document.
(SHY may be the most "controversial" character. See above for rationale--the objective is to provide visually distinct characters throughout the U+0000-U+00FF range. U+00AD is visually identical to the U+002D hyphen-minus; the only distinction is a "control" distinction, which is non-visual. Hence, the distinction should be made visually, with a distinct code point.)
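Until such pictures exist, software has to fall back on something like bracketed abbreviations. The following Python sketch illustrates that stopgap using the abbreviations from the table above. The bracket notation and the names `C1_ABBREVIATIONS`, `LABELS`, and `show_controls` are assumptions made for this example, not part of the proposal.

```python
# Abbreviations for U+0080..U+009F, in code point order, from the table above.
C1_ABBREVIATIONS = (
    "PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 "
    "DCS PU1 PU2 STS CCH MW SPA EPA SOS SGCI SCI CSI ST OSC PM APC"
).split()

# Map each code point to its abbreviation; NBSP and SHY are added by hand.
LABELS = {0x80 + i: abbr for i, abbr in enumerate(C1_ABBREVIATIONS)}
LABELS[0xA0] = "NBSP"
LABELS[0xAD] = "SHY"

def show_controls(text: str) -> str:
    """Replace each listed code point with a bracketed abbreviation.

    Illustrative fallback only: unlike a real control picture, the
    result is several characters wide.
    """
    return "".join(
        f"[{LABELS[ord(ch)]}]" if ord(ch) in LABELS else ch
        for ch in text
    )
```

For example, `show_controls("\x9b31mred\xa0text")` yields `[CSI]31mred[NBSP]text`. The output is unambiguous but no longer one display cell per character, which is exactly what single-code-point pictures would avoid.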

UNICODE CODE POINT ASSIGNMENTS

Unicode code point assignments are not explicitly advocated for in this initial, informal proposal. While it would be nice to place these codes adjacent to or in the U+2400 block, there are not enough free code points to shoehorn them all in.
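The arithmetic behind that statement can be checked with a short Python sketch. It assumes the Control Pictures block spans U+2400-U+243F, and it counts assigned code points using whatever Unicode version is bundled with the Python interpreter, so the exact number of free slots may differ from the figure for Unicode 6.0.

```python
import unicodedata

BLOCK_START, BLOCK_END = 0x2400, 0x243F  # Control Pictures block (assumed range)

def assigned_in_block() -> int:
    """Count block code points that have a name in this Python's Unicode data."""
    return sum(
        1
        for cp in range(BLOCK_START, BLOCK_END + 1)
        if unicodedata.name(chr(cp), None) is not None
    )

block_size = BLOCK_END - BLOCK_START + 1   # 64 positions in the block
needed = (0x9F - 0x80 + 1) + 2             # 32 C1 controls + NBSP + SHY = 34
free = block_size - assigned_in_block()    # unassigned positions remaining
fits = needed <= free                      # can all proposed pictures fit?
```

The proposal needs 34 code points, and `fits` comes out false because more than 30 of the 64 positions are already assigned.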

MODIFICATIONS TO THE UNICODE STANDARD

It is proposed that section 15.6, Technical Symbols, be extended to discuss both C0 and C1 controls.

-Sean Leonard
SeanTek
Received on Sat Aug 13 2011 - 12:50:23 CDT
