RE: ISO 6429 control sequences with non-ASCII CES's

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Mar 13 2007 - 18:09:41 CST

    That's an interesting point of view. Indeed, the escape sequences that
    are used in many CESs do use byte values that fall within the ASCII byte
    range (or sometimes in higher ranges).

    But there is no well-defined conversion scheme to convert those
    sequences to Unicode, except by reusing the corresponding ASCII (or
    ISO 8859) mappings: that's something that breaks proper parsing of the
    rest of the text at the character properties level.

    It would be better to have in Unicode some special ranges of control
    characters mapped to byte values that are part of unconverted CES sequences
    like in VT100, VT200 (and so on) protocols, or in other legacy terminal
    protocols (to encode colors, cursor control, or other rich text
    enhancements, or the encoding of user-defined bitmaps for custom characters
    or glyphs, notably used in some East-Asian Teletext systems, because trying
    to detect which character those bitmaps represent can be difficult, or even
    impossible, as they were really user-defined and local to the document
    containing those glyph definitions).

    Consider sequences like:
            ESC, [, A, I, R
    (in a 7-bit or 8-bit encoded document prepared and sent over media that
    support VT100-like enhancements).

    Or even this one with Videotex:
            ESC, A, I, R

    Do they contain the English word "AIR" or the abbreviation "IR" (preceded by
    an ANSI/VT100-like color attribute)? How can we delimit the length of escape
    sequences?
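
    For the ECMA-48 case at least, the delimiting rule can be sketched as
    follows. This is a minimal illustration, not a full parser: the function
    name is made up for the example, and it assumes the ECMA-48 byte ranges
    (parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, and a final byte
    0x40-0x7E after ESC [). A Videotex decoder may well apply different rules.

```python
ESC = 0x1B


def escape_sequence_length(data: bytes, start: int = 0) -> int:
    """Length of the escape sequence at data[start:], or 0 if none/incomplete.

    Hypothetical helper; assumes ECMA-48 / ISO 2022 byte ranges only.
    """
    if start >= len(data) or data[start] != ESC:
        return 0
    i = start + 1
    if i >= len(data):
        return 0
    if data[i] == 0x5B:  # ESC [ introduces a control sequence (CSI)
        i += 1
        while i < len(data) and 0x30 <= data[i] <= 0x3F:  # parameter bytes
            i += 1
        while i < len(data) and 0x20 <= data[i] <= 0x2F:  # intermediate bytes
            i += 1
        if i < len(data) and 0x40 <= data[i] <= 0x7E:  # final byte
            return i + 1 - start
        return 0
    # Other escape sequences: optional intermediates, then one final byte.
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return i + 1 - start
    return 0


for sample in (b"\x1b[AIR", b"\x1bAIR", b"\x1b[31mIR"):
    n = escape_sequence_length(sample)
    print(sample[:n], sample[n:])
```

    Under these assumptions both samples above leave "IR" as text: ESC [ A is
    a complete control sequence (cursor up in ECMA-48), and ESC A is a
    complete two-byte escape sequence. The answer is only valid if the
    receiver really follows ECMA-48, which is exactly the identification
    problem.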

    At least, with some ISO-based complex sets, we have well-defined registries
    and parsing rules for matching the length of sequences that introduce a
    codepage selection (so that, during conversion, the sequence itself can be
    filtered out, and the rest of the text be interpreted and converted to the
    appropriate code-points according to the subset mapping). But for most
    protocols, we have no such thing.
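
    As an illustration of the well-defined case, here is a small sketch using
    Python's built-in "iso2022_jp" codec (ISO-2022-JP is chosen only as a
    familiar example of an ISO 2022-based CES; the function name is
    hypothetical). The encoded bytes contain designation escape sequences,
    and the decoder consumes them instead of passing them through as text.

```python
def roundtrip_iso2022_jp(text: str) -> tuple[bytes, str]:
    """Encode to ISO-2022-JP and decode back (illustrative helper)."""
    encoded = text.encode("iso2022_jp")
    decoded = encoded.decode("iso2022_jp")
    return encoded, decoded


# "AIR" followed by HIRAGANA LETTER A, which forces a charset switch.
encoded, decoded = roundtrip_iso2022_jp("AIR \u3042")
print(encoded)            # the byte stream contains ESC designations
print("\x1b" in decoded)  # the decoded text does not
```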

    We are even lacking appropriate identification of many national terminal
    protocols (for example the BBS or Videotex systems, or newer CES used for
    DVD subtitles, or in DVB-T subtitle channels or EPG).

    This is a problem when preparing documents for later inclusion on
    multiplexed media or streams (like MPEG streams, or DVB channels for
    satellite, cable, DSL or air transmission), as they will require specific
    software and specific filters.

    Note that proper CES identification is also commonly missing in other
    widely used protocols (for example SMS messaging on mobile phones), or the
    CESs are not interoperable internationally and vary between phone
    operators, each of which needs its own custom conversion filters when
    transmitting anything other than pure ASCII (and the SMS protocols do not
    even allow defining custom bitmaps for the Unicode characters that can't
    be converted to the target CES of the recipient, according to mobile phone
    capabilities).

    Even with the same operator, there are lots of differences of implementation
    and international support between mobile phones, and for some languages that
    need extra characters, the received message is completely unreadable.

    Even though UTF-8 has progressed significantly in this area, many mobile
    phones lack the necessary built-in font support, and are unable to display
    the associated text; that's something that the mobile phone operator should
    provide for its subscribers, by allowing mobile phones to send their
    capabilities, so that the operator will send small bitmaps to define the
    missing glyphs, along with the UTF-8 encoded message. The mobile phone could
    then contain an internal cache for those "custom" glyphs sent by their
    operator (in most cases, for mobile phone usage, the glyphs do not need to
    be scalable and can be bitmaps in a single size; the device will then adapt
    its font size to the default size of those glyphs associated to characters
    present in the text).

    Another difficulty is caused by UTF-8 encoded grapheme clusters: most small
    devices are unable to implement the complex decoding and layout algorithms,
    so that's a case where an encoded grapheme cluster should be re-encodable
    as a single PUA code point, and then sent with that PUA code point and a
    glyph definition mapped to it.

    But here also, this means that the pure UTF-8 content of the text must also
    allow the inclusion of specific control sequences which are correctly
    identified, and that won't generate garbage on devices that don't know those
    sequences.

    This is not unreasonable, given that there are still lots of missing
    scripts in the Unicode standard (and glyphs associated to their sequences
    which can't be present in today's devices); it's not even easy or possible
    to upgrade their internal software, but it should be possible to support
    those encoded languages with small devices like handheld PDA with a drawing
    pen, where users can record their glyphs in the internal memory, and then
    use them for messaging over mobile networks. Here again, a protocol will
    need to be able to mix glyphs within the transmitted text, and such a
    protocol will need arbitrary byte values (unless the text is encoded with a
    rich format like XML, plus generic data compression like deflate during
    transmission). This is then no longer a plain-text format but a computer
    language with a syntax used to describe the document, even if there is no
    layout information and no rich-text information like colours.

    A clean way to avoid false parsing when handling those documents in
    intermediate gateways that use Unicode-based algorithms would be to be able
    to encode with Unicode arbitrary control sequences made of bytes, and send
    them as blind objects. The corresponding code points would no longer be
    associated with normal characters (so there would be no risk of some bytes
    being converted to unrelated ones because one CES implementation thinks it
    is safe to transform a small Latin letter a with acute into an unaccented
    small a, even when the initially encoded bytes did not have this
    incorrectly assumed semantic).
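
    The failure mode can be reproduced in a few lines. This sketch assumes a
    hypothetical gateway that maps unknown bytes to Unicode through ISO 8859-1
    and then applies accent folding; the function names and the sample bytes
    are invented for the illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks (a common 'safe' folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def through_gateway(raw: bytes) -> bytes:
    """Hypothetical gateway: bytes -> ISO 8859-1 text -> folded -> bytes."""
    as_text = raw.decode("latin-1")
    return strip_accents(as_text).encode("latin-1")


# Made-up control sequence whose parameter byte happens to be 0xE1.
raw = bytes([0x1B, 0xE1, 0x41])
print(raw, through_gateway(raw))
```

    The byte 0xE1 was never meant as "a with acute", yet it comes out as 0x61
    and the sequence no longer means what the sender encoded.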

    But then, there are other solutions than encoding those bytes in Unicode:
    * maybe this is where PUAs should be used? Note that PUAs are generally
    handled with the semantics of symbols, and not the semantics of control
    characters, so they are counted for example when computing line breaks, and
    the insertion of line breaks by an agent may break the encoded byte sequence
    needed for some origin or target protocols.
    * a transport encoding syntax or escaping mechanism can be standardized on
    top of Unicode; this is similar to the approach taken for emails with
    standard MIME codes for TES, allowing some bytes to have specific meaning,
    and requiring some bytes of a plain UTF-8 text to be transformed or escaped;
    this already allows transmitting arbitrary Unicode-encoded plain text even
    on media that have restrictions (like limited line length, reserved bytes
    for the transmission protocol itself...).
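
    The MIME-style approach in the second item can be tried with the
    quoted-printable transfer encoding from Python's standard "quopri" module.
    This is only a sketch of the idea; the wrapper names and the sample
    payload are assumptions made for the example.

```python
import quopri


def to_transport(payload: bytes) -> bytes:
    """Escape arbitrary bytes into printable ASCII (quoted-printable)."""
    return quopri.encodestring(payload)


def from_transport(wire: bytes) -> bytes:
    """Recover the original bytes on the receiving side."""
    return quopri.decodestring(wire)


# UTF-8 text followed by a raw control sequence that must survive intact.
payload = "AIR \u00e9".encode("utf-8") + b"\x1b[31mIR"
wire = to_transport(payload)
print(wire)
print(from_transport(wire) == payload)
```

    The wire form carries no ESC byte and no byte above 0x7E, so an
    intermediate agent that only understands ASCII text has nothing to
    misinterpret, at the cost of the content no longer being plain UTF-8.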

    Another option would be to encode only two new controls in Unicode:
    * start control sequence;
    * end control sequence.

    In between, the code points present would not have their default Unicode
    semantics and properties, and would have to be treated as an unbreakable
    binary-encoded object... A good question is then: what is the semantic of
    the whole sequence itself:
    * A control?
    * A rich-text enhancement?
    * A "graphic" PUA (meaning here a complete grapheme cluster) whose semantic
    is global to the document?
    * A contextual object that affects the rendering or interpretation of the
    rest of the document? Is it then safe to extract substrings from the
    document? What is the effect of even a simple truncation of the
    document to a limited length?
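
    To make the truncation question concrete, here is a minimal sketch of a
    truncation routine that treats a delimited sequence as unbreakable. No
    such start/end controls exist in Unicode today, so two PUA code points
    (U+E000 and U+E001) are used purely as hypothetical stand-ins, and the
    function name is invented for the example.

```python
START = "\uE000"  # hypothetical stand-in for "start control sequence"
END = "\uE001"    # hypothetical stand-in for "end control sequence"


def safe_truncate(text: str, limit: int) -> str:
    """Keep at most `limit` code points without cutting inside START...END."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    last_start = head.rfind(START)
    last_end = head.rfind(END)
    if last_start > last_end:
        # The cut falls inside an open sequence: drop that whole sequence.
        return head[:last_start]
    return head


doc = "AB" + START + "[31m" + END + "IR"
print(repr(safe_truncate(doc, 5)))  # cut would land inside the sequence
print(repr(safe_truncate(doc, 8)))  # cut lands just after END
```

    This keeps the blind object intact, but it does not answer the last
    question above: if the dropped or retained sequence changes the
    interpretation of the text that follows, a truncated or extracted
    substring may still mean something different from the original.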

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > behalf of Doug Ewell
    > Sent: Sunday, March 11, 2007 23:28
    > To: Unicode Mailing List
    > Subject: ISO 6429 control sequences with non-ASCII CES's
    >
    > ISO 6429 (equivalently ECMA 48, ANSI X3.64) defines terminal control
    > sequences using the control characters in the U+0000 - U+001F block.
    > Many control sequences begin with Escape (U+001B) and also include other
    > characters in the printable Basic Latin block.
    >
    > I get the impression from reading ECMA 48 that these control sequences
    > are defined directly on byte values, not character values. That means
    > they could not be used with Unicode character encoding schemes such as
    > UTF-16, UTF-7, or SCSU, which represent U+001B as something other than
    > the single byte 0x1B. It also means they *could* be used with UTF-8.
    > Is this correct?


