RE: how to add all latin (and greek) subscripts

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jun 29 2008 - 22:56:11 CDT

    David Starner wrote:
    > But the terminal is not remotely a plain text application. It
    > already handles a wide variety of formatting, like bold and
    > italics, and there's absolutely no reason you couldn't add
    > subscript and superscript, or even full Tex-like markup.
    > Extending plain text is frequently not the right way to
    > attack a problem.

    Exactly!

    In fact, as soon as you start extending Unicode into something it is not,
    you'll immediately realize that you then need to re-encode subscript and
    superscript variants of almost all existing "normal" base characters; then
    you'll have to do the same for other font variants. For all of this, use a
    markup language.

    This just shows that the superscripts and subscripts are provided for
    compatibility only, and that without this need they should never have been
    encoded, even for plain text, where other linear notations/conventions
    would have been used instead (for example "5.1e22", commonly used instead
    of "5.1×10²²", or "10 km^2" instead of "10 km²").

    And you'll also need more superscript and subscript levels (for this,
    notations like TeX or MathML can be transported in plain text by using
    their conventional syntax). Plain text is not made to transport the text
    layout, just the basic semantics; for the rest you need some other
    convention, notation, or higher-level protocol... It is just the same with
    natural written languages and their conventional orthographies, which
    Unicode does not encode either: otherwise we would need to encode a
    separate Altaic alphabet for Turkish, a Latin alphabet for English, and
    another Latin alphabet for German with special handling of umlauts (at the
    linguistic level only) like vowels...
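
    To illustrate the point about nesting levels: a conventional linear
    syntax can carry any depth of superscripting in plain text, which no
    finite set of pre-encoded superscript characters can do. The following
    is only a minimal sketch in Python; the function name to_tex and the
    (base, exponent) tuple shape are assumptions made up for this example,
    not part of TeX or of any library.

```python
def to_tex(expr):
    """Render a nested power expression as TeX-style plain text.

    expr is either a string (an atom) or a (base, exponent) tuple.
    This input shape is a hypothetical convention for the example.
    """
    if isinstance(expr, str):
        return expr
    base, exponent = expr
    base_text = to_tex(base)
    if not isinstance(base, str):
        # Brace a compound base so the result is not a double superscript.
        base_text = "{" + base_text + "}"
    return base_text + "^{" + to_tex(exponent) + "}"


# Three levels of nesting, written with ordinary characters only.
print(to_tex(("x", ("y", "z"))))
```

    Every character in the output is a regular letter or ASCII punctuation;
    the nesting is carried entirely by the notation.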

    So there's really no end to the desire to encode contextual variants as new
    characters. As the need for variants is orthogonal to the need to support
    a large set of characters, the only safe way is effectively not to encode
    contextual variants, as far as possible, but only the common abstract
    characters, and to decide that layout and style information is not part of
    the standard and will require another, higher-level protocol.

    As a general rule, if two uses of some characters carry the same visual
    value and interpretation when seen outside the context in which they
    appear, and if they obey the same composition rules in arbitrary layouts,
    then they have to share the same encoding as abstract characters, even if
    they have several distinct contextual realizations. Superscripts and
    subscripts, for example, are no different from normal script when seen in
    isolation: there's just a difference in default size or position, but
    text size and position are not encoded in any character itself, and they
    remain readable and meaningful even in this context.

    The layout may add information of its own, independently of the
    context-neutral semantics of the plain-text characters that it augments.
    If you convert a text with layout to plain text and completely drop the
    layout information without converting it to some notation, this is where
    you may lose or change the semantics. For example, when converting "10²"
    to "102": this is not the fault of Unicode, it's just your fault for not
    introducing and conveying some alternative notation like "10^2", and for
    not making explicit in your plain-text conventions that this notation is
    used, or specifying it as meta-information transmitted in parallel with
    the text itself.
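
    The "10²" to "102" loss is easy to reproduce: it is exactly what Unicode
    compatibility normalization (NFKC) does to the superscript digits. The
    Python sketch below contrasts that with a conversion to a caret/underscore
    notation. The function names and the choice of "^" and "_" are assumptions
    for illustration only, and the set of characters handled is deliberately
    limited to digits and a few signs, so that modifier letters are left
    alone.

```python
import unicodedata

# Base characters we accept as "real" super/subscripts in this sketch:
# ASCII digits, plus, equals, parentheses, and U+2212 MINUS SIGN.
ALLOWED_BASES = set("0123456789+=()\u2212")


def compat_kind(ch):
    """Return ("super" or "sub", base_char) for a handled character, else None."""
    parts = unicodedata.decomposition(ch).split()  # "²" gives ["<super>", "0032"]
    if len(parts) == 2 and parts[0] in ("<super>", "<sub>"):
        base = chr(int(parts[1], 16))
        if base in ALLOWED_BASES:
            return parts[0].strip("<>"), base
    return None


def to_linear(text):
    """Replace runs of super/subscript characters by ^... or _... notation."""
    out = []
    prev = None  # kind of the run we are currently inside, if any
    for ch in text:
        info = compat_kind(ch)
        if info is None:
            out.append(ch)
            prev = None
            continue
        kind, base = info
        if kind != prev:
            out.append("^" if kind == "super" else "_")
        out.append(base)
        prev = kind
    return "".join(out)


print(unicodedata.normalize("NFKC", "10²"))  # the layout is silently dropped
print(to_linear("10²"))                      # the layout is kept as notation
```

    This simple convention does not mark where a run ends, so text that
    continues with ordinary digits right after an exponent would be
    ambiguous; a real convention would need a delimiter, and would have to
    be stated alongside the text as described above.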

    (Note that the encoded modifier letters and IPA symbols are NOT true
    superscripts, as they are really meant as distinctive elements where the
    choice of the borrowed letter is quite arbitrary: they can't be used to
    write arbitrary words in the Latin alphabet, for example, and they are not
    necessarily designed to line up properly on their superscript baseline.)
    To write regular words or even full sentences in superscript, use some
    conventional notation (like punctuation) or layout
    structure/syntax/protocol, but encode the words themselves using the
    regular letters and everyone will be happy.



    This archive was generated by hypermail 2.1.5 : Mon Jun 30 2008 - 10:08:24 CDT