Re: Character proposal: SUBSCRIPT TEN

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 16 2008 - 20:47:31 CST


    > > That's debatable. For transcoding obscure character encodings,
    > > there really is no requirement that you have one-to-one
    > > mappings for every character. You can certainly represent
    > > the subscript 10 in GOST 10859 with <2081, 2080> in Unicode
    > > and convert it back losslessly with no problem.
    >
    > Lossless conversion is fine, but I'm interested in a portable exact
    > representation of a GOST printout.

    Or more precisely, apparently what you are after is portable,
    exact *plain text* representation of a GOST printout.

    If you just wanted a portable exact representation of a GOST
    printout, PDF should do just fine.

    > I would not object to a rich text approach if there was a way to do
    > it, e.g. if something like
    > <halfwidth>&#x2081;&#x2080;</halfwidth> existed and could do the job.

    Well, HTML is a pretty lousy format for "exact representation"
    anyway. You can't even really control the font people are
    going to display your page in.

    And <halfwidth> hacks, even if they existed, wouldn't really do
    the job, either, because they would only make sense in an East
    Asian context, contrasted with <fullwidth>. All of the characters
    you would be using would *be* halfwidth, anyway, as contrasted
    with the fullwidth FF11 and FF10, for example.

    > > > What should an emulator of a computer that used GOST 10859 or ALCOR
    > > > produce, then?
    > >
    > > For an emulator you would have various options, including
    > > mapping of the sequence <2081, 2080> to your fixed-width
    > > ACPU-128 drum printer font glyph for a subscript 10. Or,
    > > if your emulator is making one-to-one character to glyph
    > > assumptions, then you use a PUA value to stand in for the
    > > sequence, and map *that* to your fixed-width glyph.
    >
    > Correct me if I'm wrong, but AFAIK the ways to attach private glyphs
    > to network documents are not standardized nor widely supported yet.

    No, you're not wrong about that. But I was responding to the
    question about an *emulator*, where I assumed you had software
    running that controls its own fonts.

    If you mean by an "emulator" something that just spits out
    HTML pages and posts them for viewing on the web, expecting
    the results to look exactly as if printed on an ACPU-128 drum
    printer, then my inclination would be to go with PDF output
    instead. ;-)
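
    A minimal sketch of the two emulator options discussed above: a
    lossless round trip through the sequence <2081, 2080>, and a PUA
    stand-in for software that assumes one character per glyph. The
    GOST code value 0x10, the PUA code point U+E000, and the
    digits-only table are assumptions for illustration, not values
    taken from this thread. The round trip also assumes the source
    text never uses a subscript one followed by a subscript zero for
    anything else.

```python
# Hypothetical GOST 10859 code for the subscript ten; check the standard's
# code table before relying on this value.
GOST_SUB10 = 0x10

SUB10_SEQ = "\u2081\u2080"  # SUBSCRIPT ONE + SUBSCRIPT ZERO
SUB10_PUA = "\uE000"        # arbitrary Private Use Area stand-in (assumption)

# Toy table: digits 0-9 plus the subscript ten only.
TO_UNICODE = {code: str(code) for code in range(10)}
TO_UNICODE[GOST_SUB10] = SUB10_SEQ
FROM_UNICODE = {text: code for code, text in TO_UNICODE.items()}


def decode(codes):
    """GOST codes -> Unicode text; the subscript ten becomes <2081, 2080>."""
    return "".join(TO_UNICODE[code] for code in codes)


def encode(text):
    """Unicode text -> GOST codes, folding <2081, 2080> back into one code."""
    codes, i = [], 0
    while i < len(text):
        step = 2 if text.startswith(SUB10_SEQ, i) else 1
        codes.append(FROM_UNICODE[text[i:i + step]])
        i += step
    return codes


def to_glyph_cells(text):
    """Swap the sequence for one PUA character: one character per print cell."""
    return text.replace(SUB10_SEQ, SUB10_PUA)


printout = [1, 5, GOST_SUB10, 3]
text = decode(printout)
assert encode(text) == printout        # the round trip is lossless
assert len(text) == 5                  # five characters in plain text
assert len(to_glyph_cells(text)) == 4  # four cells for the emulator's font
```

    The PUA form is only meaningful inside software that ships its own
    font; the sequence form is what would be interchanged.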

    > > However, justification in terms of emulation of long unused
    > > character sets and computing machinery isn't a very strong
    > > case, since emulation software is *software*, after all, and
    > > always has plenty of options to deal with such problems
    > > creatively, as long as all the component pieces needed for
    > > character representation are present in Unicode.
    >
    > Typesetting software has too, but that did not seem to stop people
    > from requesting and acquiring separate codepoints for monospaced
    > letters and digits
    > (U+1D670 - U+1D6A3, U+1D7F6 - U+1D7FF).

    I think you may mistake the function of those. Nobody suggests
    those should be used for typography. They are there essentially
    for mathematical software that needs to be able to make semantic
    distinctions for styled variables, without carrying around font
    and style tags.
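
    A small sketch of that distinction, using Python's standard
    unicodedata module. The helper name fold is made up for this
    example. The monospace letter is a different character from the
    plain letter in plain text, and compatibility normalization (NFKC)
    folds it back to the plain one, just as it folds the subscript
    sequence to "10".

```python
import unicodedata


def fold(text):
    """Apply NFKC, which removes <font> and <sub> compatibility distinctions."""
    return unicodedata.normalize("NFKC", text)


mono_a = "\U0001D670"      # first code point in the range cited above
mono_zero = "\U0001D7F6"   # first code point of the monospace digits
sub_ten = "\u2081\u2080"   # SUBSCRIPT ONE + SUBSCRIPT ZERO

print(unicodedata.name(mono_a))

assert mono_a != "A"          # distinct in plain text, no style tags needed
assert fold(mono_a) == "A"    # the distinction is a compatibility one
assert fold(mono_zero) == "0"
assert fold(sub_ten) == "10"
```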

    > If we're to follow the spirit of UTN28, we should add a mathematical decimal
    > exponent base character at least to allow for the unambiguous
    > scientific representation of reals
    > in math texts. What does 1.5e+3 without a U+2062 (invisible times)
    > before 'e' really mean? 1500 or 7.077?

    I'm not sure that's relevant to a request to encode a (visible)
    subscript 10.

    For that matter, what does '10' really mean? Is it two or ten or
    sixteen? In my line of work I never really know without context.
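
    The two readings quoted above, and the three readings of '10', can
    be checked with a few lines of Python. This is only a worked
    version of the arithmetic; the variable names are illustrative.

```python
import math

# Reading 1: "1.5e+3" as a floating-point literal, i.e. 1.5 * 10**3.
as_literal = float("1.5e+3")

# Reading 2: 1.5 times Euler's number, plus 3.
as_product = 1.5 * math.e + 3

# The digits "10" read in base two, ten, and sixteen.
readings_of_10 = {base: int("10", base) for base in (2, 10, 16)}

print(as_literal, round(as_product, 3), readings_of_10)
```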

    > Subscripts after numbers already have a different meaning to indicate
    > the base of the numeral system.

    And again, I'm not seeing the relevance of that to the encoding
    request. Sure, they mean that, but they can mean other things
    in math as well, and they get used in lots of orthographic conventions
    simply for indicating indexes on items.

    > Does it look more convincing now?

    Not yet.

    You have:

    1. <2081, 2080> available in plain text simply to represent the content.

    2. <sub>10</sub> available in many markup languages.

    And the problem is that neither of those works, in plain text or
    in HTML pages, to get the monospace layout you want for this
    application.

    But I'm certainly not convinced that Unicode has to solve the monospace
    layout problem for plain text.

    And the lack of character-by-character aspect and monospacing control
    in "light" markup like HTML isn't really Unicode's problem, either.

    The thing that would be convincing for me, personally (although I
    don't speak for everybody on this list, obviously), is if I felt
    there was an interoperability issue for working with the GOST 10859
    standard that required introduction of a compatibility character
    for one-to-one mapping. But it is hard to make such an interoperability
    argument for essentially dead encodings. It is much easier to
    make the case for widespread encodings that everybody has to
    implement, like GB 18030, which has various thingums in it that
    would otherwise not likely have been encoded in Unicode.

    Let me give you another example: The North Korean character
    encoding standard, KPS 9566-97, contains in it, among other
    things, 3 characters spelling out KIM JONG IL in a special,
    bolded font, and another 3 characters spelling out KIM IL SUNG
    in that same font. Now if I was writing an "emulator" for
    North Korean hardware using that character set, I could have
    a problem, because the UTC (and WG2) declined to add those
    6 characters to Unicode and ISO/IEC 10646. Now in that case,
    for web pages, you could use the regular Hangul syllable
    codes for "kim" "jong" "il" and so forth, and use <b></b>
    markup on them, to get close. But if you are looking for
    "exact representation", this might not be what you are after,
    because there is no guarantee that simply bolding the
    Hangul font on your machine has the same effect as the
    emphasis for the 6 characters in question in the
    KPS 9566-97 standard.

    Now granted that case isn't as intractable as what you are
    dealing with, because it doesn't involve inability to line
    up columns in monospace printout.

    But I think it illustrates another instance of appropriate
    skepticism at this point about simply encoding compatibility
    characters in Unicode for every character in every obscure
    historic character encoding that people dig up.

    I think you would need to answer that skepticism to get the
    UTC on your side for encoding a subscript 10 as a single
    character.

    On the other hand, there is so much compatibility dreck in
    the standard already, maybe nobody would even notice. ;-)

    --Ken


