From: Kenneth Whistler (email@example.com)
Date: Wed Jan 16 2008 - 20:47:31 CST
> > That's debatable. For transcoding obscure character encodings,
> > there really is no requirement that you have one-to-one
> > mappings for every character. You can certainly represent
> > the subscript 10 in GOST 10859 with <2081, 2080> in Unicode
> > and convert it back losslessly with no problem.
> Lossless conversion is fine, but I'm interested in a portable exact
> representation of a GOST printout.
Or more precisely, apparently what you are after is portable,
exact *plain text* representation of a GOST printout.
If you just wanted a portable exact representation of a GOST
printout, pdf should do just fine.
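For what it's worth, the lossless round trip described above works because the converter reserves the two-character sequence for the GOST character in its mapping table. A minimal sketch in Python; note that the GOST 10859 code position 0x33 used here for the subscript 10 is a placeholder assumption, not the actual value from the standard:

```python
# Sketch of a lossless GOST 10859 <-> Unicode round trip for the
# subscript-10 character.  The GOST code position 0x33 is a
# placeholder assumption, not the real position from the standard.
GOST_SUB10 = 0x33                      # hypothetical code position
UNI_SUB10 = "\u2081\u2080"             # SUBSCRIPT ONE, SUBSCRIPT ZERO

def gost_to_unicode(codes):
    out = []
    for c in codes:
        if c == GOST_SUB10:
            out.append(UNI_SUB10)      # one GOST code -> two Unicode chars
        else:
            out.append(chr(c))         # identity mapping, for the sketch only
    return "".join(out)

def unicode_to_gost(text):
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] == UNI_SUB10: # recognize the sequence first
            out.append(GOST_SUB10)
            i += 2
        else:
            out.append(ord(text[i]))
            i += 1
    return out

codes = [0x41, GOST_SUB10]             # e.g. "A" followed by subscript 10
assert unicode_to_gost(gost_to_unicode(codes)) == codes
```

Any such converter is lossless as long as <2081, 2080> cannot arise from some other GOST code, which holds here because the rest of the mapping is one-to-one.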
> I would not object to a rich text approach if there was a way to do
> it, e.g. if something like
> <halfwidth>₁₀</halfwidth> existed and could do the job.
Well, HTML is a pretty lousy format for "exact representation"
anyway. You can't even really control the font people are
going to display your page in.
And <halfwidth> hacks, even if they existed, wouldn't really do
the job, either, because they would only make sense in an East
Asian context, contrasted with <fullwidth>. All of the characters
you would be using would *be* halfwidth, anyway, as contrasted
with the fullwidth FF11 and FF10, for example.
> > > What should an emulator of a computer that used GOST 10859 or ALCOR
> > > produce, then?
> > For an emulator you would have various options, including
> > mapping of the sequence <2081, 2080> to your fixed-width
> > ACPU-128 drum printer font glyph for a subscript 10. Or,
> > if your emulator is making one-to-one character to glyph
> > assumptions, then you use a PUA value to stand in for the
> > sequence, and map *that* to your fixed-width glyph.
> Correct me if I'm wrong, but AFAIK the ways to attach private glyphs
> to network documents are not standardized nor widely supported yet.
No, you're not wrong about that. But I was responding to the
question about an *emulator*, where I assumed you had software
running that controls its own fonts.
If you mean by an "emulator" something that just spits out
HTML pages and posts them for viewing on the web, expecting
the results to look exactly as if printed on an ACPU-128 drum
printer, then my inclination would be to go with pdf output.
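The PUA stand-in approach I mentioned could be sketched like this; the particular code point U+E010 is an arbitrary choice for illustration, not anything sanctioned:

```python
# Sketch of the PUA stand-in approach for a one-glyph-per-cell emulator:
# the two-character sequence <2081, 2080> is folded into a single
# Private Use Area code point, which the emulator's font then maps to
# its drum-printer subscript-10 glyph.  U+E010 is an arbitrary choice.
PUA_SUB10 = "\uE010"
SEQ_SUB10 = "\u2081\u2080"

def to_emulator_form(text):
    """Replace the interchange sequence with the PUA stand-in."""
    return text.replace(SEQ_SUB10, PUA_SUB10)

def to_interchange_form(text):
    """Map the PUA stand-in back to standard Unicode for export."""
    return text.replace(PUA_SUB10, SEQ_SUB10)

s = "77\u2081\u2080"                   # "77" with a subscript 10
assert to_interchange_form(to_emulator_form(s)) == s
assert len(to_emulator_form(s)) == 3   # one printer cell per character
```

Since the PUA value never leaves the emulator, the usual caveat about interchanging PUA text does not bite.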
> > However, justification in terms of emulation of long unused
> > character sets and computing machinery isn't a very strong
> > case, since emulation software is *software*, after all, and
> > always has plenty of options to deal with such problems
> > creatively, as long as all the component pieces needed for
> > character representation are present in Unicode.
> Typesetting software has too, but that did not seem to stop people
> from requesting and acquiring separate codepoints for monospaced
> letters and digits
> (U+1D670 - U+1D6A3, U+1D7F6 - U+1D7FF).
I think you may mistake the function of those. Nobody suggests
those should be used for typography. They are there essentially
for mathematical software that needs to be able to make semantic
distinctions for styled variables, without carrying around font
and style tags.
> If we're to follow the spirit of UTN28, we should add a mathematical decimal
> exponent base character at least to allow for the unambiguous
> scientific representation of reals
> in math texts. What does 1.5e+3 without a U+2062 (invisible times)
> before 'e' really mean? 1500 or 7.077?
I'm not sure that's relevant to a request to encode a (visible)
subscript 10.
For that matter, what does '10' really mean? Is it two or ten or
sixteen? In my line of work I never really know without context.
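Incidentally, the two readings of '1.5e+3' quoted above do both work out numerically; a quick check, treating 'e' as Euler's number in the second reading:

```python
import math

# Scientific-notation reading: 1.5 x 10^3
assert float("1.5e+3") == 1500.0

# "Invisible times" reading: 1.5 x e + 3
assert round(1.5 * math.e + 3, 3) == 7.077
```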
> Subscripts after numbers already have a different meaning to indicate
> the base of the numeral system.
And again, I'm not seeing the relevance of that to the encoding
request. Sure, they mean that, but they can mean other things
in math as well, and they get used in lots of orthographic conventions
simply for indicating indexes on items.
> Does it look more convincing now?
1. <2081, 2080> available in plain text simply to represent the content.
2. <sub>10</sub> available in many markup languages.
And the problem is that neither of those works, in plain text or
in HTML pages, to get the monospace layout you want for this
application.
But I'm certainly not convinced that Unicode has to solve the monospace
layout problem for plain text.
And the lack of character-by-character aspect and monospacing control
in "light" markup like HTML isn't really Unicode's problem, either.
The thing that would be convincing for me, personally (although I
don't speak for everybody on this list, obviously), is if I felt
there was an interoperability issue for working with the GOST 10859
standard that required introduction of a compatibility character
for one-to-one mapping. But it is hard to make such an interoperability
argument for essentially dead encodings. It is much easier to
make the case for widespread encodings that everybody has to
implement, like GB 18030, which has various thingums in it that
would otherwise not likely have been encoded in Unicode.
Let me give you another example: The North Korean character
encoding standard, KPS 9566-97, contains in it, among other
things, 3 characters spelling out KIM JONG IL in a special,
bolded font, and another 3 characters spelling out KIM IL SUNG
in that same font. Now if I was writing an "emulator" for
North Korean hardware using that character set, I could have
a problem, because the UTC (and WG2) declined to add those
6 characters to Unicode and ISO/IEC 10646. Now in that case,
for web pages, you could use the regular Hangul syllable
codes for "kim" "jong" "il" and so forth, and use <b></b>
markup on them, to get close. But if you are looking for
"exact representation", this might not be what you are after,
because there is no guarantee that simply bolding the
Hangul font on your machine has the same effect as the
emphasis for the 6 characters in question in the
KPS 9566-97 standard.
Now granted that case isn't as intractable as what you are
dealing with, because it doesn't involve inability to line
up columns in monospace printout.
But I think it illustrates another instance of appropriate
skepticism at this point about simply encoding compatibility
characters in Unicode for every character in every obscure
historic character encoding that people dig up.
I think you would need to answer that skepticism to get the
UTC on your side for encoding a subscript 10 as a single
character.
On the other hand, there is so much compatibility dreck in
the standard already, maybe nobody would even notice. ;-)