RE: ASCII fallbacks for Unicode characters

From: Addison Phillips (AddisonP@simultrans.com)
Date: Wed Aug 18 1999 - 16:22:19 EDT


Of course you could adopt the already-extant SGML entities for many of
these. At least there are a lot of programs that can read these and render
them in the local character set (if available).
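
As a rough sketch of that idea (not any particular program's behaviour),
the following uses the HTML 4 entity names shipped in Python's
html.entities module as a stand-in for the SGML entity sets. That
stand-in is an assumption, and to_entities is a hypothetical helper
name. Characters without a named entity fall back to a numeric
character reference.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace every non-ASCII character by a named entity if one is
    known, otherwise by a decimal numeric character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                      # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#%d;" % cp)            # no named entity available
    return "".join(out)

print(to_entities("caf\u00e9 \u2026 \u20ac"))
```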

Just an idea.

Addison
        __________________________________________

        Addison Phillips
        Director, Globalization Consulting
        SimulTrans, L.L.C.

        AddisonP@simultrans.com (Internet email)
        http://www.simultrans.com (website)

        "22 languages. One release date."
        __________________________________________

-----Original Message-----
From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
Sent: Wednesday, August 18, 1999 12:50 PM
To: Unicode List
Subject: ASCII fallbacks for Unicode characters

We often use ASCII characters or sequences of ASCII characters to
represent characters for which Unicode actually has a proper character
available. For many applications, it would be nice if we could use the
proper Unicode characters for those who can adequately read them, while
some software takes care of substituting, in situations where Unicode
cannot be represented, the traditional ASCII representation that we
used anyway before Unicode was available.

I'd like to put together an ASCII fallback table that does exactly that
for at least the most frequently needed characters for which we do use
fallbacks in daily life already. Here is a start:

From CP1252:

 "EUR" <- 0x20AC EURO SIGN
 "'" <- 0x201A SINGLE LOW-9 QUOTATION MARK
 "\"" <- 0x201E DOUBLE LOW-9 QUOTATION MARK
 "..." <- 0x2026 HORIZONTAL ELLIPSIS
 "+" <- 0x2020 DAGGER
 "^" <- 0x02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
 "S" <- 0x0160 LATIN CAPITAL LETTER S WITH CARON
 "<" <- 0x2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
 "OE" <- 0x0152 LATIN CAPITAL LIGATURE OE
 "Z" <- 0x017D LATIN CAPITAL LETTER Z WITH CARON
 "'" <- 0x2018 LEFT SINGLE QUOTATION MARK
 "'" <- 0x2019 RIGHT SINGLE QUOTATION MARK
 "\"" <- 0x201C LEFT DOUBLE QUOTATION MARK
 "\"" <- 0x201D RIGHT DOUBLE QUOTATION MARK
 "o" <- 0x2022 BULLET
 "-" <- 0x2013 EN DASH
 "-" <- 0x2014 EM DASH
 "~" <- 0x02DC SMALL TILDE
 "TM" <- 0x2122 TRADE MARK SIGN
 "s" <- 0x0161 LATIN SMALL LETTER S WITH CARON
 ">" <- 0x203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
 "oe" <- 0x0153 LATIN SMALL LIGATURE OE
 "z" <- 0x017E LATIN SMALL LETTER Z WITH CARON
 "Y" <- 0x0178 LATIN CAPITAL LETTER Y WITH DIAERESIS

From VT100 special graphics:

 "-" <- 0x2500 BOX DRAWINGS LIGHT HORIZONTAL
 "|" <- 0x2502 BOX DRAWINGS LIGHT VERTICAL
 "+" <- 0x2518 BOX DRAWINGS LIGHT UP AND LEFT
 "+" <- 0x2510 BOX DRAWINGS LIGHT DOWN AND LEFT
 "+" <- 0x250C BOX DRAWINGS LIGHT DOWN AND RIGHT
 "+" <- 0x2514 BOX DRAWINGS LIGHT UP AND RIGHT
 "+" <- 0x253C BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
 "+" <- 0x251C BOX DRAWINGS LIGHT VERTICAL AND RIGHT
 "+" <- 0x2524 BOX DRAWINGS LIGHT VERTICAL AND LEFT
 "+" <- 0x2534 BOX DRAWINGS LIGHT UP AND HORIZONTAL
 "+" <- 0x252C BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
 "<=" <- 0x2264 LESS-THAN OR EQUAL TO
 ">=" <- 0x2265 GREATER-THAN OR EQUAL TO
 "/=" <- 0x2260 NOT EQUAL TO

Others:

 "-" <- 0x2010 HYPHEN
 "-" <- 0x2011 NON-BREAKING HYPHEN
 "-" <- 0x2012 FIGURE DASH
 "-" <- 0x2015 HORIZONTAL BAR
 "-" <- 0x2212 MINUS SIGN
 "/" <- 0x2215 DIVISION SLASH
 "<-" <- 0x2190 LEFTWARDS ARROW
 "->" <- 0x2192 RIGHTWARDS ARROW
 "<=" <- 0x21D0 LEFTWARDS DOUBLE ARROW
 "=>" <- 0x21D2 RIGHTWARDS DOUBLE ARROW
 "'" <- 0x2032 PRIME
 "''" <- 0x2033 DOUBLE PRIME
 "'''" <- 0x2034 TRIPLE PRIME
 "<<" <- 0x226A MUCH LESS-THAN
 ">>" <- 0x226B MUCH GREATER-THAN

All these are much better than the traditional question mark.
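
A minimal sketch of how such a table might be applied, assuming a plain
dictionary from code point to ASCII string. Only a handful of the
entries above are copied in, and the names FALLBACK and ascii_fallback
are made up for this illustration. Anything not in the table still
degrades to the traditional question mark.

```python
# A few entries taken from the table above; extend as needed.
FALLBACK = {
    0x20AC: "EUR",   # EURO SIGN
    0x2026: "...",   # HORIZONTAL ELLIPSIS
    0x2018: "'",     # LEFT SINGLE QUOTATION MARK
    0x2019: "'",     # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',     # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',     # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",     # EN DASH
    0x2014: "-",     # EM DASH
    0x2122: "TM",    # TRADE MARK SIGN
    0x2264: "<=",    # LESS-THAN OR EQUAL TO
    0x2192: "->",    # RIGHTWARDS ARROW
}

def ascii_fallback(text, table=FALLBACK):
    """Return text with every non-ASCII character replaced by its
    fallback string, or by '?' if the table has no entry for it."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(table.get(ord(ch), "?"))
    return "".join(out)

print(ascii_fallback("\u201cx \u2264 y\u201d \u2192 10 \u20ac\u2026"))
```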

I guess users of Cyrillic might be quite happy with the transliteration
that they get when they strip the 8th bit from KOI8 (as this is what
they have to know for keyboard entry anyway), and the CP437 block
graphics could also be represented by -|+ nicely. Even more can be
generated by using the decomposition tables and dropping the combining
characters.
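
Both ideas can be sketched in a few lines, assuming Python's
unicodedata module for the decomposition tables and its koi8_r codec
for the Cyrillic case (KOI8-R specifically is an assumption here, and
the two function names are hypothetical). Note that characters without
a decomposition, such as the OE ligature, pass through unchanged and
would still need a table entry.

```python
import unicodedata

def strip_combining(text):
    """Decompose each character (NFKD) and drop the combining marks,
    so that e.g. S WITH CARON becomes a plain S."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.combining(ch) == 0)

def koi8_strip_high_bit(text):
    """Encode as KOI8-R and clear the 8th bit of every byte. Lowercase
    Cyrillic lands on uppercase Latin letters and vice versa."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")

print(strip_combining("\u017d\u0161 caf\u00e9"))
print(koi8_strip_high_bit("\u043f\u0440\u0438\u0432\u0435\u0442"))
```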

In case someone even wants to have fallbacks for the Latin-1
characters, they could look like this:

       ! c ? ? Y | ? " (c) a << - - (R) -
     +/- 2 3 ' u P . , 1 o >> 1/4 1/2 3/4 ?
   A A A A Ae Aa AE C E E E E I I I I
   D N O O O O Oe x Oe U U U Ue Y Th ss
   a a a a ae aa ae c e e e e i i i i
   d n o o o o oe : oe u u u ue y th ij

In

  ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/iso2asc.txt

you'll even find an algorithm that tries to salvage alignment in
monospaced output that is endangered by substituting several ASCII
characters for one non-ASCII character (by removing surplus spaces in a
clever way).
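
The following is not the algorithm from iso2asc.txt, only a simple
guess at the general idea: each substitution that widens the line adds
to a "debt" of extra columns, and the debt is paid off by shortening
the next run of spaces, always leaving at least one space so that
words stay separated. The function name and the small table in the
example are hypothetical.

```python
def substitute_keep_columns(line, table):
    """Apply multi-character fallbacks from table (char -> string) and
    remove surplus spaces afterwards to keep later columns aligned."""
    out = []
    debt = 0                      # extra columns not yet compensated
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == " ":
            # Measure the whole run of spaces starting here.
            j = i
            while j < len(line) and line[j] == " ":
                j += 1
            run = j - i
            # Never remove the last space of a run.
            cut = max(0, min(debt, run - 1))
            out.append(" " * (run - cut))
            debt -= cut
            i = j
            continue
        rep = table.get(ch, ch)
        debt += len(rep) - 1      # a 2-character fallback costs 1 column
        out.append(rep)
        i += 1
    return "".join(out)

table = {"\u00c4": "Ae", "\u00fc": "ue"}
print(substitute_keep_columns("M\u00fcller   42", table))
```

With a single space after the substituted word there is nothing to
remove, so the following columns shift by one until a longer run of
spaces turns up.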

Further suggestions welcome ...

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


