Re: Too narrowly defined: DIVISION SIGN & COLON

From: Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk>
Date: Thu, 12 Jul 2012 09:44:01 +0100

[ Please don't copy me on replies; the place for this is the mailing
  list, not my inbox, unless you want to go off-list. ]

On 2012-07-11, Hans Aberg <haberg-1_at_telia.com> wrote:

>Unicode has added all the characters from TeX plus some, making it
>possible to use characters in the input file where TeX is forced to
>use ASCII. This though changes the paradigm, and it is a question of
>which paradigm one wants to adhere to.

This doesn't seem to make much sense, or have much truth, to me.

TeX does not have a notion of character in the Unicode sense. TeX is a
(meta-)programming language for putting ink on paper. It ultimately
produces instructions of the form "print glyph 42 from font cmr10 at
this position". It does not know or care whether the glyph happens to
be a representation of some Unicode character. (It also isn't tied to
ASCII for its input - when I first used TeX, it was on an EBCDIC
system.)

There are many characters that TeX users use that are not in
Unicode. Indeed, you can't even correctly represent the name of the
system in Unicode, or any other plain text system - an entirely
deliberate choice by Knuth to emphasise that TeX is a typesetting
program, not a text representation format.

Because TeX is agnostic about such matters, one can set up any
convenient encoding for the input data (which is really the source
code of a program). For example, I have written documents in ASCII,
Latin-1, Big5, GB, UTF-8 and probably others. This is very convenient;
but it's only a convenience.

If one uses UTF-8, then one has the problem of how to deal with the
case where Unicode trespasses on TeX's territory, by specifying font
styles. This is not hard: for example, the obvious thing to do is to
arrange for the Unicode MATHEMATICAL ITALIC SMALL M to be an
abbreviation for \mathit{m}, and so on.
Note, incidentally, that this is not the same as the meaning of a
plain ASCII (or EBCDIC) "m" in TeX. In TeX math mode, the meaning of
"m" is dependent on the currently selected math font family: just as
in plain text, the font of "m" depends on the currently selected
text font.
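
The "abbreviation" idea above can be illustrated with a small
preprocessing sketch. It is written in Python purely for illustration
and is not how a TeX setup would do this internally; the function name
to_tex is hypothetical, and only the plain Latin letters of the
mathematical italic alphabet are handled.

```python
import unicodedata


def to_tex(text: str) -> str:
    """Replace Unicode mathematical italic Latin letters with \\mathit{...}.

    Hypothetical helper for illustration only. Anything whose Unicode name
    is not exactly 'MATHEMATICAL ITALIC SMALL|CAPITAL <letter>' is passed
    through unchanged (Greek letters, dotless i, and U+210E PLANCK CONSTANT,
    which fills the italic small h slot, are all left alone).
    """
    out = []
    for ch in text:
        parts = unicodedata.name(ch, "").split()
        is_math_italic_letter = (
            len(parts) == 4
            and parts[0] == "MATHEMATICAL"
            and parts[1] == "ITALIC"
            and parts[2] in ("SMALL", "CAPITAL")
            and len(parts[3]) == 1
        )
        if is_math_italic_letter:
            letter = parts[3].lower() if parts[2] == "SMALL" else parts[3]
            out.append("\\mathit{" + letter + "}")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # U+1D45A is MATHEMATICAL ITALIC SMALL M; the plain "m" is untouched,
    # so its appearance is still left to the current math font family.
    print(to_tex("\U0001D45A + m"))
```

The plain "m" is deliberately not rewritten: as noted above, its
meaning in math mode depends on the currently selected family, whereas
the Unicode character pins the style explicitly.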

One problem, of course, is that there is no MATHEMATICAL ROMAN set of
characters. This is one of the biggest botches in the whole
mathematical alphanumerical symbol botch. If you encode semantic font
distinctions without requiring the use of higher-level markup, then
you need to encode also letters that are semantically distinctively
roman upright. The square root of -1 cannot be italicized in the
statement of a theorem, unlike all the "i"s that appear in the text of
the theorem. Yet Unicode provides no way to mark this semantic
distinction between the characters, and has to rely on the
higher-level markup distinguishing maths (to which some font style
changes should not be applied) from text (in which they should).

A more general problem is that which font styles are meaningful
depends on the document. For example, I give lectures and talks, and I
set my slides in sans-serif. As I don't (usually) use distinctive
sans-serif symbols in my work, the maths is all in sans-serif
too: form, not content. But what then should I see if I type a Unicode
mathematical italic symbol in my slides? Serif, or sans-serif?

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
Received on Thu Jul 12 2012 - 03:48:56 CDT