Mon Oct 25 1999

Jonathan Rosenne wrote:

> >Is there a minimal pair in Hebrew that shows that KAF/FINAL KAF are
> >different letters?
> What do you mean? No one denies that they are different shapes of the same
> letter, but we say that you have to encode them at source because the shape
> cannot be determined algorithmically in a practical way.

Thanks, I was hoping you'd say that :-)

> Arno,
> [...]
> your arguments are correct but are barely relevant and misleading.

On the contrary, I'm very grateful for Arno's details, and they are
certainly relevant to my argument.

Let's draw some conclusions:

(1) The situation of LONG-S versus ROUND-S is quite parallel to that
    between FINAL-KAF and KAF. In both cases we clearly see the same
    letter, but in different shapes. In a sense, they are glyph
    variants, but choosing the wrong one would be considered
    incorrect spelling.

    In both cases, it is in principle possible to make the decision by
    using a large dictionary (syllable segmentation algorithms as in
    TeX use hyphenation rules obtained from one), but clearly that is
    not a feasible solution for, say, a simple mail reading program.
    (But it would be feasible, for instance, if a German publisher
    wants to print a book in Fraktur or Suetterlin from an electronic
    manuscript that doesn't distinguish round-s and long-s.)
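
To make the dictionary idea concrete, here is a minimal sketch in Python. The word list and the function name are purely illustrative assumptions, not taken from any real tool; a real publisher would need a dictionary covering the whole vocabulary, which is exactly why this does not scale down to a simple mail reader.

```python
# Hypothetical mini-dictionary (an assumption for illustration only):
# plain spelling -> spelling that uses long s (U+017F) where required.
LONG_S_SPELLINGS = {
    "Wasser": "Wa\u017f\u017fer",  # long s inside the word
    "ist": "i\u017ft",             # long s before t
    "das": "das",                  # word-final s stays round
    "Haus": "Haus",                # word-final s stays round
}


def restore_long_s(text):
    """Replace each whitespace-separated word by its dictionary spelling.

    Words missing from the dictionary are left unchanged, and punctuation
    attached to a word is not handled; both are limits of this sketch.
    """
    return " ".join(LONG_S_SPELLINGS.get(word, word) for word in text.split())
```

Anything outside the word list passes through untouched, so the quality of the result depends entirely on how large the dictionary is.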

(2) Therefore, FINAL-KAF and LONG-S need to be encoded. Not, as has
    been hinted, because they come from an ancient legacy encoding,
    but because they are necessary, here and now.

(3) There still remains the question why LONG-S has a compatibility
    decomposition to S, while FINAL-KAF doesn't. I'm not sure what
    the consequences of this mapping are, but one theory would be this:
    When I search for a string in a word processor, I would like "s"
    to match all of "s", "S", and "long-s". How does this work in
    Hebrew? Would you want to find a match with FINAL-KAF if you typed
    a KAF in the search pattern?
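
One way to see what the mapping does in a search is the following sketch, which uses Python's standard unicodedata module. The helper names search_key and matches are my own, and folding with NFKD plus case folding is only one possible search policy, not something the standard prescribes for word processors.

```python
import unicodedata

LONG_S = "\u017f"     # LATIN SMALL LETTER LONG S
KAF = "\u05db"        # HEBREW LETTER KAF
FINAL_KAF = "\u05da"  # HEBREW LETTER FINAL KAF


def search_key(text):
    """Fold text for loose matching: compatibility-decompose, then case-fold."""
    return unicodedata.normalize("NFKD", text).casefold()


def matches(pattern, text):
    """Return True if the folded pattern occurs in the folded text."""
    return search_key(pattern) in search_key(text)


# LONG S carries a compatibility decomposition to "s" (U+0073) ...
print(unicodedata.decomposition(LONG_S))
# ... while FINAL KAF has none, so this prints an empty string.
print(repr(unicodedata.decomposition(FINAL_KAF)))
```

Under this policy matches("s", "Wach\u017ftube") succeeds, but matches(KAF, FINAL_KAF) does not: a search that wants KAF to find FINAL-KAF would have to add that equivalence itself.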

(4) The long philosophical discussions on "What is a letter?" "What is
    a character?" "Who am I?" may be fun, but have little impact on
    the practice of Unicode.

    The distinction between glyph variants that do not need to be
    encoded, glyph variants that need to be encoded, and genuinely
    different letters depends on locale and time. Principles will
    count much less than how those `symbols' are actually used in
    each script that employs them.

    Which `symbols' have been encoded in Unicode with separate code
    points and where seems to be more a function of previous encodings,
    of national sensitivities, and national pride, than of any firm


