RE: Character disunification (was: atnah hafukh)

From: Jony Rosenne
Date: Tue Dec 21 2004 - 00:09:31 CST

    Since this is not a Hebrew issue, I am copying the general list.

    The issue is how Unicode should address the disunification of an encoded
    character when some users (probably most) do not make the distinction and
    some do.

    I think that in such cases the UTC should address the transition and the
    needs of those users who do not make the distinction. We now have a
    sizeable Unicode legacy. Gone are the days when legacy meant only
    pre-Unicode encodings, and the UTC should recognize this change.

    My proposal was and is that we need three characters: one for the
    existing, "ambiguous" character, and two for the distinct ones. Dean is
    making the same proposal.
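
    To make the three-character idea concrete, here is a minimal sketch of how
    software might let users who do not make the distinction keep searching and
    comparing text as before. The code points are hypothetical Private Use Area
    placeholders, not real or proposed assignments, and the names AMBIGUOUS,
    DISTINCT_A, DISTINCT_B, fold and loose_equal are invented for illustration.

```python
# Hypothetical placeholders from the Private Use Area; NOT real assignments.
AMBIGUOUS = "\uE000"   # the existing character, used where no distinction is made
DISTINCT_A = "\uE001"  # first of the two disunified characters
DISTINCT_B = "\uE002"  # second of the two disunified characters

# Folding table: both distinct characters collapse to the ambiguous one.
FOLD = str.maketrans({DISTINCT_A: AMBIGUOUS, DISTINCT_B: AMBIGUOUS})


def fold(text: str) -> str:
    """Return text with the distinction removed."""
    return text.translate(FOLD)


def loose_equal(a: str, b: str) -> bool:
    """Compare two strings as a user who ignores the distinction would."""
    return fold(a) == fold(b)
```

    Under these assumptions, existing data that uses the ambiguous character
    stays valid and unchanged, while text that does make the distinction can
    still be matched against it by folding.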

    Unicode is a character standard, rather than a glyph collection, and should
    take a character-oriented approach, rather than the glyph-centric approach
    advocated by, for example, Michael:

    > [] On Behalf Of Michael Everson
    > Sent: Monday, December 20, 2004 11:26 PM
    > To:
    > Subject: [hebrew] Re: atnah hafukh
    > It's a disunification scenario: There are two distinct characters,
    > with different glyphs, in orthographies which make the distinction.
    > In other orthographies which do not make the distinction, it doesn't
    > necessarily matter what the glyph is.

    Firstly, the glyph does matter, and secondly, the characters necessarily do


    > -----Original Message-----
    > From:
    > [] On Behalf Of Mark E. Shoulson
    > Sent: Tuesday, December 21, 2004 5:58 AM
    > To: Dean Snyder
    > Cc:
    > Subject: [hebrew] Re: atnah hafukh
    > Dean Snyder wrote:
    > >We should be encoding the maximal union of all character mergers (and
    > >splits). To put it in other words, we should be taking a synchronic,
    > >atomistic view on intra-script character repertoire and not a diachronic,
    > >collapsing one.
    > >
    > >
    > You did suggest something like this during one of the various Hebrew
    > character debates. But it doesn't hold up well in general. By that
    > logic, we also now need to encode LATIN LETTER U OR V, LATIN J (both in
    > CAPITAL and SMALL versions), plus LATIN SMALL SHORT S (though we could
    > probably manage to use just U+0073 for that and encode SHORT S
    > separately). But I don't think anyone would want such a confusing state
    > of affairs. Spelling things right is hard enough when there's only
    > *one* choice for each letter!

    The example isn't relevant. These disunifications are very old - you could
    have added C/G - and the I and U are commonly used for the ambiguous


    > ~mark
