Re: Folding algorithm and canonical equivalence

From: Mark E. Shoulson
Date: Sun Jul 18 2004 - 21:25:42 CDT


    Michael Everson wrote:

    > At 13:00 +0300 2004-07-18, Jony Rosenne wrote:
    >>> Jony is arguing to extend AccentFolding to Hebrew (fold to
    >>> unpointed). His suggestion is to fold *all* combining marks used
    >>> with Hebrew in that case. I want to double check that he really
    >>> means all combining marks in the Hebrew block, or just some of
    >>> them.
    >> I did mean all. All points and cantillation marks in Hebrew are
    >> optional.
    > In the Hebrew language, perhaps. But in other languages, like Yiddish,
    > which use the Hebrew script, at least some points are NOT optional,
    > and "dropping" them causes textual corruption and loss of data.

    Mm, true. Though for all that, a lot of Yiddish I've seen is also
    written without vowel-points. So the patah-alef and qamats-alef vowels,
    and the yod-yod-patah vs. yod-yod diphthongs, must be distinguished from
    context, like everything else.

    Even so, there's probably some language out there that requires some
    diacritics left in place on Hebrew letters (I don't know much about
    other languages written in Hebrew letters; Elaine Keown knows that
    better). But this folding is *supposed* to lose data. Even in Hebrew,
    folding away all the vowels leaves something probably readable, but with
    less actual information (e.g. foreign names or obscure words might not
    be recoverable with 100% accuracy). And folding away diacritics of
    Latin letters *certainly* causes data loss and textual corruption in
    some languages. I was under the impression that losslessness was a
    non-goal of this folding operation, and in fact Hebrew (and even
    Yiddish) survives its scourge considerably better than a lot of
    Latin-written languages.
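
To make the data loss concrete, here is a minimal Python sketch of what "fold all Hebrew combining marks" might look like. It is only an illustration, not the folding defined in any Unicode document: the function name fold_hebrew_marks is hypothetical, and the rule it applies (drop every nonspacing mark in the range U+0590..U+05FF after canonical decomposition) is an assumption about what "all" means here. Decomposing first is meant to let canonically equivalent inputs fold to the same result.

```python
import unicodedata


def fold_hebrew_marks(text: str) -> str:
    """Hypothetical folding: drop nonspacing marks (points, cantillation)
    that live in the Hebrew block, leaving everything else untouched."""
    # Decompose first so precomposed forms such as U+FB2E (alef with patah)
    # expose their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch
        for ch in decomposed
        # Assumption: "all marks" means general category Mn in U+0590..U+05FF.
        # Punctuation such as maqaf (U+05BE) is not Mn and is kept.
        if not (0x0590 <= ord(ch) <= 0x05FF and unicodedata.category(ch) == "Mn")
    ]
    # Recompose so marks outside the Hebrew block are left as they were.
    return unicodedata.normalize("NFC", "".join(kept))


patah_alef = "\u05D0\u05B7"   # Yiddish patah-alef
qamats_alef = "\u05D0\u05B8"  # Yiddish qamats-alef

# Both vowels collapse to a bare alef, so the distinction is lost.
print(fold_hebrew_marks(patah_alef) == fold_hebrew_marks(qamats_alef))
```

Under these assumptions the two Yiddish vowels become indistinguishable after folding, which is exactly the kind of loss described above, while Latin diacritics pass through unchanged because they fall outside the assumed range.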

