Re: IDN Security

From: Mark E. Shoulson (mark@kli.org)
Date: Tue Feb 15 2005 - 08:40:27 CST

    This is a good point. If there is a Hebrew punctuation character that
    deserves to exist in IDNs (and I'm not saying there is), it is GERSHAYIM
    (and possibly GERESH). A domain תנ״ך.com makes far more sense than
    תנך.com. Better examples are available (תנ״ך is almost comfortable
    un-gereshed these days). There are a lot of abbreviations used in
    Hebrew in everyday usage that include this mark. It is worth noting
    that it does *not* mean that the word is not to be read as a word. On
    the contrary, many of these acronyms have standard pronunciations, and
    are even pluralized (and occasionally even conjugated) as if they were
    normal words. For example, just the other day I saw in a newspaper headline the
    "word" ח״כים for the plural of ח״כ = חבר כנסת = Member of Knesset. Note
    that in this case the word is *not* meant to be pronounced, apparently,
    since the KAF doesn't go into final form as it usually does for
    pronounced acronyms, like תנ״ך (Hebrew Bible, an acronym for Torah[law],
    Neviim[prophets], Ketuvim[scriptures])... and the adjective תנ״כי =
    "Biblical" is quite normal. Sometimes these words have even passed into
    verbs: דו״ח, for דין וחשבון (lit. judgement and accounting; used to
    mean "report" or "traffic ticket"), is now also used as a verb meaning
    "to report". I'm not sure whether it retains its GERSHAYIM when so
    conjugated, though (לְדַוֵּ״חַ?).

    (It's true that the Hebrew MAQAF could theoretically be useful in a
    domain name as well, but an ordinary HYPHEN-MINUS is a perfectly
    acceptable substitute in this setting, and is what most people would use
    anyway).

    The problem, as correctly pointed out, is that I've pretty much *never*
    seen the Unicode GERSHAYIM code point used properly for this. This is
    probably because most Hebrew text still dates back to ISO-8859-8 days,
    which had no such code point. Probably most people don't even know it's there; I
    am not surprised it isn't on a standard keyboard (I use my own weird
    personalized keyboard for Hebrew). *Everyone* just uses DOUBLE-QUOTE
    (also in cases of Hebrew abbreviations transcribed into Latin letters,
    as in Z"L, B"H, HY"D, BS"D, etc). Run the "date" command on my Linux
    system with the locale set to he_IL and you get:

    ג' פבר 15 09:35:43 EST 2005

    (after converting from ISO-8859-8 to UTF-8). Note the apostrophe
    instead of GERESH after the weekday number. Well, this is from
    ISO-8859-8, which had no GERESH... which I suppose is the point: the
    locale for he_IL is not even Unicode!
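
    To make the substitution concrete, here is a minimal Python sketch
    that names the two Hebrew punctuation marks and their usual ASCII
    stand-ins, and checks which of them can be encoded in ISO-8859-8 at
    all. The helper names (describe, fits_iso_8859_8) are mine, made up
    for illustration; only the standard unicodedata module and the
    iso8859_8 codec are assumed.

```python
import unicodedata

# Pairs of (intended Hebrew punctuation, ASCII stand-in typed in practice).
PAIRS = [
    ("\u05F4", "\u0022"),  # GERSHAYIM vs. double quote
    ("\u05F3", "\u0027"),  # GERESH vs. apostrophe
]


def describe(ch):
    """Return the code point and Unicode character name, e.g. 'U+0022 QUOTATION MARK'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def fits_iso_8859_8(ch):
    """True if the character can be encoded in ISO-8859-8."""
    try:
        ch.encode("iso8859_8")
        return True
    except UnicodeEncodeError:
        return False


for hebrew, stand_in in PAIRS:
    print(describe(hebrew), "| encodable in ISO-8859-8:", fits_iso_8859_8(hebrew))
    print(describe(stand_in), "| encodable in ISO-8859-8:", fits_iso_8859_8(stand_in))
```

    The Hebrew marks should come out as not encodable and the ASCII
    stand-ins as encodable, which is consistent with the apostrophe
    showing up in the date output above.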

    If I were registering a domain name, from a linguistic perspective and
    from the point of view of naming it the right thing, I'd definitely want
    GERSHAYIM and probably GERESH available. But it would be a rare person
    who could enter it correctly. Still, if you gave me the choice, I'd
    prefer that they be included; maybe it will encourage correct usage in
    the future.

    N.B. throughout this, GERESH and GERSHAYIM refer to the *punctuations*
    of those names, not to be confused in any way with the *accents* of the
    same names.
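
    For anyone who wants to check which is which, a small Python sketch
    using the standard unicodedata module lists both sets of code points
    with their general categories. The grouping dictionary is just my own
    illustration; the punctuation marks should show up as category Po and
    the accents (cantillation marks) as Mn.

```python
import unicodedata

# Same names, different characters: punctuation vs. cantillation accents.
CODE_POINTS = {
    "punctuation": ["\u05F3", "\u05F4"],
    "accent": ["\u059C", "\u059E"],
}


def summarize(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for kind, chars in CODE_POINTS.items():
    for ch in chars:
        code, name, category = summarize(ch)
        print(f"{kind:12} {code} {name} (category {category})")
```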

    ~mark

    Cary Karp wrote:

    > Quoting Mark E. Shoulson:
    >
    >> I recognize this is opening a can of worms... but then, it was you
    >> that opened it. I'm looking at the idn-chars.html page, and I have a
    >> few questions about (naturally) the Hebrew script (since that's one
    >> I'm familiar with).
    >
    >
    > I have another question about the IDN implementation of the Hebrew
    > script. Given that IDN security concerns stand in direct proportion to
    > the size of the character repertoire in actual use, I trust that it is
    > relevant (at least initially) to the present topic heading.
    >
    > The HEBREW PUNCTUATION GERSHAYIM U+05F4 <״> appears in the penultimate
    > position in a sequence of Hebrew characters that is not to be read as
    > a word. Since such things as acronyms are regularly used as domain
    > labels, it thus appears necessary for any registry supporting Hebrew
    > to include this code point in the corresponding character table. If
    > so, this is a good example of a situation where "an exception is
    > appropriate" to the general stricture on "punctuation characters",
    > stated in the ICANN Guidelines for the Implementation of
    > Internationalized Domain Names.
    >
    > The problem is that a standard Hebrew keyboard doesn't include this
    > character, which is normally replaced by a QUOTATION MARK U+0022.
    > Anyone entering an IDN including U+05F4 via a keyboard will therefore
    > be likely to mistype it as U+0022, causing it to fail. It is possible
    > to get an IDN string containing a quotation mark through ToASCII by
    > leaving the UseSTD3ASCIIRules flag unset (which is counter to a
    > "should" point in the ICANN Guidelines). The resulting string contains
    > a literal quotation mark. Since it is this string that is actually
    > included in the zone file, the name server will need to load what it
    > is likely to reject as a malformed name regardless of any IDN
    > considerations.
    >
    > Can someone who has detailed understanding of Hebrew orthography
    > please comment on the necessity of the gershayim in the context
    > described above. If it cannot comfortably be done without, how can one
    > offset the confusion that seems inevitable given the alternate
    > orthography on which the local keyboard is based? Are there other
    > code points listed as punctuation in the Unicode charts that are
    > similarly necessary for the IDN support of established orthographic
    > convention in the languages for which they are used?
    >
    > /Cary
    >
    >
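
    The ToASCII behaviour Cary describes can be tried out with a rough
    Python sketch. It relies on Python's built-in "idna" codec, which
    follows RFC 3490 and, as far as I know, does not apply
    UseSTD3ASCIIRules; treat that as an assumption and check your own
    IDN library before relying on it. The to_ascii wrapper and the
    variable names are made up for illustration.

```python
# The same label spelled two ways: with GERSHAYIM (U+05F4) and with the
# QUOTATION MARK (U+0022) that a standard keyboard would produce instead.
label_gershayim = "\u05EA\u05E0\u05F4\u05DA"
label_quote = "\u05EA\u05E0\u0022\u05DA"


def to_ascii(label):
    """Convert one label to its ACE form using Python's built-in idna codec.

    Assumption: this codec leaves UseSTD3ASCIIRules unset, so ASCII
    punctuation is not rejected.
    """
    return label.encode("idna")


ace_gershayim = to_ascii(label_gershayim)
ace_quote = to_ascii(label_quote)

print(ace_gershayim)
print(ace_quote)
print("same ACE string:", ace_gershayim == ace_quote)
print("literal quote in ACE string:", b'"' in ace_quote)
```

    If the assumption holds, the two spellings give different ACE
    strings, and the second one carries a literal quotation mark, which
    is the string a name server would then be asked to load.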



    This archive was generated by hypermail 2.1.5 : Tue Feb 15 2005 - 08:41:36 CST