Re: Dutch IJ, again

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 27 2003 - 22:47:34 EDT

    Philippe Verdy continued:

    > From: "Mark Davis" <mark.davis@jtcsv.com>
    > > From: "Anto'nio Martins-Tuva'lkin" <antonio@tuvalkin.web.pt>
    > > > On 2003.05.25, 00:00, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > > > > even if the Dutch language considers it as a single letter, in a
    > > > > way similar to the Spanish "ch"
    > > >
    > > > I see one major difference: When you apply extra wide inter-char
    > > > distance, you (should) get, f.i.:
    > > > K o r t r ij k and not K o r t r i j k
    > > > but E l c h e and not E l ch e
    > > > This is common practice in both spanish and dutch typography, ISTK.
    > > > I was told in this forum that the surest way to keep this working in
    > > > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
    > > > languages.
    > >
    > > Well, I don't know who told you, but WORD JOINER only affects
    > > linebreak behavior, not intercharacter spacing.
    >
    > I think he meant <ZWJ> (the zero-width joiner) used as markup to
    > create a ligated variant of a pair of characters in some languages
    > that offer two very distinct forms (I think about Brahmic scripts
    > such as Devanagari)...

    No, not ZWJ, either.

    U+2060 WORD JOINER (WJ) impacts line breaking behavior -- not the
         applicable concept here.
         
    U+200D ZERO WIDTH JOINER (ZWJ) impacts cursive connection and/or
         ligation -- not the applicable concept here.
         
    U+034F COMBINING GRAPHEME JOINER (CGJ) is the relevant character.
    From Unicode 4.0:

      "U+034F COMBINING GRAPHEME JOINER is used to indicate that
       adjacent characters are to be treated as a unit for the
       purposes of language-sensitive collation and searching."
       
    That function was deliberately limited by the UTC to the status
    of such digraphs for searching and sorting, as that was the only
    well-defined requirement for the character.

    However, as this thread has hinted, there could, in principle,
    be multilingual contexts where there would be other legitimate
    reasons for treating a digraphic ij (as for Dutch) as distinct from
    a non-digraphic ij sequence (as for Spanish). That is the same
    kind of argument which led to encoding of U+034F for collation.

    One can imagine an implementation of automatic letterspacing,
    such that a sequence marked explicitly as a digraph would not
    expand, but that one not so marked would expand. But such
    distinctions would only need to be made in the rather dubious
    conditions of: A) Multilingual text that is also B) marked
    explicitly for language and that also C) requires different
    rules for letterspacing language-by-language. Under such
    circumstances, you could indicate the differences for <ij>
    either by making use of the U+0133 ij digraph character for
    one and <i,j> for the other, or you could indicate the
    differences by <i,CGJ,j> versus <i,j>. The first approach
    would likely work more easily with existing software, but
    results in a problematical representation of Dutch data.
    The second is a more generic Unicode approach, but would
    likely be ignored by most software.
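
To make the two marking schemes concrete, here is a minimal sketch of
such a letterspacer. It is hypothetical throughout: the names CGJ,
split_units and letterspace are invented for illustration, and honoring
CGJ for letterspacing is an assumption of the sketch, not behavior the
standard defines (the defined function of U+034F is limited to collation
and searching).

```python
CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER


def split_units(text):
    """Split text into letterspacing units.

    Assumption of this sketch: a CGJ glues the character before it to the
    character after it, so <i, CGJ, j> forms one unit. U+0133 is a single
    code point and is therefore a unit without any special handling.
    """
    units = []
    join_next = False
    for ch in text:
        if ch == CGJ and units:
            # Attach the joiner to the previous unit and remember to
            # attach the following character as well.
            units[-1] += ch
            join_next = True
        elif join_next:
            units[-1] += ch
            join_next = False
        else:
            units.append(ch)
    return units


def letterspace(text, gap=" "):
    """Insert `gap` between units, never inside a marked digraph."""
    return gap.join(split_units(text))


if __name__ == "__main__":
    print(letterspace("Kortri\u034Fjk"))  # <i,CGJ,j>: i and j stay together
    print(letterspace("Kortr\u0133k"))    # U+0133: one code point, one unit
    print(letterspace("Kortrijk"))        # plain <i,j>: expanded like any pair
```

The invisible CGJ is kept in the output so that the marking survives the
transformation; a real renderer would add spacing at layout time rather
than insert space characters.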

    In any case, the much more likely situation would be software
    that did letterspacing for fine typography based just on
    Dutch rules. It would not *need* any markup of <i,j>
    sequences, since it would be looking for and special-casing
    the sequences, anyway.
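
A Dutch-only letterspacer of that kind could be sketched as below. The
names DUTCH_UNIT and letterspace_dutch are hypothetical, and the rule is
deliberately naive: it treats every "ij" or "IJ" as a digraph, so it
would also glue together an i and a j that merely happen to be adjacent.

```python
import re

# Try the digraph first, then fall back to any single character.
# Assumption of this sketch: only the spellings "ij" and "IJ" are digraphs.
DUTCH_UNIT = re.compile(r"ij|IJ|.", re.DOTALL)


def letterspace_dutch(text, gap=" "):
    """Letterspace text by Dutch rules, keeping ij/IJ together."""
    return gap.join(DUTCH_UNIT.findall(text))


if __name__ == "__main__":
    print(letterspace_dutch("Kortrijk"))    # K o r t r ij k
    print(letterspace_dutch("IJsselmeer"))  # IJ s s e l m e e r
```

No markup is consulted here at all, which is the point: the special case
lives in the language-specific rule rather than in the text.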

    --Ken


