Re: Dutch IJ, again

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon May 26 2003 - 18:05:57 EDT

    From: "Karl Pentzlin" <karl-pentzlin@acssoft.de>
    > Am Montag, 26. Mai 2003 um 16:13 schrieb Pim Blokland:
    > PB> Because there ARE words in Dutch where the combination i+j is not
    > PB> the same as ĳ (e.g. "bijectie") ...
    >
    > In quality typography, does the "ij" in "bijectie" look different
    > from an ij ligature?
    > Is it recommended to write "bi[ZWNJ]jectie" when you don't use
    > U+0133 for "common" "ij"s?

    I also think that using ZWNJ or ZWJ within Dutch words is excessive and would break more than leaving them out would. Typographically, there are very few cases where a "ligated" ij or IJ letter would look different from a pair of letters i + j or I + J.

    The all-lowercase ij letter uses internal kerning to allow the tail of the j to extend below the i. The same should hold when "correctly" drawing the pair i + j (the difference only appears where kerning cannot be used, as in limited environments with monospaced fonts). However, even in such limited environments, there are too many legacy uses of a separate pair of letters where a single monospaced glyph could have been used (for good reasons: these environments simply were not able to represent the kerned letter, so even the legacy-encoded text uses separate letters both in its encoding and in its display).

    For the all-uppercase IJ letter, kerning is rarely used in most proportional fonts, and the letter looks exactly the same as the decomposed letters I + J.

    So I think that all that needs to be addressed for Dutch is the case of the IJ letter used as the initial of a word. Is there any word in Dutch that starts with this combined letter and must be titlecased as "IJ" and not "Ij"? There does not seem to be any "Ij" combined letter defined in Unicode or in any legacy charset for titlecasing. This would mean that any initial pair IJ or ij must be interpreted as a single letter for Dutch, and must effectively be titlecased as "IJ" and not "Ij" (which is currently possible only if we break the single letter into two separate letters, possibly joined with a ZWJ, a complication that the Unicode standard would require even though there is no evidence it would be useful in legacy encodings for Dutch).

    We currently have no problem with "bijectie", whose lowercasing, uppercasing, titlecasing, and casefolding are appropriately "bijectie", "BIJECTIE", "Bijectie" and "bijectie", both in encodings that would (incorrectly, in this case) use a single letter ij or IJ and in encodings that use separate letters.
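    As a rough check of the two preceding paragraphs, here is a minimal sketch using Python's built-in, locale-independent case mappings. The helper name casings and the sample word "ijs" (Dutch for "ice") are my own choices for illustration; the built-ins know nothing about Dutch, which is exactly the point.

```python
# Illustration only: Python's str methods apply the default (non-locale) mappings.
LIGATURE_SMALL = "\u0133"    # LATIN SMALL LIGATURE IJ
LIGATURE_CAPITAL = "\u0132"  # LATIN CAPITAL LIGATURE IJ


def casings(word: str) -> tuple:
    """Return (lowercase, uppercase, titlecase, casefold) of a single word."""
    return (word.lower(), word.upper(), word.title(), word.casefold())


# i + j inside a word: all four mappings already give the expected result.
print(casings("bijectie"))

# i + j at the start of a word: the default titlecase capitalizes only the i.
print(casings("ijs"))

# The single letter at the start of a word titlecases as one unit.
print((LIGATURE_SMALL + "s").title() == LIGATURE_CAPITAL + "s")
```

    Only the titlecase of the word-initial pair differs from what Dutch would want ("Ijs" rather than "IJs"); the other three mappings are unaffected by whether the text uses one letter or two.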

    So the only question is whether we must use a special casing rule for Dutch, which really seems to treat the single letter ij or IJ as decomposable into separate letters without causing problems, except for the equivalence under titlecasing.

    Can someone check in Dutch dictionaries whether there is any occurrence of a word starting with "ij" where "ij" is NOT seen as a single vowel letter but as two separate letters, i then j? Evidence would be an occurrence of a capitalized "Ij..." Dutch word, or an "ij..." Dutch word split across two lines between the i and the j, with the i treated as belonging to a separate syllable...

    I think there is no such evidence, and so we could create a conditional SpecialCasing rule for Dutch (in the "Locale-sensitive mappings" section):

    #code; lower; title; upper; (condition list;)? #comment
    # ================================================================================

    # Dutch

    # I and J are often used instead of the IJ letter in most legacy Dutch text.
    # The single IJ letter already follows the standard casing rules correctly.
    # The following rules only correct titlecasing (for words that begin with I and J),
    # and only if both letters have the same case. Pairs in distinct cases are always
    # treated as separate letters using the standard casing rules:
    #0049 006A; 0069 006A; 0049 006A; 0049 004A; # LATIN CAPITAL LETTER I, LATIN SMALL LETTER J
    #0069 004A; 0069 006A; 0049 006A; 0049 004A; # LATIN SMALL LETTER I, LATIN CAPITAL LETTER J

    0049 004A; 0069 006A; 0049 004A; 0049 004A; nl; # LATIN CAPITAL LETTER I, LATIN CAPITAL LETTER J
    0069 006A; 0069 006A; 0049 004A; 0049 004A; nl; # LATIN SMALL LETTER I, LATIN SMALL LETTER J
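    To make the intent of these two lines concrete, here is a minimal sketch that parses them and applies them when titlecasing a single word. It assumes a hypothetical extension in which the first field may hold a sequence of code points; as far as I know the existing file only uses a single code point there. The names parse_rules and titlecase_word are mine, not part of any library.

```python
RULES_TEXT = """\
0049 004A; 0069 006A; 0049 004A; 0049 004A; nl; # CAPITAL I, CAPITAL J
0069 006A; 0069 006A; 0049 004A; 0049 004A; nl; # SMALL I, SMALL J
"""


def decode(field: str) -> str:
    """Turn a field such as '0049 004A' into the string 'IJ'."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_rules(text: str) -> dict:
    """Map (input sequence, locale) to a (lower, title, upper) triple."""
    rules = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, lower, title, upper, locale = [f.strip() for f in data.split(";")][:5]
        rules[(decode(code), locale)] = (decode(lower), decode(title), decode(upper))
    return rules


def titlecase_word(word: str, locale: str, rules: dict) -> str:
    """Titlecase one word, trying a two-letter rule at the start first."""
    rule = rules.get((word[:2], locale))
    if rule is not None:
        return rule[1] + word[2:].lower()  # title mapping of the pair
    return word[:1].title() + word[1:].lower()  # default behaviour


rules = parse_rules(RULES_TEXT)
print(titlecase_word("ijs", "nl", rules))       # pair rule applies
print(titlecase_word("ijs", "en", rules))       # no rule for this locale
print(titlecase_word("bijectie", "nl", rules))  # pair is not word-initial
```

    A mixed-case input such as "Ij..." matches neither key and so falls through to the default mappings, which corresponds to the two commented-out lines above.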

    This approach, however, relies on context. We could eventually reuse the "After_I" condition defined for Turkish and Azeri, which matches the uppercase letter in the input, and add a similar new definition of "After_i" to match only the lowercase letter i in the input. Because the first letter of the pair uses the standard rule, special casing occurs only for j, and so we would use this second definition:

    004A; 006A; 004A; 004A; nl After_I; # LATIN CAPITAL LETTER J
    006A; 006A; 004A; 004A; nl After_i; # LATIN SMALL LETTER J

    The problem here is that the titlecase mapping is not defined for the second character of a word, which normally uses the lowercase mapping under the current rules... So this creates a complication not currently described in the normative document.
    I do think that the first definition is clearer. But no such special casing rule for pairs of characters currently exists in the normative "SpecialCasing-4.0.0.txt" file, and both approaches may simply not work with existing conforming implementations of case mappings...
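    For comparison, the second, context-based definition could be sketched roughly as follows. This is only an illustration under my own assumptions: "After_i" does not exist today, the table J_RULES and the function names are hypothetical, and the special handling of the second letter is precisely the extension that the current rules do not describe.

```python
# Hypothetical context rules for j, keyed by (input letter, condition).
# Each value is (lower, title, upper), as in the two proposed lines.
J_RULES = {
    ("J", "After_I"): ("j", "J", "J"),
    ("j", "After_i"): ("j", "J", "J"),
}


def condition_for(previous: str) -> str:
    """Name the context condition implied by the preceding input letter."""
    return {"I": "After_I", "i": "After_i"}.get(previous, "")


def titlecase_nl(word: str) -> str:
    """Titlecase one word, letting a matching j at index 1 use its title mapping."""
    out = []
    for index, ch in enumerate(word):
        if index == 0:
            out.append(ch.title())  # first letter: default title mapping
            continue
        rule = J_RULES.get((ch, condition_for(word[index - 1])))
        if index == 1 and rule is not None:
            out.append(rule[1])  # the extension: title mapping on the 2nd letter
        else:
            out.append(ch.lower())  # all other letters: default lower mapping
    return "".join(out)


print(titlecase_nl("ijs"))
print(titlecase_nl("bijectie"))
```

    The condition is evaluated against the input letters, not the already-converted output, which is why both an uppercase and a lowercase variant of the condition are needed.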


