Re: Dutch IJ, again

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 27 2003 - 16:43:39 EDT

    From: "Anto'nio Martins-Tuva'lkin" <antonio@tuvalkin.web.pt>
    > On 2003.05.25, 00:00, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > > even if the Dutch language considers it as a single letter, in a
    > > way similar to the Spanish "ch"
    >
    > I see one major difference: When you apply extra wide inter-char
    > distance, you (should) get, f.i.:
    > K o r t r ij k and not K o r t r i j k
    > but E l c h e and not E l ch e
    > This is common practice in both spanish and dutch typography, ISTK.
    > I was told in this forum that the surest way to keep this working in
    > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
    > languages.

    My opinion about this is not related to the use or non-use of joiner and disjoiner controls.

    I think this belongs in the locale's definition of break rules (I mean the sets of rules for sentence, line, word and hyphenation breaks):

    Shouldn't this be handled by locale-specific ***character (or character-cluster) breakers***, which go beyond what Unicode can provide in a single, unified character model that just tries to represent international text independently of the language?

    After all, Unicode mostly defines only the abstract characters needed to encode a given script, outside of any typographical considerations of fonts and style effects; it does not really address locale-specific typographical needs such as line justification...

    Once again, Unicode should not attempt to be a markup language. It only represents text as a linear stream of abstract characters encoded in strings that can be transmitted. Unicode does not specify typographic needs. Those belong to other systems such as HTML, SGML, or XSLT and CSS, plus other internationalization standards such as transliteration rules, domain-specific conventions, or even the art of text translation...

    Regarding your request to handle ij specially in Dutch, nothing prevents a locale-aware rendering application from remapping the i+j pair to a single ij character before rendering it, if the text is labelled as Dutch...

    So, with a few locale-specific character-cluster breaking rules, you could get:
        K o r t r ij k and not K o r t r i j k
        B i j e c t i e and not B ij e c t i e
    (simply because i+j is a single combined Dutch ij character only if it's not followed by a vowel)
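
    The rule above can be sketched in a few lines of Python. This is only an illustration of the heuristic as stated here (i+j is one cluster unless a vowel follows), not a complete model of Dutch; the names VOWELS, clusters and letter_space, and the "nl" language label, are assumptions made for the example.

```python
# Vowel set used by the heuristic (an assumption for this sketch).
VOWELS = set("aeiouy")

def clusters(text, lang):
    """Split text into spacing units.

    For text labelled Dutch ("nl"), keep "ij" together as one unit
    unless it is immediately followed by a vowel.
    """
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        following = text[i + 2:i + 3]  # empty string at end of text
        if lang == "nl" and pair.lower() == "ij" and following.lower() not in VOWELS:
            out.append(pair)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

def letter_space(text, lang):
    """Join the spacing units with a space to mimic wide inter-character spacing."""
    return " ".join(clusters(text, lang))

print(letter_space("Kortrijk", "nl"))   # K o r t r ij k
print(letter_space("Bijectie", "nl"))   # B i j e c t i e
print(letter_space("Kortrijk", "en"))   # K o r t r i j k
```

    A real implementation would probably need a lexicon as well, since such a simple rule is unlikely to cover every Dutch word.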

    For the same reason, a French text rendered with strict typography would give:
        B oe u f and not B o e u f
    (in this case it would render the oe ligature)
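
    A similar hedged sketch for the French case: since not every o+e pair in French is a ligature (compare "coexister"), this example assumes a small word list rather than a blind substitution. FRENCH_OE_WORDS and ligate_oe are hypothetical names, and the list is deliberately tiny.

```python
# Hypothetical, deliberately tiny lexicon of French words written with the oe ligature.
FRENCH_OE_WORDS = {"boeuf", "coeur", "oeuvre", "soeur"}

def ligate_oe(word, lang):
    """Replace o+e by the ligature in known French words; leave everything else alone."""
    if lang != "fr" or word.lower() not in FRENCH_OE_WORDS:
        return word
    i = word.lower().index("oe")
    ligature = "Œ" if word[i].isupper() else "œ"
    return word[:i] + ligature + word[i + 2:]

# Letter-spacing the remapped word then keeps the ligature as one unit.
print(" ".join(ligate_oe("Boeuf", "fr")))      # B œ u f
print(" ".join(ligate_oe("coexister", "fr")))  # c o e x i s t e r
```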

    Such an approach is still much less complicated than what is actually needed for Brahmic scripts, and even worse for Thai! And it could handle the deficiencies of some conversions to legacy character sets, for example restoring the final form of a Greek sigma when appropriate.
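
    As an illustration of the sigma case, here is a minimal Python sketch that assumes the input lost the distinction and uses the medial sigma everywhere. It treats a sigma as final when it follows a word character and is not followed by one; restore_final_sigma is a hypothetical name, and punctuation such as an apostrophe marking elision is not handled.

```python
import re

# A medial sigma preceded by a word character and not followed by one.
_WORD_FINAL_SIGMA = re.compile(r"(?<=\w)σ(?!\w)")

def restore_final_sigma(text):
    """Replace a word-final medial sigma (σ) with the final form (ς)."""
    return _WORD_FINAL_SIGMA.sub("ς", text)

print(restore_final_sigma("σοφοσ ανθρωποσ"))  # σοφος ανθρωπος
```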

    So the only good question to ask is whether we can label the text with its language, using some markup system, or at least using the Unicode language tags as a possible interface for font renderers that cannot interpret a markup system...

    I would not be shocked to see the ligated or combined forms not rendered in a text simply because the text is marked with the wrong language, or because such markup is simply not available. This is similar to the common approach of rendering text as best we can with the tools we have, by using canonical or compatibility equivalences.

    But I see nothing in Unicode that would require text to be encoded only with the Unicode-preferred character, merely because Unicode recommends it, when in practice other standards exist that mandate input methods or keyboards with which such composition is widely impractical. Strict typographic rules cannot be applied without some smart algorithm, but the reader will always interpret the text correctly (it is the interpretation of text that Unicode standardizes, not its rendering).

    -- Philippe.


