ZWJ, ZWNJ and VS in Latin and other Greek-derived scripts

From: Ruszlan Gaszanov (
Date: Thu Jan 25 2007 - 15:13:28 CST


    There's one thing I don't quite understand. Why do we keep encoding variations and combinations (ligatures) of the same base letters from Latin, Greek, Cyrillic and similar scripts as separate code points, when mechanisms already exist to compose them from base characters?

    Consider φ <U+03C6> (GREEK SMALL LETTER PHI) and ϕ <U+03D5> (GREEK PHI SYMBOL), for instance. <U+03D5> only exists to enforce the "straight" glyph in mathematical contexts. Wouldn't it be more sensible to apply, let's say, VS1 <U+FE00> to <U+03C6> to enforce the "loopy" glyph and VS2 <U+FE01> to enforce the "straight" glyph where the distinction is important, while leaving it to the font designer to choose the glyph for plain "VS-less" <U+03C6>?
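    To make the idea concrete, here is a minimal Python sketch of what such sequences would look like in text. The two variation sequences are hypothetical: they are the ones suggested above, not sequences actually registered in Unicode, and the helper names are made up for illustration.

```python
import unicodedata

PHI = "\u03C6"  # GREEK SMALL LETTER PHI
VS1 = "\uFE00"  # VARIATION SELECTOR-1
VS2 = "\uFE01"  # VARIATION SELECTOR-2

# Hypothetical sequences from the suggestion above (NOT registered in Unicode).
loopy_phi = PHI + VS1     # would request the "loopy" glyph
straight_phi = PHI + VS2  # would request the "straight" glyph


def strip_variation_selectors(text: str) -> str:
    """Drop U+FE00..U+FE0F so that glyph variants compare equal to the plain letter."""
    return "".join(ch for ch in text if not "\uFE00" <= ch <= "\uFE0F")


def describe(text: str) -> list[str]:
    """List each code point with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for line in describe(straight_phi):
        print(line)
    # Both variants reduce to the same base letter for searching or collation.
    print(strip_variation_selectors(loopy_phi) == strip_variation_selectors(straight_phi))
```

    The point of the sketch is that a process which does not care about the glyph distinction only has to ignore the selector, and the base letter stays the same in both cases.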

    Or, let's take all those spacing/combining subscript/superscript forms and so-called "mathematical alphabets" - couldn't the same thing have been accomplished by specific VS? No, UTC can't afford to "garbage" the codespace with useful characters like the "Apple" and "Windows" logos (which are not just trademarks, but also important references to keyboard keys and GUI elements), but it has no problem with allocating a wagonload of code points for things that can easily be done with existing mechanisms.
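    For reference, a small Python sketch showing how the character database already ties such forms to their base letters through tagged compatibility decompositions. The three sample characters are my own picks for illustration, and the function names are made up.

```python
import unicodedata

# Illustrative samples: one mathematical alphabet letter, one superscript, one subscript.
SAMPLES = [
    "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A
    "\u00B2",      # SUPERSCRIPT TWO
    "\u2090",      # LATIN SUBSCRIPT SMALL LETTER A
]


def decomposition_tag(ch: str) -> str:
    """Return the compatibility tag (e.g. '<font>'), or '' if the character has none."""
    decomposition = unicodedata.decomposition(ch)
    return decomposition.split()[0] if decomposition.startswith("<") else ""


def fold(ch: str) -> str:
    """Fold a styled form to its base character with NFKC normalization."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", decomposition_tag(ch), "->", fold(ch))
```

    The tag records what kind of variation the separate code point stands for, which is the same information a variation selector would have carried.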

    Then there are ligatures. Some (like Latin W <U+0057> and Cyrillic Ы <U+042B>) are now considered base letters in their own right. Others (like Latin Æ <U+00C6> and Œ <U+0152>) may be considered either base letters or typographic/writing style variations. Others still (like Latin ﬁ <U+FB01>) are usually regarded purely as typographic styles. Historically, many more ligatures of Latin, Cyrillic, Greek, Coptic, Gothic etc. were used than are currently encoded as precomposed characters - sometimes as writing/typographic style variations, in other instances as characters with distinct semantics. Many are encoded as precomposed characters at PUA code points by different fonts/conventions (e.g. MUFI).

    But again, why aren't we using existing mechanisms - namely ZWJ <U+200D> and ZWNJ <U+200C> - for handling ligatures? ZWJ connecting base letters would enforce ligature shaping, even when the font would otherwise render separate characters (hence <U+00C6>, for instance, could become canonically equivalent to <U+0041 U+200D U+0045>). By the same token, ZWNJ could enforce rendering of two separate glyphs where a ligature glyph would otherwise be presented (e.g. <U+0066 U+200C U+0069> would explicitly enforce use of separate f and i glyphs rather than <U+FB01>).
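    A minimal Python sketch of the two sequences, next to what normalization currently does with the precomposed characters. The helper names are made up for illustration; the equivalence between <U+00C6> and the ZWJ sequence is only the suggestion above, not something Unicode defines.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def request_ligature(*letters: str) -> str:
    """Join base letters with ZWJ to ask the renderer for a ligature."""
    return ZWJ.join(letters)


def forbid_ligature(*letters: str) -> str:
    """Join base letters with ZWNJ to ask the renderer for separate glyphs."""
    return ZWNJ.join(letters)


def codepoints(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


ae_sequence = request_ligature("A", "E")  # proposed stand-in for U+00C6
fi_separate = forbid_ligature("f", "i")   # explicitly not U+FB01

if __name__ == "__main__":
    print(codepoints(ae_sequence))
    print(codepoints(fi_separate))
    # As things stand, U+00C6 has no decomposition at all, so no normalization
    # form relates it to the ZWJ sequence.
    print(repr(unicodedata.decomposition("\u00C6")))
    # U+FB01 has only a compatibility decomposition to plain f + i.
    print(unicodedata.normalize("NFKD", "\uFB01"))
```

    Whether a given font and rendering engine actually honours ZWJ or ZWNJ between Latin letters is a separate question that this sketch does not test.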

