re: What justification for separately encoding two forms of lowercase sigma

From: verdy_p (
Date: Sun Sep 06 2009 - 16:57:31 CDT

    "Shriramana Sharma" wrote:
    > Correct me if I am wrong, but the single Greek letter sigma is said to
    > have two different forms, one in word-final and other in other places.
    > These are encoded in Unicode as 03C2 and 03C3 respectively.
    > Now are these two symbols not just two different ways of writing the
    > same character? If yes, how can they be separately encoded? Is it only
    > to keep compatibility with some earlier standard? Or can these two
    > actually be considered as two different characters?

    It would be simple if the correct letter form could be decided from a simple context, depending on nothing more than
    some properties of the previous character or of the next one. In that case, a solution like the Arabic
    contextual letter forms would work.
    Even though more advanced text rendering engines and font formats can manage more complex cases (with
    substitution rules that should not have to apply some canonical reordering of "equivalent" encodings to cover all
    the possibilities), you cannot depend on these techniques alone.
    For the same reason, the Latin script has the case of the long s (which is no longer used in modern
    languages), whose position cannot be reliably determined by simple algorithms, because it has always depended
    on the author, and even a single author did not apply it consistently within the same text.
    The case of the Greek sigma is quite similar: there are cases where the final form of the lowercase sigma needs
    to appear in the middle of a word, or the reverse. There is also the need for backward compatibility with
    legacy ISO encodings that treated these two characters as distinct (because they could not depend on more advanced
    contextual rendering, given the simple one-to-one mappings from characters to glyphs in almost all legacy fonts for
    For these reasons, the two letters need to be considered distinct in lowercase, even if both of them
    map to the same capital Sigma letter.
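
    The loss of information can be sketched in Python. This is only an illustration, assuming CPython 3.3 or later
    (where str.lower() applies the final-sigma rule); the word 'φιλοσ' is a made-up string with a non-final sigma
    deliberately placed at the end, and the helper name is hypothetical.

```python
FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"          # GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

# Both lowercase forms share a single uppercase form.
same_capital = SIGMA.upper() == FINAL_SIGMA.upper() == CAPITAL_SIGMA


def survives_case_round_trip(word: str) -> bool:
    """Return True if upper-casing then lower-casing gives back the same string."""
    return word.upper().lower() == word


regular = "οδο" + FINAL_SIGMA  # final form at the end of the word: the usual spelling
unusual = "φιλο" + SIGMA       # non-final form at the end: a deliberate choice (made-up example)

print(same_capital)
print(survives_case_round_trip(regular))
print(survives_case_round_trip(unusual))
```

    The usual spelling comes back unchanged because the lowercasing algorithm guesses the form from the context; the
    unusual one does not, which is why the distinction has to live in the encoded text.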

    (In fact, to help prevent the loss of information, case-mapping algorithms should no longer
    be applied to the text itself when capitalization is only needed as a rendering style, or for collation and search
    purposes, unless the capitalization is required by the standard orthography of the language: the capitals
    are to be treated as distinct from the small letters. This is especially important for dictionaries, and can explain
    why, for example, instances were created with significant case including for the first letter of
    article names; instead the search facility can cope with those differences, and can help find the other articles and
    provide links to them when appropriate).
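
    A minimal sketch of such a search facility, in Python: the stored titles keep their original case, and only the
    lookup key is case-folded. The function names and the sample titles are assumptions made for the illustration.

```python
from collections import defaultdict


def build_index(titles):
    """Map each case-folded key to every original spelling that folds to it."""
    index = defaultdict(list)
    for title in titles:
        index[title.casefold()].append(title)
    return index


def lookup(index, query):
    """Return all stored titles matching the query, ignoring case."""
    return index.get(query.casefold(), [])


# Titles are stored untouched; case (and the sigma form) stays significant in the data.
titles = ["Polish", "polish", "\u03bf\u03b4\u03bf\u03c2", "\u039f\u0394\u039f\u03a3"]
index = build_index(titles)

print(lookup(index, "POLISH"))
print(lookup(index, "\u03bf\u03b4\u03bf\u03c3"))
```

    Nothing is rewritten in the stored text: the folding only happens on the key, so the search can still offer the
    other case variants as related entries.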

    So if you accept that case is significant, you have to accept that other letter forms are also significant in
    multicameral scripts like Latin and Greek (which are not purely bicameral). Similar considerations could have been
    applied to Arabic, but for legacy reasons the contextual forms are not encoded distinctly. This creates
    additional encoding difficulties: the letter forms must be controlled with extra joiners/disjoiners, which make no
    sense in Arabic by themselves, unless you consider that the letter plus the (dis)joiner is the way we "atomically"
    encode the letter forms (but this is not the way it works: the (dis)joiners have to be inserted contextually, and
    this does not help automate text input).
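
    To make the Arabic situation concrete, here is a small Python sketch using the standard unicodedata module. It
    only shows how the existing normalization forms treat a compatibility presentation form and a joiner; the choice
    of the letter BEH is arbitrary.

```python
import unicodedata

BEH = "\u0628"          # ARABIC LETTER BEH (the abstract letter)
BEH_INITIAL = "\ufe91"  # ARABIC LETTER BEH INITIAL FORM (compatibility character)
ZWJ = "\u200d"          # ZERO WIDTH JOINER

# The presentation form is only folded to the letter by the compatibility forms.
kept_by_nfc = unicodedata.normalize("NFC", BEH_INITIAL) == BEH_INITIAL
folded_by_nfkc = unicodedata.normalize("NFKC", BEH_INITIAL) == BEH

# A joiner between two letters that would join anyway is left in place by normalization.
plain = BEH + BEH
with_joiner = BEH + ZWJ + BEH
joiner_ignored = unicodedata.normalize("NFC", with_joiner) == plain

print(kept_by_nfc, folded_by_nfkc, joiner_ignored)
```

    In other words, the two strings with and without the joiner stay distinct after normalization, even when they are
    expected to render identically.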

    There should exist a way to remap a text that contains unnecessary (dis)joiners to its canonical form (according to
    the existing joining rules), and to do the reverse, without changing the text. But the joining properties are not
    considered in the current specification of the standard canonicalization forms of Unicode. I think that this should
    be corrected by adding another canonicalization mapping specific to Arabic (even if this apparently "changes" the
    normalized equivalences). Similar algorithms should be developed as well for other Asian scripts that use joining
    controls. For Latin, the usage of compatibility mappings (NFKC/NFKD) should also be deprecated in favor of the
    systematic use of joiner controls where appropriate (for example for the ligatures fi/fl/ffi/ffl/... by remapping
    the ligatures to equivalent sequences using normal letters and joiners, and remapping the non-ligatured letters
    with the disjoiners, and also making these equivalent under the new canonicalization schemes).
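
    As a rough sketch of what the ligature remapping could look like (this is a hypothetical mapping written for the
    illustration, not an existing Unicode normalization form), the following Python function replaces the Latin
    ligatures U+FB00..U+FB04 by their letters separated by ZERO WIDTH JOINER. The function name is an assumption.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def ligatures_to_joined(text: str) -> str:
    """Hypothetical remapping: ff/fi/fl/ffi/ffl ligatures -> letters joined by ZWJ."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb04":
            # The decomposition field looks like '<compat> 0066 0069'.
            parts = unicodedata.decomposition(ch).split()[1:]
            letters = [chr(int(p, 16)) for p in parts]
            out.append(ZWJ.join(letters))
        else:
            out.append(ch)
    return "".join(out)


word = "\ufb01sh"  # 'fish' written with the fi ligature character
joined = ligatures_to_joined(word)

print([hex(ord(c)) for c in joined])
print(unicodedata.normalize("NFKC", word))
```

    Unlike NFKC, which simply drops the information that a ligature was requested, the joined sequence keeps it, and
    removing the joiners still gives back the plain letters.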

    Note: I'm not advocating for the change of the standard canonicalization algorithms, but for the development of
    better (and still safe) canonicalization algorithms, which must still be stable over the existing NFC/NFD
    equivalences. These would greatly simplify the development of other tools such as rendering engines, of fonts
    (fewer substitution rules needed in font tables), of input methods and keyboard drivers, and of plain-text
    searches (using NFKC/NFKD there is really a mess, giving too many false hits). And these algorithms should be open
    to later changes.
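
    The false hits are easy to reproduce. The Python sketch below compares two search keys, one built with NFC and
    one with NFKC; the sample pairs are arbitrary and the helper names are assumptions.

```python
import unicodedata


def nfc_key(s: str) -> str:
    """Search key that only applies canonical equivalences."""
    return unicodedata.normalize("NFC", s)


def nfkc_key(s: str) -> str:
    """Search key that also applies compatibility mappings."""
    return unicodedata.normalize("NFKC", s)


# Pairs of strings that a reader would normally consider different.
pairs = [
    ("x\u00b2", "x2"),  # SUPERSCRIPT TWO versus the digit 2
    ("\u2460", "1"),    # CIRCLED DIGIT ONE versus the digit 1
]

for left, right in pairs:
    print(nfc_key(left) == nfc_key(right), nfkc_key(left) == nfkc_key(right))
```

    With the NFKC key, a search for "x2" also matches "x" followed by a superscript two, which is rarely what the
    user wanted.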

    Under the new canonicalization schemes, we could also support the Hangul script in a much simpler way (the
    distinction between initial and final consonants, meant to exhibit the delimitation of composition squares, is
    artificial and does not work as intended with older Korean texts or complex syllables). With this tool, it would be
    possible to reliably remap the much simpler alphabet (using a simpler set of jamos only) to its final form preferred
    by the modern usage of the script (using the full set of jamos and preencoded pseudo-syllables). And finally, we
    would be able to assert that the collation tables are set up correctly and consistently, including after tailoring
    (something that cannot be asserted, as of today, except possibly for the default collation table, which has been
    tweaked manually, and sometimes with bugs not detected in early versions of the DUCET).
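
    For reference, this is how the existing normalization forms already relate the precomposed syllables to the
    conjoining jamos, and how the same consonant gets two different code points depending on its position. It is a
    Python sketch with an arbitrary sample syllable, not a model of the simpler scheme suggested above.

```python
import unicodedata

CHOSEONG_NIEUN = "\u1102"   # HANGUL CHOSEONG NIEUN (leading consonant)
JUNGSEONG_A = "\u1161"      # HANGUL JUNGSEONG A (vowel)
JONGSEONG_NIEUN = "\u11ab"  # HANGUL JONGSEONG NIEUN (trailing consonant)

syllable = "\ub09c"  # precomposed syllable NAN

# NFD splits the precomposed syllable into conjoining jamos; NFC puts it back together.
jamos = unicodedata.normalize("NFD", syllable)
recomposed = unicodedata.normalize("NFC", jamos)

print([hex(ord(c)) for c in jamos])
print(recomposed == syllable)
print(CHOSEONG_NIEUN == JONGSEONG_NIEUN)
```

    The leading and the trailing nieun are the same letter of the alphabet, but they are encoded separately so that the
    syllable boundaries can be recovered from the jamo sequence.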

    This archive was generated by hypermail 2.1.5 : Sun Sep 06 2009 - 17:00:42 CDT