Re: Query for Validity of Thai Sequence

From: Philippe Verdy
Date: Sat Feb 10 2007 - 13:23:54 CST

    From: "Richard Wordingham"
    >> Your question is similar to asking if the sequence string "qzlkqw" is
    >> valid using only Latin consonants;
    > No. The correct Thai analogue of "qzlkqw" as intended would be, say, <TO
    > TAO, SARA I, NIKHAHIT, TO TAO, SARA U, NIKHAHIT>. And both are potentially
    > interpretable, though. Are you sure no-one has used "qzlkqw" as his private
    > notation for a hypothetical PIE compound **k‏þl̥k̂ku? (I'd prefer
    > **tkl̥k̂ku-.)

    I did not refer to Thai in my comment above, but to languages that normally use the Latin script. It was a deliberately improbable, random sequence invented for the purpose of the comment. It shows that although such a sequence will most probably not fit any Latin-based writing system, and could be considered invalid in the languages that use one, it is not invalid for Unicode.

    The question was about the validity of a sequence of Unicode code points, and the reply is that, for Unicode, it is perfectly valid and compliant. This is not a Unicode issue but a question to put to linguists working on the Thai and Pali languages, and to those studying the various orthographic systems of that region that use the Thai script.
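
    To make the distinction concrete, here is a minimal Python sketch of what "valid for Unicode" means at the encoding level. It builds the Thai sequence quoted earlier in this thread by character name and checks only that the string is well-formed, that is, that it contains no lone surrogates and can be encoded as UTF-8. The helper name is_well_formed is hypothetical, and the check deliberately says nothing about whether the text is acceptable Thai or Pali orthography.

```python
import unicodedata


def is_well_formed(text: str) -> bool:
    """Return True if the string can be encoded as UTF-8.

    A Python str fails this only when it contains lone surrogate
    code points; no linguistic or orthographic rule is applied.
    """
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The sequence <TO TAO, SARA I, NIKHAHIT, TO TAO, SARA U, NIKHAHIT>,
# looked up by Unicode character name rather than typed as literals.
thai_names = [
    "THAI CHARACTER TO TAO",
    "THAI CHARACTER SARA I",
    "THAI CHARACTER NIKHAHIT",
    "THAI CHARACTER TO TAO",
    "THAI CHARACTER SARA U",
    "THAI CHARACTER NIKHAHIT",
]
thai_sequence = "".join(unicodedata.lookup(name) for name in thai_names)

# The improbable Latin string from the earlier comment.
latin_sequence = "qzlkqw"

for sample in (thai_sequence, latin_sequence):
    print([f"U+{ord(ch):04X}" for ch in sample], is_well_formed(sample))
```

    Both strings pass, which is all that can be decided at this level; whether either one is meaningful is a question for the writing system concerned.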

    In my opinion, each writing system is specific to one language and is not shared, although there may be similarities (in fact just tolerances that allow the inclusion of foreign words or names). I don't really call that "adaptation", because importing foreign words has always been difficult. For example, there is NO agreed convention for importing Polish names that include a stroked letter L into French or English, even though all these languages use the "same" Latin script. The difficulty is that they don't use the same subset of the Latin script, and that no writing system defines normative behavior for letters outside its own subset.

    For this reason, I really think that each writing system is specific to a (language, script) pair. When there are variants, they cause orthographic differences (reforms may later accept other variants so that they merge into a more precise writing system, but orthographic differences still remain). One good example is the "ae" ligature in French: the question of whether it exists in the letter set or is unified with the two-letter digraph remains unresolved, because it depends on the usage context or history (some people may argue that this indicates distinct languages, but no examples demonstrate it, because these written differences are not reflected in speech).

    All these issues are for linguists and cannot be decided at the Unicode level. For this reason, Unicode really needs to consider all these sequences valid, even if some input methods tuned for specific languages restrict the input and reject some sequences: such sequences are invalid only within those input methods, not at the Unicode level.

    There may also be technical limitations if the Unicode string is the result of transcoding from another, non-Unicode character encoding. Some strings that can be represented in Unicode cannot be represented in the other encoding, but this does not mean that they are invalid for Unicode; it just means that those sequences are impossible or forbidden in the other encoding:

    If you cannot input a capital Y with diaeresis on a French keyboard in legacy applications, even though it is a valid and existing French letter, the input may be rejected, but this certainly does not mean that it is invalid in Unicode, or even invalid in French!
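
    The same point can be sketched in Python for the transcoding case. The choice of ISO-8859-1 as the legacy encoding is my assumption for illustration (the argument does not depend on a particular encoding), and the helper name encodable is hypothetical. U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS has no position in ISO-8859-1, but it does have one in Windows-1252 and ISO-8859-15.

```python
import unicodedata

# U+0178, the capital Y with diaeresis discussed above.
Y_DIAERESIS = "\u0178"


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(unicodedata.name(Y_DIAERESIS))

# Assumed legacy targets, chosen only for illustration.
for encoding in ("latin-1", "cp1252", "iso8859-15", "utf-8"):
    print(encoding, encodable(Y_DIAERESIS, encoding))
```

    A failure for "latin-1" here is a limitation of that encoding only: the character remains a valid Unicode code point, and the same string encodes without error in the other three.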

    Apply the same reasoning to Thai (remember that Unicode only encodes scripts, not writing systems and not languages): technical limitations and linguistic restrictions are not part of the Unicode standard.
