Re: Query for Validity of Thai Sequence

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Feb 09 2007 - 19:27:12 CST

    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > Lokesh Joshi wrote on Thursday, February 08, 2007 4:45 AM
    > Subject: Query for Validity of Thai Sequence
    >
    >
    >> If possible can anyone pls confirm that the thai unicode sequence:
    >>
    >> U+0E25 (THAI CHARACTER LO LING) U+0E37 (THAI CHARACTER SARA UEE)
    >> U+0E4C(THAI CHARACTER THANTHAKHAT)
    >>
    >> is a valid sequence, as far i have been knowing thai this seems to be an
    >> invalid sequence, only in above vowels, SARA I (U+0E34) is valid before
    >> THANTHAKHAT.

    In Unicode, ANY sequence of valid non-surrogate code points is valid, even one that contains noncharacters. The only invalid cases are unpaired surrogates, out-of-range code points (outside the 17 planes), and ill-formed UTF-* sequences (of bytes or of code units). What a conforming process must not do is handle canonically equivalent sequences differently from what the standard specifies. Note that Unicode defines conformance for processes that "transform" sequences of code points; the UTF-* encodings do not transform strings in that sense, whereas normalization does, because its result may no longer be identical to the original at the level of the code point stream.
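
    As a rough illustration of the cases listed above, here is a minimal Python sketch. It assumes that Python's strict "utf-8" codec is a fair stand-in for a compliant UTF-8 encoder; the helper name encodable is made up for this example.

```python
# Sketch: which code point sequences a strict UTF-8 encoder accepts.
# "encodable" is a hypothetical helper name used only for this illustration.

def encodable(s: str) -> bool:
    """Return True if Python's strict UTF-8 codec can encode s."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

noncharacter = "\ufffe"    # U+FFFE, a noncharacter code point
lone_surrogate = "\ud800"  # an unpaired high surrogate

print(encodable(noncharacter))    # noncharacters can still be encoded
print(encodable(lone_surrogate))  # an unpaired surrogate is rejected

# A value beyond U+10FFFF (outside the 17 planes) is not a code point at all,
# so Python refuses to build a string from it.
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)
```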

    Given that you use the U+xxxx notation, you are asking about the validity of a sequence of code points. As all these code points are valid individually, the sequence is valid, and can be successfully encoded and decoded with any compliant UTF-* transform or any other compliant encoding.
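
    For what it is worth, the round trip can be checked with a few lines of Python. This is only a sketch and assumes that Python's built-in "utf-8", "utf-16" and "utf-32" codecs behave as compliant UTF-* transforms.

```python
import unicodedata

# The sequence from the question: LO LING, SARA UEE, THANTHAKHAT.
seq = "\u0e25\u0e37\u0e4c"

# Each code point is individually assigned and has a character name.
for ch in seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Encode and decode with each UTF and check that nothing is lost.
round_trips = all(
    seq.encode(enc).decode(enc) == seq
    for enc in ("utf-8", "utf-16", "utf-32")
)
print(round_trips)

utf8_bytes = seq.encode("utf-8")
print(utf8_bytes.hex(" "))
```

    Nothing in these codecs looks at whether the combination makes sense in Thai orthography; they only care that each code point is encodable.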

    What may be confusing is that the Thai TIS standard may restrict the validity of such strings when they are encoded with this national standard (I am not sure about that).
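
    At the level of the byte encoding, at least, nothing stops the sequence. The sketch below uses Python's "tis_620" codec, which is only a code point to byte mapping and does not implement any sequence rules the standard may or may not impose, so it says nothing either way about that open point.

```python
# Sketch: encode the sequence with Python's TIS-620 codec.
# Assumption: the codec is a plain one-to-one character mapping and
# performs no checks on the order or combination of characters.
seq = "\u0e25\u0e37\u0e4c"  # LO LING, SARA UEE, THANTHAKHAT

tis_bytes = seq.encode("tis_620")
print(tis_bytes.hex(" "))

# Decoding gives back the same three code points.
print(tis_bytes.decode("tis_620") == seq)
```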

    Your question is similar to asking whether the string "qzlkqw", which uses only Latin consonants, is valid. I doubt that any actual human language written in the Latin script includes such a sequence in a word, given that it cannot be pronounced, but it is nonetheless perfectly valid in ASCII, as well as in its equivalent Unicode representation.


