[unicode] Re: UTF-c

From: mpsuzuki@hiroshima-u.ac.jp
Date: Tue Feb 22 2011 - 11:56:32 CST

    On Tue, 22 Feb 2011 17:26:16 +0100
    Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    >Yes there's currently a sync problem with 2-byte encoded characters
    >(if one byte gets deleted), but they occur in a Unicode range
    >(0x80..0x407F) where they extremely rarely occur in overlong sequences
    >(this range is used by scripts that also abundantly use spaces and
    >ASCII punctuation, in addition to controls and line-breaks), so the
    >need to resynchronize on newlines is already satisfied.

    Thank you for pointing it out.

    Resynchronization on newlines (or on ASCII punctuation)
    is needed, but I think it is gradually becoming
    insufficient. Most writing systems that use the characters
    in 0x80..0x407F also use ASCII punctuation, but some of
    them do not put ASCII punctuation between words: Chinese
    and Japanese often use punctuation code points that are
    Latin-derived but non-ASCII, and the Thai writing system
    puts an ASCII space between sentences but not between
    words. In addition, I now often receive mail messages in
    which a newline appears only at the end of a paragraph.
    In a writing system without interword ASCII spaces,
    sync-on-ASCII cannot prevent a sentence from being broken.
    Sometimes a whole paragraph could be lost.
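
    To make the concern concrete, here is a rough sketch in Python.
    It does not implement UTF-c at all; it only counts how many
    consecutive code points in a text lie outside ASCII, which I use
    as a stand-in for "how far a decoder might have to go before the
    next ASCII resynchronization point". The function name and the
    two sample strings are my own, chosen only for illustration.

```python
def longest_non_ascii_run(text: str) -> int:
    """Length, in code points, of the longest stretch containing no ASCII character."""
    longest = 0
    current = 0
    for ch in text:
        if ord(ch) < 0x80:
            # An ASCII character: treated here as a possible sync point.
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Hypothetical samples: Cyrillic with ASCII spaces and punctuation,
# and kana-only Japanese ending in the non-ASCII full stop U+3002.
russian = "Привет мир, как дела?"
japanese = "きょうはいいてんきですね。"

if __name__ == "__main__":
    print(longest_non_ascii_run(russian))
    print(longest_non_ascii_run(japanese))
```

    In the Cyrillic sample the run never gets longer than one word,
    while in the Japanese sample the run is the whole sentence,
    because nothing in it is ASCII.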

