WG: UTF-8 text files

From: Dominikus Scherkl (lyratelle@gmx.de)
Date: Wed Jun 08 2005 - 04:50:32 CDT

    This message was intended to go to the whole list (my fault):

    > > > consider the case of the non-breaking space (U+00A0) which may
    > > > follow lots of uppercase ISO 8859-1 Letters (U+00C0..U+00DF).
    > >
    > > Remember that Lasse's idea is to check _all_ the text; so
    > > while NBSP certainly can occur after a capital accented
    > > letter (or an eszett)
    >
    > But uppercase accented letters fortunately do not often
    > occur at the end of words, do they? Only ß (eszett, U+00DF)
    > is likely to occur before NBSP often, because it's a common
    > word ending in German, but DF A0 to DF BF in UTF-8 means
    > U+07E0 to U+07FF, thus far unassigned code points (in the near
    > future N'Ko letters), which are really unlikely to occur in
    > the middle of German words.
    >
    > More of a threat is 'Â' (C2) followed by some punctuation
    > like NBSP (A0), '«' (AB), '»' (BB), '¿' (BF) or '¡' (A1),
    > which stand for themselves in UTF-8. So words ending in 'Â'
    > may be misinterpreted by simply swallowing the letter. This
    > may be really hard to detect. But as stated above, uppercase
    > accented letters are very uncommon word endings, and text
    > containing accented letters is very, very unlikely to
    > contain them _only_ in such uncommon positions.
    >
    > Best Regards.
    >
    > --
    > Dominikus Scherkl
    >


