WG: UTF-8 text files

From: Dominikus Scherkl (lyratelle@gmx.de)
Date: Wed Jun 08 2005 - 04:50:32 CDT

    This message was intended to go to the whole list (my fault):

    > > > consider the case of the non-breaking space (U+00A0) which may
    > > > follow lots of uppercase ISO 8859-1 Letters (U+00C0..U+00DF).
    > >
    > > Remember that Lasse's idea is to check _all_ the text; so
    > > while NBSP certainly can occur after a capital accented
    > > letter (or an eszett)
    > But uppercase accented letters fortunately do not often
    > occur at the end of words, do they? Only 'ß' (eszett, U+00DF)
    > is likely to occur before NBSP often, because it is a common
    > word ending in German, but DF A0 to DF BF in UTF-8 means
    > U+07E0 to U+07FF, thus far unassigned code points (in the near
    > future N'Ko letters), which are really unlikely to occur in
    > the middle of German words.
    > More of a threat is 'Â' (C2) followed by some punctuation
    > like NBSP (A0), '«' (AB), '»' (BB), '¿' (BF) or '¡' (A1),
    > which stand for themselves in UTF-8. So words ending in 'Â'
    > may be misinterpreted by simply swallowing the letter. This
    > may be really hard to detect. But as stated above, uppercase
    > accented letters are very uncommon word endings, and texts
    > containing accented letters are very, very unlikely to
    > contain them _only_ in such uncommon positions.
    > Best Regards.
    > --
    > Dominikus Scherkl
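
The byte arithmetic in the message above can be checked with a minimal Python sketch. The helper name `reinterpret` and the sample strings are assumptions made for illustration and do not come from the thread; only the byte values (DF A0, C2 A0, C2 AB, and so on) are taken from the message. The sketch encodes text as ISO 8859-1 and then tries to read the same bytes back as UTF-8, which is roughly what a "check all the text" heuristic would do.

```python
# Hypothetical helper: not from the thread, only an illustration.
def reinterpret(latin1_text):
    """Encode text as ISO 8859-1, then try to read the bytes as UTF-8.

    Returns the decoded string, or None if the bytes are not valid UTF-8.
    """
    raw = latin1_text.encode("latin-1")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Eszett followed by NBSP: bytes DF A0 form one valid two-byte sequence,
# which decodes to U+07E0.
eszett_nbsp = reinterpret("\u00df\u00a0")

# 'Â' (C2) followed by NBSP, «, », ¿ or ¡: the pair decodes to the
# punctuation character alone, so the 'Â' is silently swallowed.
swallowed = {p: reinterpret("\u00c2" + p) for p in "\u00a0\u00ab\u00bb\u00bf\u00a1"}

# Sample text (an assumption) with accented letters in ordinary positions:
# the byte F6 for 'ö' can never start a UTF-8 sequence, so the text as a
# whole is rejected even though DF A0 on its own would pass.
whole_text = reinterpret("Fu\u00df\u00a0Gr\u00f6\u00dfe")

print(repr(eszett_nbsp), swallowed, whole_text)
```

The last case is the point of the argument: a single accented letter in a common position is enough to make the whole text fail UTF-8 validation, so the ambiguous pairs only matter when they are the only non-ASCII bytes in the file.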

