Re: UTF-8 text files

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Mon Jun 06 2005 - 14:07:23 CDT


    > From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    > Doug Ewell, on Mon 06 Jun 2005 07:08:15 -0700, wrote:
    > > It is still possible to come up with a plausible example of text that is
    > > both valid UTF-8 and plausible Latin-1, and I need to find one -- not
    > > only because my current example is Windows-specific, but also because
    > > Nestlé is not even a trademark (™) but a registered trademark (®).
    >
    > Just find two registered marks that only differ by an ending  for
    > instance:
    > FOO®

    Instead of focusing on the trademark or registered symbols, just consider the case of the non-breaking space (U+00A0), which may follow many of the uppercase ISO 8859-1 letters (U+00C0..U+00DF). With ISO 8859-1 you would get byte sequences from (0xC2,0xA0) to (0xDF,0xA0), which are also valid UTF-8 sequences; (0xC0,0xA0) and (0xC1,0xA0) are not, because 0xC0 and 0xC1 can only start overlong forms, which UTF-8 forbids. This case is probably less rare than the contrived example, notably when the non-breaking space is used in the middle of a compound-name trademark that should remain unbreakable, or when these sequences appear in the data of a wide HTML table whose cells should preferably remain unbreakable (yes, HTML offers other ways to avoid breaks: the nowrap attribute of table cells, CSS, or the <nobr> container element).
    But for plain-text files, these cases are extremely rare...
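
To make the overlap concrete, here is a minimal Python sketch, not a definitive test. It encodes each character in U+00C0..U+00DF followed by a non-breaking space as ISO 8859-1 and checks whether the two bytes also pass a strict UTF-8 decoder. The helper name utf8_reading_of_latin1 is hypothetical, and the sketch assumes Python's built-in "utf-8" codec, which rejects the overlong lead bytes 0xC0 and 0xC1.

```python
def utf8_reading_of_latin1(text):
    """Encode text as ISO 8859-1, then try to read the bytes back as UTF-8.

    Returns the string a UTF-8 reader would see if the bytes happen to be
    valid UTF-8, or None if a strict UTF-8 decoder rejects them.
    (Hypothetical helper, for illustration only.)
    """
    raw = text.encode("latin-1")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


NBSP = "\u00a0"

# Characters in U+00C0..U+00DF that, when followed by NBSP in an
# ISO 8859-1 file, produce a byte pair that is also valid UTF-8.
ambiguous = [
    chr(cp)
    for cp in range(0xC0, 0xE0)
    if utf8_reading_of_latin1(chr(cp) + NBSP) is not None
]

if __name__ == "__main__":
    for ch in ambiguous:
        misread = utf8_reading_of_latin1(ch + NBSP)
        print(f"U+{ord(ch):04X} + NBSP reads as U+{ord(misread):04X} in UTF-8")
```

For example, "É" followed by a non-breaking space is the byte pair (0xC9,0xA0) in ISO 8859-1, which a UTF-8 reader decodes as the single character U+0260.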


