French encoding [Was: Chapter on character sets]

Date: Thu Jun 15 2000 - 10:04:08 EDT

> - The language assignments for the various parts are largely correct,
> but:
> * French encoding in 8859-1 has been officially deprecated in favor of
> 8859-15,

Assuming that ISO 8859 is anything official (which is not what I think),
then (use of) Latin-1 (and Latin-3 & Latin-5) has not been deprecated;
rather, a note has been added to the text of these standards to explain that
there are small restrictions. Then a note in an informative appendix
(which means it is nothing official) explains that only Latin-9 has all
the necessary characters to cover French orthography completely. These
characters are seldom used and, more importantly, they are not available on
some keyboards (here, "some" includes Microsoft's, even in the most recent
releases). By the way, exactly the same point applies to Finnish.

Bottom line: Lars Marius' text is certainly the way to go.

> although the reality is that waaaaaaaay more French data is
> encoded as Latin-1 than Latin-9.

How can you say that? In fact, this is completely wrong: to qualify
as encoded in Latin-1 rather than Latin-9, data has to contain the characters
¤ ¼ ½ ¾ ´ ¨ ¸ ¦, which are certainly not common, while conversely
it has to contain the characters € œ Œ or Ÿ, encoded as per Latin-9,
which is also very uncommon but is steadily increasing (for example in the
fr.* Usenet hierarchy).
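
To make the point concrete, here is a small Python sketch (an illustration
only, relying on the iso8859_1 and iso8859_15 codecs shipped with Python)
that lists every byte value the two encodings decode differently. Data that
uses none of these bytes is identical under both labels.

```python
# List every byte value that ISO 8859-1 and ISO 8859-15 decode differently.
differences = []
for value in range(0x100):
    raw = bytes([value])
    latin1 = raw.decode("iso8859_1")
    latin9 = raw.decode("iso8859_15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

for value, latin1, latin9 in differences:
    print(f"0x{value:02X}: Latin-1 {latin1!r}  Latin-9 {latin9!r}")
```

Only eight positions show up, so text that avoids them cannot be attributed
to one encoding rather than the other.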

The reality is that waaaaaaay more French data is encoded as CP-1252, by the
way (and then œ is encoded either as the two characters "oe" or as a single
one, \x9C (\x8C for Œ), when some software is smart enough to do the
conversion, because, as I wrote, these characters are not on the keyboard).
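
For reference, a short Python sketch (illustrative only; the variable names
are mine) that prints the byte used for each of these characters in CP-1252
and Latin-9, and confirms that Latin-1 has no code for them at all.

```python
# Byte values of the extra French characters in three 8-bit encodings.
chars = [
    "\N{LATIN SMALL LIGATURE OE}",
    "\N{LATIN CAPITAL LIGATURE OE}",
    "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}",
    "\N{EURO SIGN}",
]

table = {}
for ch in chars:
    cp1252 = ch.encode("cp1252")
    latin9 = ch.encode("iso8859_15")
    try:
        latin1 = ch.encode("iso8859_1")
    except UnicodeEncodeError:
        latin1 = None  # the character does not exist in Latin-1
    table[ch] = (cp1252, latin9, latin1)

for ch, (cp1252, latin9, latin1) in table.items():
    print(ch, "CP-1252:", cp1252, "Latin-9:", latin9, "Latin-1:", latin1)
```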

As Lars Marius correctly pointed out, there is the problem of tagging, which
may lead one to believe that iso-8859-1 is more widely used than it really is.
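
One rough way to look behind the tag is sketched below in Python. The helper
guess_actual_charset is hypothetical, and it rests on an assumption: that
bytes in the range 0x80-0x9F, which are C1 control codes in ISO 8859-1,
practically never occur in real text and therefore betray CP-1252.

```python
def guess_actual_charset(data: bytes) -> str:
    """Guess what bytes labelled iso-8859-1 really contain (heuristic only)."""
    if any(0x80 <= b <= 0x9F for b in data):
        # These are control codes in ISO 8859-1 but printable in CP-1252.
        return "cp1252"
    if all(b < 0x80 for b in data):
        # Pure ASCII: identical under every encoding discussed here.
        return "ascii"
    # Only bytes 0xA0-0xFF: the bytes alone cannot settle the question.
    return "ambiguous"


print(guess_actual_charset(b"c\x9cur"))            # CP-1252 ligature byte
print(guess_actual_charset(b"coeur"))              # two-letter spelling
print(guess_actual_charset("été".encode("latin-1")))
```

This is a heuristic under the stated assumption, not a detector: it says
nothing about data that merely happens to contain no such bytes.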

> * Croatian is encoded using Latin-1 (du behøver ikke spør Sylvester)

? Probably difficult to achieve. OTOH, Albanian, like German or Swedish, can be
encoded with either Latin-2 or Latin-1.


