Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 13 2004 - 16:19:07 CST

    From: "D. Starner" <shalesller@writeme.com>
    >> Some won't convert any and will just start using UTF-8
    >> for new ones. And this should be allowed.
    >
    > Why should it be allowed? You can't mix items with
    > different unlabeled encodings willy-nilly. All you're going
    > to get, all you can expect to get is a mess.

    When you say "you can't", that is too strong when speaking about filesystems,
    which DO NOT label their encoding, and which allow multiple users, each with a
    different locale and therefore a different encoding, to create and use files on
    the same shared filesystem.

    So it does happen that a single filesystem stores filenames in multiple
    encodings. It also happens that systems allow mounting remote filesystems
    shared by hosts that use distinct system encodings, so even if each filesystem
    is internally consistent, its filenames appear with various encodings on the
    client; the situation becomes even more complex when these names are
    crosslinked with soft links or URLs.
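
    As a minimal sketch (in Python, with an illustrative byte sequence, not taken
    from any real system), here is how the same raw filename bytes read back
    differently depending on which charset the reader assumes:

        # Raw filename bytes written by a process running in a UTF-8 locale.
        raw = b'r\xc3\xa9sum\xc3\xa9.txt'

        print(raw.decode('utf-8'))       # 'résumé.txt'  -- what the writer intended
        print(raw.decode('iso-8859-1'))  # 'rÃ©sumÃ©.txt' -- what a Latin-1 user sees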

    Think about the web: it is a filesystem in itself, whose names (URLs) include
    inconsistent encodings. Although there is a recommendation to use UTF-8 in
    URLs, it is not mandatory, and there are lots of hosts serving URLs created
    with some ISO-8859 charset, or even with Windows or Macintosh codepages.

    To address some of these problems, the HTML specifications allow additional
    (but out-of-band) attributes that identify the encoding used for resource
    contents, but this has no impact on the URLs themselves.

    The current solution is "URL-encoding": URLs are treated as binary sequences
    drawn from a restricted set of byte values, but this means transforming what
    was initially plain text into an opaque binary moniker.
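
    For illustration only (a Python sketch; the strings are hypothetical), the
    same character percent-encodes to different byte sequences depending on the
    charset applied before URL-encoding, and nothing in the resulting URL records
    which one was used:

        from urllib.parse import quote

        name = 'café'
        print(quote(name.encode('utf-8')))       # 'caf%C3%A9'
        print(quote(name.encode('iso-8859-1')))  # 'caf%E9'
        # Both are syntactically valid URL paths; the charset is not recoverable.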

    Unfortunately, many web search engines do use URLs to assess the relevance of
    search keywords, instead of treating them only as blind monikers.

    Much has been done to internationalize domain names for use in IRIs, but URLs
    remain a mess and a mixture of various charsets, and IRIs are still rarely
    supported by browsers.

    The problem with URLs is that they must be allowed to contain any valid plain
    text, notably for form data submitted with the GET method, because this
    plain-text data becomes part of a query string, itself part of the URL. HTML
    does allow the form to specify which encoding should be used for this form
    data, because servers won't always expect a single and consistent encoding.
    When this specification is absent, browsers often interpret it as meaning that
    form data must be encoded with the same charset as the HTML form itself, but
    not all browsers observe this rule. In addition, many web pages are
    incorrectly labelled, simply because of incorrect or limited HTTP server
    configurations, and the standards specify that the charset given in the HTTP
    headers has priority over the charset declared in the documents themselves;
    this was a poor decision, inconsistent with the use of the same HTML documents
    on filesystems that do not store the charset of the file content...
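
    For illustration (a Python sketch with a hypothetical form value), the same
    submitted value yields two different GET query strings depending on which
    charset the browser applied; the server only receives the percent-encoded
    bytes:

        from urllib.parse import urlencode

        data = {'q': 'naïve'}
        print(urlencode(data, encoding='utf-8'))       # 'q=na%C3%AFve'
        print(urlencode(data, encoding='iso-8859-1'))  # 'q=na%EFve'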

    So don't think that this is simple. It is legitimate to be able to refer to
    documents which we know are plain text but which have unknown or ambiguous
    encodings (there is much work on the automated identification of the
    language/charset pair used in a document; none of these methods is completely
    free of false guesses).
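
    A minimal sketch of such guessing (in Python; real detectors use statistical
    models and are still not exempt from false guesses):

        def guess_decode(raw: bytes) -> tuple[str, str]:
            # UTF-8 has a strict syntax, so a successful decode is a strong hint;
            # ISO-8859-1 accepts any byte sequence, so it can only be a fallback guess.
            try:
                return raw.decode('utf-8'), 'utf-8'
            except UnicodeDecodeError:
                return raw.decode('iso-8859-1'), 'iso-8859-1 (guess)'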

    For clients trying to use these resources with ambiguous or unknown encodings,
    while knowing that they are effectively plain text (such as a filename), the
    solution of eliminating (ignoring, hiding, discarding...) all filenames or
    documents that look incorrectly encoded may be the worst one: it gives the
    user no indication that these documents are missing, and it does not even let
    the user determine (even with some characters displayed incorrectly) which
    alternate encoding to try. It is legitimate to think about solutions allowing
    at least a partial representation of these texts, so that the user can see how
    they are effectively encoded and get hints about which charset to select.
    Very lossy conversions (to U+FFFD) are not satisfactory here either.
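
    One possible partial representation (a Python sketch; the filename bytes are
    only an example) keeps the offending byte values visible instead of collapsing
    them all to U+FFFD:

        raw = b'caf\xe9.txt'   # a Latin-1 filename read by a UTF-8 application

        print(raw.decode('utf-8', errors='replace'))           # caf<U+FFFD>.txt  (byte value lost)
        print(raw.decode('utf-8', errors='backslashreplace'))  # caf\xe9.txt      (byte value visible)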


