Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 13 2004 - 16:19:07 CST

    From: "D. Starner" <shalesller@writeme.com>
    >> Some won't convert any and will just start using UTF-8
    >> for new ones. And this should be allowed.
    >
    > Why should it be allowed? You can't mix items with
    > different unlabeled encodings willy-nilly. All you're going
    > to get, all you can expect to get is a mess.

    When you say "you can't", that is too strong when speaking about filesystems,
    which DO NOT label their encoding, and which allow multiple users, each with a
    different locale and therefore a different encoding, to create and use files on
    the same shared filesystem.

    So it does happen that a single filesystem stores filenames in multiple
    encodings. It also happens that systems allow mounting remote filesystems
    shared by hosts that use distinct system encodings, so even if each filesystem
    is internally consistent, its filenames appear with various encodings on the
    client; the situation becomes even more complex when these names are
    crosslinked with soft links or URLs.
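
    As a minimal sketch (in Python, with an illustrative byte sequence, not taken
    from any real system), here is how the same raw filename bytes read back
    differently depending on which charset the reader assumes:

        # Raw filename bytes written by a process running in a UTF-8 locale.
        raw = b'r\xc3\xa9sum\xc3\xa9.txt'

        print(raw.decode('utf-8'))       # 'résumé.txt'  -- what the writer intended
        print(raw.decode('iso-8859-1'))  # 'rÃ©sumÃ©.txt' -- what a Latin-1 user sees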

    Think about the web: it is a filesystem in itself, whose names (URLs) include
    inconsistent encodings. Although there is a recommendation to use UTF-8 in
    URLs, it is not mandatory, and there are lots of hosts serving URLs created
    with some ISO-8859 charset, or even with Windows or Macintosh codepages.

    To address some of these problems, the HTML specifications allow additional
    (but out-of-band) attributes that identify the encoding used for resource
    contents, but this has no impact on the URLs themselves.

    The current solution is "URL-encoding": URLs are treated as binary sequences
    drawn from a restricted set of byte values, but this means transforming what
    was initially plain text into an opaque binary moniker.
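
    For illustration only (a Python sketch; the strings are hypothetical), the
    same character percent-encodes to different byte sequences depending on the
    charset applied before URL-encoding, and nothing in the resulting URL records
    which one was used:

        from urllib.parse import quote

        name = 'café'
        print(quote(name.encode('utf-8')))       # 'caf%C3%A9'
        print(quote(name.encode('iso-8859-1')))  # 'caf%E9'
        # Both are syntactically valid URL paths; the charset is not recoverable.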

    Unfortunately, many web search engines do use URLs to assess the relevance of
    search keywords, instead of treating them only as blind monikers.

    Much has been done to internationalize domain names for use in IRIs, but URLs
    remain a mess and a mixture of various charsets, and IRIs are still rarely
    supported by browsers.

    The problem with URLs is that they must be allowed to contain any valid plain
    text, notably for form data submitted with the GET method, because this
    plain-text data becomes part of a query string, itself part of the URL. HTML
    does allow the form to specify which encoding should be used for this form
    data, because servers won't always expect a single and consistent encoding.
    When this specification is absent, browsers often interpret it as meaning that
    form data must be encoded with the same charset as the HTML form itself, but
    not all browsers observe this rule. In addition, many web pages are
    incorrectly labelled, simply because of incorrect or limited HTTP server
    configurations, and the standards specify that the charset given in the HTTP
    headers has priority over the charset declared in the documents themselves;
    this was a poor decision, inconsistent with the use of the same HTML documents
    on filesystems that do not store the charset of the file content...
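
    For illustration (a Python sketch with a hypothetical form value), the same
    submitted value yields two different GET query strings depending on which
    charset the browser applied; the server only receives the percent-encoded
    bytes:

        from urllib.parse import urlencode

        data = {'q': 'naïve'}
        print(urlencode(data, encoding='utf-8'))       # 'q=na%C3%AFve'
        print(urlencode(data, encoding='iso-8859-1'))  # 'q=na%EFve'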

    So don't think that this is simple. It is legitimate to be able to refer to
    documents which we know are plain text but which have unknown or ambiguous
    encodings (there is much work on the automated identification of the
    language/charset pair used in a document; none of these methods is completely
    free of false guesses).
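
    A minimal sketch of such guessing (in Python; real detectors use statistical
    models and are still not exempt from false guesses):

        def guess_decode(raw: bytes) -> tuple[str, str]:
            # UTF-8 has a strict syntax, so a successful decode is a strong hint;
            # ISO-8859-1 accepts any byte sequence, so it can only be a fallback guess.
            try:
                return raw.decode('utf-8'), 'utf-8'
            except UnicodeDecodeError:
                return raw.decode('iso-8859-1'), 'iso-8859-1 (guess)'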

    For clients trying to use these resources with ambiguous or unknown encodings,
    while knowing that they are effectively plain text (such as a filename), the
    solution of eliminating (ignoring, hiding, discarding...) all filenames or
    documents that look incorrectly encoded may be the worst one: it gives the
    user no indication that these documents are missing, and it does not even let
    the user determine (even with some characters displayed incorrectly) which
    alternate encoding to try. It is legitimate to think about solutions allowing
    at least a partial representation of these texts, so that the user can see how
    they are effectively encoded and get hints about which charset to select.
    Very lossy conversions (to U+FFFD) are not satisfactory here either.
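
    One possible partial representation (a Python sketch; the filename bytes are
    only an example) keeps the offending byte values visible instead of collapsing
    them all to U+FFFD:

        raw = b'caf\xe9.txt'   # a Latin-1 filename read by a UTF-8 application

        print(raw.decode('utf-8', errors='replace'))           # caf<U+FFFD>.txt  (byte value lost)
        print(raw.decode('utf-8', errors='backslashreplace'))  # caf\xe9.txt      (byte value visible)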


