Re: Google is [++dumb]

From: Daniel Biddle (deltab@osian.net)
Date: Mon May 07 2001 - 05:43:08 EDT


On Mon, 7 May 2001, Jungshik Shin wrote:

[...]
> You're assuming that the average user would use MS IE/Netscape
> under MS-Windows, which I guess is true. However, it should also be noted
> that under MacOS and Unix/X11 users don't have to do anything other than
> install fonts for the scripts/languages of interest (which can be *argued*
> to be often the case 'by default'). Netscape (whether 4.x or 6.x)/Mozilla
> picks up whatever *collection/set* of fonts (instead of a single huge
> Unicode/ISO 10646 font) is necessary to render UTF-8 encoded pages.

It should also be noted that it's possible to do something useful with
characters that can't be directly rendered. For instance, my copy of Links
can transliterate Greek and Cyrillic characters into Latin, and so display
them on ASCII-only screens. If I knew the languages, I expect I could
read, without too much difficulty, the Greek and Russian samples at:

http://www.columbia.edu/kermit/utf8.html
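
As a rough illustration of the transliteration idea, here is a minimal
Python sketch. It is not the table or code that Links uses: the mapping
below is a tiny hypothetical sample, and the function name and the '?'
fallback are assumptions made for the example.

```python
# Hypothetical, deliberately incomplete mapping (lowercase letters only).
# A real transliteration table would cover the full alphabets and accents.
TRANSLIT = {
    # Cyrillic
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "н": "n", "о": "o", "р": "r", "с": "s", "т": "t",
    # Greek
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e",
    "λ": "l", "ο": "o", "σ": "s", "ς": "s",
}


def to_ascii(text, fallback="?"):
    """Return text with mapped letters transliterated to ASCII.

    ASCII characters pass through unchanged; anything that is neither
    ASCII nor in TRANSLIT is replaced by `fallback`.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(TRANSLIT.get(ch, fallback))
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("нет"))
    print(to_ascii("αβγ"))
```

The point is only that a character the screen can't show need not be
dropped or turned into garbage; a lookup table gets you something
readable.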

[...]
> It would be nice if Google were customizable per user so that
> users could specify whether they want UTF-8 encoded search results or not.
> Well, we can't register our preference at every web site we visit (as
> you wrote), but at some of the sites we frequent we could, if necessary and
> possible.
>
> Hmm, then I'm wondering if there's any HTTP 'mechanism'
> by which we can tell the server in which encoding(s) we want to receive
> the result, and in what preference order (like we do with 'languages').

There is. HTTP/1.1 (RFC 2616) supports an 'Accept-Charset' header:

# 14.2 Accept-Charset
#
# The Accept-Charset request-header field can be used to indicate what
# character sets are acceptable for the response. This field allows
# clients capable of understanding more comprehensive or special-
# purpose character sets to signal that capability to a server which is
# capable of representing documents in those character sets.

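It works in a similar way to Accept-Language: the client lists the
charsets it accepts, each optionally weighted with a q value between 0
and 1 to express preference.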
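
To make the header's semantics concrete, here is a minimal Python sketch
of how a server might interpret Accept-Charset under the rules in RFC
2616 section 14.2. The function names are made up for this example, the
parsing is simplified (it only understands a single "q=" parameter), and
it is not how any particular server implements it.

```python
def parse_accept_charset(header):
    """Parse e.g. 'utf-8, shift_jis;q=0.5' into {charset: q}."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a charset listed without a q value defaults to q=1
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # assumption: treat a malformed q as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


def quality(charset, prefs):
    """Return the q value that RFC 2616 section 14.2 assigns to charset."""
    charset = charset.lower()
    if charset in prefs:
        return prefs[charset]
    if "*" in prefs:
        return prefs["*"]  # "*" matches every charset not listed explicitly
    # Without "*", unlisted charsets get q=0, except ISO-8859-1 (q=1).
    return 1.0 if charset == "iso-8859-1" else 0.0


def choose_charset(header, available):
    """Pick the most preferred charset the server can offer, or None."""
    prefs = parse_accept_charset(header)
    best = max(available, key=lambda c: quality(c, prefs), default=None)
    if best is None or quality(best, prefs) == 0.0:
        return None
    return best


if __name__ == "__main__":
    print(choose_charset("utf-8, shift_jis;q=0.5", ["shift_jis", "utf-8"]))
```

Note the special case for ISO-8859-1: a client that wants to exclude it
has to say so explicitly (with "iso-8859-1;q=0" or "*;q=0").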

Apache supports it with its AddCharset and AddDefaultCharset directives,
and uses it in content negotiation: if asked for index.html, it can return
index.html.ja.sjis or index.html.ja.utf-8, for instance.

http://httpd.apache.org/docs/mod/mod_mime.html#addcharset
http://httpd.apache.org/docs/mod/core.html#adddefaultcharset
http://httpd.apache.org/docs/content-negotiation.html
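
The file-name convention can be sketched in a few lines of Python. This
is only an illustration of the idea, not Apache's negotiation algorithm:
the extension table stands in for AddCharset lines in a server
configuration, and both it and the function names are hypothetical.

```python
# Hypothetical extension-to-charset table, playing the role that
# AddCharset directives play in an Apache configuration.
CHARSET_BY_EXT = {
    "sjis": "shift_jis",
    "utf-8": "utf-8",
}


def variant_charset(filename):
    """Return the charset implied by a file's extensions, or None."""
    # "index.html.ja.utf-8" -> extensions "html", "ja", "utf-8"
    for ext in filename.split(".")[1:]:
        charset = CHARSET_BY_EXT.get(ext.lower())
        if charset is not None:
            return charset
    return None


def pick_variant(variants, preferred):
    """Return the first variant matching the client's charset order.

    `preferred` lists charsets, most wanted first. Returns None if no
    variant is in any of them.
    """
    for wanted in preferred:
        for name in variants:
            if variant_charset(name) == wanted.lower():
                return name
    return None


if __name__ == "__main__":
    files = ["index.html.ja.sjis", "index.html.ja.utf-8"]
    print(pick_variant(files, ["utf-8", "shift_jis"]))
```

A real server weighs charset together with language, media type and
encoding when choosing a variant; this sketch looks at charset alone.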

I know that Google does at least some content negotiation based on
language, but in my experiments just now I was unable to get UTF-8 from
it. (It's much better than it once was, though: I remember when Google
used to classify e-acute as a symbol and drop it from search terms.)

-- 
Daniel Biddle <deltab@osian.net>


