Re: FYI: Google blog on Unicode

From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Jan 29 2010 - 11:34:47 CST

    We separate out pure ASCII pages because ASCII is a subset of most
    encodings used on the web (Latin-1, etc.). That way the graph is a
    true partition: each page is counted in exactly one category.

    If we didn't, UTF-8 pages would amount to > 65%, but Latin-1/cp1252
    pages would then be > 40%, SJIS > 20%, and so on. The totals would
    exceed 100%, which would be misleading.
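
To illustrate the partition point, here is a minimal Python sketch. It is not the detection algorithm used for the graph (that one is not described in this thread); `classify`, `valid_in`, and the three sample pages are hypothetical names and data chosen only to show why counting "every encoding a page is valid in" overlaps, while an ASCII-first bucketing does not.

```python
from collections import Counter


def classify(page: bytes) -> str:
    """Assign a page to exactly one bucket, so the buckets form a partition."""
    if all(b < 0x80 for b in page):
        return "ascii"          # pure ASCII: valid in most web encodings
    try:
        page.decode("utf-8")
        return "utf-8"          # non-ASCII bytes that form valid UTF-8
    except UnicodeDecodeError:
        return "other"          # some legacy encoding (not identified here)


def valid_in(page: bytes, encoding: str) -> bool:
    """True if the bytes decode without error in the given encoding."""
    try:
        page.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: one ASCII page, one UTF-8 page, one Latin-1 page.
pages = [b"hello", "caf\u00e9".encode("utf-8"), "caf\u00e9".encode("latin-1")]

# Disjoint buckets: the counts add up to the number of pages.
partition = Counter(classify(p) for p in pages)

# Overlapping counts: a page is counted once per encoding it is valid in,
# so the ASCII page is counted under both UTF-8 and Latin-1.
overlap = Counter(
    enc for p in pages for enc in ("utf-8", "latin-1") if valid_in(p, enc)
)
```

With the overlapping count, the per-encoding totals sum to more than the number of pages, which is the effect described above for the > 65% / > 40% / > 20% figures.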

    Mark

    On Fri, Jan 29, 2010 at 08:44, karl williamson <public@khwilliamson.com> wrote:
    > Mark Davis ☕ wrote:
    >>
    >> FYI, they managed to use the larger image before most people saw it.
    >>
    >> Mark
    >>
    >>
    >>
    >> On Fri, Jan 29, 2010 at 07:06, Mark Davis ☕ <mark@macchiato.com> wrote:
    >>>
    >>> It is encodings determined by a detection algorithm. The declarations
    >>> for encodings (and language) are far too unreliable to be depended on.
    >>> The detection algorithm itself is fairly complex, but quite fast and
    >>> compact.
    >>>
    >>> Mark
    >>>
    >>>
    >>>
    >>> On Thu, Jan 28, 2010 at 21:38, Simon Montagu <smontagu@smontagu.org>
    >>> wrote:
    >>>>
    >>>> On 28/01/2010 10:50, Mark Davis ☕ wrote:
    >>>>>
    >>>>> There's a blog on Unicode that people may find interesting:
    >>>>> http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
    >>>>>
    >>>>> (The graph on Unicode is too small; until they get that fixed, I have
    >>>>> the large one on http://www.macchiato.com/)
    >>>>>
    >>>>> Mark
    >>>>
    >>>> What exactly is this counting? Encodings declared internally in
    >>>> web-pages?
    >>>> Encodings declared in HTTP headers? Encodings determined by
    >>>> auto-detection?
    >>>> Some combination of the above?
    >>>>
    >>>> --
    >>>> Simon Montagu
    >>>> Mozilla internationalization
    >>>> סיימון מונטגיו
    >>>>
    >>>>
    >>
    >>
    >
    > Since ASCII is a proper subset of utf8, this means effectively that 2/3 of
    > the web is using utf8; up from about 57% in 2001.  So the sum of the two has
    > a much shallower slope.
    >
    > Since the two are distinguished, I'm guessing that many more web pages have
    > at least one non-ascii character on them than there used to be??
    >
    >


