Re: "Universal Character Set"

From: Asmus Freytag
Date: Sat Feb 17 2007 - 19:32:56 CST

    On 2/17/2007 2:55 PM, Mark Davis wrote:
    > At least for English speakers, I've found a strong anecdotal
    > correlation between those who say UCS or ISO 10646 and those who say
    > "octet" instead of byte.
    > 73,600,000 for byte
    > 7,650,000 for octet
    > As with your case, the problem is separating out the non-computer usage.
    > 29,300,000 for byte computer
    > 1,030,000 for octet computer
    > Mark
    And 14 million for "byte" plus "bit"
    and 1 million for "octet" plus "bit"
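As a rough check on the figures quoted so far, the byte-to-octet ratios can be worked out directly. This is only a sketch: the counts are the rounded hit numbers from this thread, the query labels are shorthand for the searches described above, and the helper name `byte_to_octet_ratio` is made up for illustration.

```python
# Hit counts as quoted in this thread (Google, February 2007).
# The last pair ("14 million" / "1 million") is already rounded in the text.
hits = {
    "plain": {"byte": 73_600_000, "octet": 7_650_000},
    "with computer": {"byte": 29_300_000, "octet": 1_030_000},
    "with bit": {"byte": 14_000_000, "octet": 1_000_000},
}


def byte_to_octet_ratio(counts):
    """Return how many 'byte' hits there are per 'octet' hit."""
    return counts["byte"] / counts["octet"]


ratios = {label: byte_to_octet_ratio(c) for label, c in hits.items()}

for label, ratio in ratios.items():
    print(f"{label}: byte:octet is about {ratio:.1f}:1")
```

On these numbers the ratio is a little under 10:1 for the bare terms, about 28:1 when "computer" is added, and 14:1 when "bit" is added.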

    But wait, the story is more complex than that.

    Searching for 10646 together with either "octet" or "byte" gives a 1:1
    ratio of usage.
    Searching for UCS together with either "octet" or "byte" gives a 1:2
    ratio of usage.
    Searching for "character" together with either "octet" or "byte" gives a
    1:3 ratio of usage.
    Searching for Unicode together with either "octet" or "byte" gives an
    almost 1:4 ratio of usage.

    A significant fraction of all of these must use both octet and byte,
    because searching for both of them gives a number of hits that is
    relatively high compared to the number of hits for any of the
    combinations.

    Interestingly enough, the share of non-English pages using octet/bit is
    about 10%, while for byte/bit the overwhelming majority of hits are from
    non-English sources. For languages the numbers do not add up; some
    documents must be classified as belonging to multiple (or all)
    languages, but the huge ratio in the case of byte/bit is interesting.

    > On 2/17/07, Asmus Freytag wrote:
    > On 2/17/2007 9:58 AM, Don Osborn wrote:
    > >
    > > Does anyone currently use the term "Universal Character Set" (UCS)
    > > to refer to Unicode/ISO-10646? I guess it is technically correct,
    > > but I rarely see it. It seems that folks generally use "Unicode" as
    > > the catch-all term, or maybe I'm missing a wider use of UCS?
    > >
    > I believe your observation about "Unicode" being the common label is
    > to the point. A bit of research is illuminating and might explain some
    > of the reasons why the term has caught on.
    > There are about 33 million pages indexed on Google that can be
    > retrieved by a search for "Unicode" and about 111,000 by a search for
    > "Universal character set". If you subtract all pages that mention
    > 10646 or Unicode or UCS, that number drops to 1/10th for the latter.
    > If you similarly subtract the other terms from the search for Unicode,
    > there's hardly a reduction in number.
    > What that means is that "universal character set" is probably most
    > often used as a descriptor, as in "Unicode is a universal character
    > set", and not as a label. The common label is clearly "Unicode".
    > That's not surprising, because Unicode as a label has the advantage of
    > being shorter and clearly referring to a specific character set.
    > In the case of UCS as a label, you run into the problem that the
    > letters UCS are not unique. Google will pull up the Union of Concerned
    > Scientists, UCS Inc., University College School and a number of others
    > on the first screen (and also helpfully suggest that you really meant
    > USC). Trading distinctiveness for brevity is apparently not a clear
    > win, and the use of UCS (in all meanings) is barely 1/6th of that for
    > Unicode. If you search for UCS together with 10646 or Unicode to sift
    > out when UCS might have been used in the context of character sets,
    > you find only about 800K links, which only emphasizes the issue with
    > the multiple meanings of UCS.
    > 10646 by itself gives about 4.5 million hits, of which fully 1/3 don't
    > mention ISO, but are in reference to part numbers or are otherwise
    > false positives. Based on that you can conclude that 10646 is used as
    > a designator of the character set about 1/10th as often as Unicode.
    > There are instances where referring to Unicode is the only correct
    > choice, for example when referring to Unicode Normalization Forms, the
    > Unicode Bidi Algorithm, Unicode Line Breaking, and the myriad other
    > specifications that have been developed or are being developed by the
    > Unicode Consortium around the character set and its collection of
    > character properties.
    > A./
    > --
    > Mark
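
The "about 1/10th as often as Unicode" estimate in the quoted message can be retraced from the figures it gives. This is a sketch only: the variable names are invented here, and it assumes that exactly one third of the 10646 hits are false positives and that the 33 million Unicode hits need no correction, as the message suggests.

```python
# Figures as stated in the quoted message (Google, February 2007).
unicode_hits = 33_000_000
iso10646_hits = 4_500_000

# Assumption from the message: "fully 1/3" of the 10646 hits are false
# positives (part numbers and the like), so discard that third.
false_positives = iso10646_hits // 3
relevant_10646_hits = iso10646_hits - false_positives

# Share of 10646 usage relative to Unicode usage.
ratio = relevant_10646_hits / unicode_hits

print(f"relevant 10646 hits: {relevant_10646_hits:,}")
print(f"10646 relative to Unicode: {ratio:.3f} (about 1/{round(1 / ratio)})")
```

With these inputs the ratio comes out at 3 million to 33 million, or 1/11, which is consistent with the message's "about 1/10th".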
