Re: Amusing use of Unicode

From: John Burger (john@mitre.org)
Date: Fri Mar 13 2009 - 14:00:39 CST

    Clark S. Cox III wrote:

    >> In UTF8, the latter two are exactly the same size, 17 bytes, so
    >> tinyarro doesn't save you any space in Twitter, say.
    >
    > Yes it does. Twitter doesn't count bytes, it counts characters. The
    > '➡' counts as a single character towards the 140 character limit:
    >
    > <http://twitter.com/clarkcox/status/1323106411>

    Cool! This may have changed recently; see this pronouncement from the
    development team in January:

    http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa

    You tweeted using the web interface, which apparently worked, but
    there are certainly other Twitter clients that don't know about
    UTF-8 in particular, and (perhaps unnecessarily) truncate after
    140 =bytes=. Some SMS services (whence Twitter got its 140-whatever
    limit) transfer non-ASCII text in UTF-16, so the limit there is 70
    Unicode characters (140 bytes at two bytes each).
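
    To make the characters-versus-bytes distinction concrete, here is a
    rough Python sketch of the three ways a client might count. This is
    only an illustration under my own assumptions, not what Twitter or
    any particular client actually runs; the names sizes() and
    truncate_utf8() are made up for this example.

```python
def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for text."""
    code_points = len(text)                            # what a character counter sees
    utf8_bytes = len(text.encode("utf-8"))             # what a byte counter sees
    utf16_units = len(text.encode("utf-16-be")) // 2   # 16-bit units, as in the SMS case
    return code_points, utf8_bytes, utf16_units


def truncate_utf8(text, max_bytes=140):
    """Cut text to at most max_bytes of UTF-8 without splitting a character.

    Hypothetical helper: a byte-limited client could do this instead of
    chopping blindly in the middle of a multi-byte sequence.
    """
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops an incomplete multi-byte sequence left at the cut.
    return raw.decode("utf-8", errors="ignore")


arrow = "\u27a1"                  # the arrow character discussed in this thread
tweet = "a" * 139 + arrow         # 140 characters, but more than 140 bytes

print(sizes(arrow))
print(sizes(tweet))
print(len(truncate_utf8(tweet)))
```

    The arrow is one code point and one UTF-16 unit but three UTF-8
    bytes, so a 140-character tweet ending in it fits a character limit
    and overflows a 140-byte limit; the byte-limited truncation then
    drops the arrow entirely.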

    On a related note, there are also apparently some bugs in the way the
    Twitter backend stores text, such that sometimes tweets get truncated
    after the fact, as the data migrates deeper into their backing store:

    http://groups.google.com/group/twitter-development-talk/browse_thread/thread/9d9d16d55e2e1e67

    You may want to check that tweet in a few days' time to see if the
    arrow is still there.

    Isn't i18n fun?

    - John D. Burger
       MITRE


