RE: detecting encoding in plain text (related to utf8)

From: Deepak Chand Rathore (deepakr@aztec.soft.net)
Date: Wed Jan 14 2004 - 01:21:25 EST


    Hi all,

    It is great to hear so many views on detecting encodings.
    I would also like to share something related to detecting UTF-8.
    As most of you probably know, we can check any stream of bytes for UTF-8
    by testing whether every byte in it belongs to one of the following
    byte sequences. If some byte does not fit, we simply consider the stream
    not to be UTF-8.

    Unicode range              UTF-8 encoded bytes
    U-00000000 - U-0000007F:   0xxxxxxx
    U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
    U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
    U-00010000 - U-001FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-00200000 - U-03FFFFFF:   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-04000000 - U-7FFFFFFF:   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    (Note that RFC 3629 limits UTF-8 to U+10FFFF, so the 5- and 6-byte forms
    in the last two rows are no longer valid UTF-8.)
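
    The check described above can be sketched roughly as follows. This is only
    a structural test against the bit patterns in the table, not a full
    validator: it does not reject overlong forms or surrogates, and pure ASCII
    passes as well. The function name looks_like_utf8 and the max_len
    parameter are my own choices for illustration; max_len defaults to 4
    because current UTF-8 stops at 4-byte sequences.

```python
def looks_like_utf8(data: bytes, max_len: int = 4) -> bool:
    """Return True if every byte fits one of the lead/continuation patterns."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # Classify the lead byte by its high bits to get the sequence length.
        if b < 0x80:                   # 0xxxxxxx
            length = 1
        elif b >> 5 == 0b110:          # 110xxxxx
            length = 2
        elif b >> 4 == 0b1110:         # 1110xxxx
            length = 3
        elif b >> 3 == 0b11110:        # 11110xxx
            length = 4
        elif b >> 2 == 0b111110:       # 111110xx (old 5-byte form)
            length = 5
        elif b >> 1 == 0b1111110:      # 1111110x (old 6-byte form)
            length = 6
        else:
            return False               # stray 10xxxxxx byte, or 0xFE / 0xFF
        # Reject sequences that are too long or cut off at the end of the data.
        if length > max_len or i + length > n:
            return False
        # Every following byte must be a continuation byte 10xxxxxx.
        for c in data[i + 1:i + length]:
            if c >> 6 != 0b10:
                return False
        i += length
    return True
```

    For example, looks_like_utf8("é".encode("utf-8")) is True, while the
    Latin-1 byte b"\xe9" on its own fails because the expected continuation
    bytes are missing.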
    Similarly, using the above principle, we can write our own functions that
    convert wide characters to UTF-8 and vice versa.
    As far as I can tell, this will work. (Am I right?)
    This approach should help because we don't have to rely on the library (for
    example, some UTF-8 functions require the locale to be set to an xxx.UTF-8
    locale, which creates a dependency on that locale).
    But there is one concern. In some cases the UTF-8 byte stream starts with a
    BOM. For example, when we read bytes from a text file that was saved with
    Notepad (using the UTF-8 option) on Windows 2000, the actual text starts
    only after the first few bytes (I suppose the first 3 bytes).
    So how do we detect whether the byte stream starts with a BOM?
    In other words, do the first few bytes represent a BOM or the actual text?
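
    For what it is worth, the UTF-8 form of the BOM (U+FEFF) is the fixed
    three-byte sequence EF BB BF, so one possible check is simply to compare
    the first three bytes against it. The sketch below assumes that a leading
    U+FEFF is always meant as a signature and never as real text, which is the
    usual convention but still an assumption; strip_utf8_bom is a name I made
    up for illustration.

```python
import codecs
from typing import Tuple

# codecs.BOM_UTF8 is the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, data without the leading BOM if one was present)."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data
```

    In Python the "utf-8-sig" codec does the same thing while decoding:
    b"\xef\xbb\xbfhi".decode("utf-8-sig") gives "hi", and input without a BOM
    is decoded unchanged.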

    with regards
    ( DC )
    deepak chand rathore


