RE: detecting encoding in plain text (related to utf8)

From: Deepak Chand Rathore (deepakr@aztec.soft.net)
Date: Wed Jan 14 2004 - 01:21:25 EST

Next message: D. Starner: "Re: Detecting encoding in Plain text"

Previous message: Don Osborn: "Re: New MS Mac Office and Unicode?"
Next in thread: Doug Ewell: "Re: detecting encoding in plain text (related to utf8)"
Reply: Doug Ewell: "Re: detecting encoding in plain text (related to utf8)"
Reply: Markus Scherer: "Re: detecting encoding in plain text (related to utf8)"

Hi all,

It is great to hear so many views on detecting encodings.
I would also like to share something related to detecting UTF-8.
As most of you know, we can check whether a stream of bytes is UTF-8 by
testing that every byte sequence in it matches one of the following patterns.
If any sequence does not match, we simply consider the stream not to be UTF-8.

Unicode range              UTF-8 encoded bytes
U-00000000 - U-0000007F:   0xxxxxxx
U-00000080 - U-000007FF:   110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:   1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
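
The check described above can be sketched in a few lines of Python. This is
only an illustration of the table, not a complete validator: the function name
looks_like_utf8 is made up for this example, and it tests the bit patterns
only. It does not reject overlong forms or surrogates, and it accepts the
5- and 6-byte forms from the table even though the current UTF-8 definition
(RFC 3629) stops at U+10FFFF, i.e. at 4 bytes.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence matches a pattern from the table.

    Hypothetical helper for illustration; it checks structure only.
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx
            trail = 0
        elif lead >> 5 == 0b110:         # 110xxxxx
            trail = 1
        elif lead >> 4 == 0b1110:        # 1110xxxx
            trail = 2
        elif lead >> 3 == 0b11110:       # 11110xxx
            trail = 3
        elif lead >> 2 == 0b111110:      # 111110xx
            trail = 4
        elif lead >> 1 == 0b1111110:     # 1111110x
            trail = 5
        else:
            # 10xxxxxx in lead position, or 0xFE / 0xFF
            return False
        if i + trail >= n:
            return False                 # sequence is cut off at end of data
        for k in range(1, trail + 1):
            if data[i + k] >> 6 != 0b10: # every trail byte must be 10xxxxxx
                return False
        i += trail + 1
    return True
```

For example, the Latin-1 bytes E9 74 E9 fail the check because E9 announces a
3-byte sequence but is followed by a plain ASCII byte. Note that passing the
check only means the data is consistent with UTF-8; pure ASCII passes too.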
Similarly, using the above principle, we can write our own functions that
convert wide characters to UTF-8 and vice versa.
I believe this will work (am I right?).
This approach should help because we do not have to rely on the library: for
example, some UTF-8 functions require the locale to be set to an xxx.UTF-8
locale, which creates a dependency on that locale.
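
A locale-independent conversion along these lines might look like the sketch
below. It is written in Python for brevity and works on integer code points
rather than wchar_t; the names encode_code_point and decode_utf8 are invented
for this example. It follows the table literally (up to 6 bytes) and, as an
assumption to keep it short, does not reject overlong forms or surrogates.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point using the bit patterns from the table."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        return bytes([cp])               # 0xxxxxxx
    # (upper limit of range, number of trail bytes, lead-byte marker)
    for limit, trail, marker in (
        (0x7FF, 1, 0xC0),
        (0xFFFF, 2, 0xE0),
        (0x1FFFFF, 3, 0xF0),
        (0x3FFFFFF, 4, 0xF8),
        (0x7FFFFFFF, 5, 0xFC),
    ):
        if cp <= limit:
            out = [marker | (cp >> (6 * trail))]
            for k in range(trail - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * k)) & 0x3F))
            return bytes(out)
    raise ValueError("code point out of range")


def decode_utf8(data: bytes) -> list:
    """Decode bytes into a list of code points; raise ValueError if malformed."""
    cps = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            trail, cp = 0, lead
        elif lead >> 5 == 0b110:
            trail, cp = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:
            trail, cp = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:
            trail, cp = 3, lead & 0x07
        elif lead >> 2 == 0b111110:
            trail, cp = 4, lead & 0x03
        elif lead >> 1 == 0b1111110:
            trail, cp = 5, lead & 0x01
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        chunk = data[i + 1 : i + 1 + trail]
        if len(chunk) != trail or any(b >> 6 != 0b10 for b in chunk):
            raise ValueError("bad or missing trail byte after offset %d" % i)
        for b in chunk:
            cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits per trail byte
        cps.append(cp)
        i += 1 + trail
    return cps
```

With these, encode_code_point(0x20AC) gives the three bytes E2 82 AC, and
decode_utf8 turns them back into [0x20AC], with no locale involved.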

But there is one concern. In some cases the UTF-8 byte stream starts with a
BOM. For example, when we read bytes from a text file saved with Notepad
(using the UTF-8 option) on Windows 2000, the actual text starts only after the
first few bytes (I suppose the first 3 bytes).
So how do we detect whether the byte stream starts with a BOM?
In other words, do the first few bytes represent a BOM or the actual text?
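
One possible answer, as a sketch: in UTF-8 the BOM is U+FEFF, which encodes to
the three bytes EF BB BF, so it is enough to compare the first three bytes of
the stream. The helper name split_utf8_bom is made up for this example, and
treating a leading U+FEFF as a signature rather than as text is a convention
this sketch assumes, not something the bytes themselves can prove.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def split_utf8_bom(data: bytes):
    """Return (had_bom, rest), where rest is data without a leading BOM.

    Hypothetical helper: a leading EF BB BF is assumed to be a signature,
    not part of the actual text.
    """
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data
```

In Python specifically, decoding with the built-in "utf-8-sig" codec has the
same effect: it skips one leading BOM if present and otherwise decodes the
data as ordinary UTF-8.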

with regards
( DC )
deepak chand rathore


This archive was generated by hypermail 2.1.5 : Wed Jan 14 2004 - 02:08:43 EST