There is nothing peculiar to HTML here. This is a
characteristic of all ASCII-based tagging schemes. The
point is that the *entire* text, including the tags, is
in the native character set. In other words, the text,
with markup, is still plain text. There is no character
set shift involved.
This works because the native character sets are cleverly
devised to always include ASCII (or at least the subset
of the ASCII repertoire used for tagging) in their repertoires.
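To make the point concrete, here is a minimal sketch (not anyone's production code) of how a scanner can walk a Shift-JIS byte stream and keep ASCII tag bytes separate from double-byte characters. The only fact it relies on is that Shift-JIS lead bytes fall in 0x81-0x9F and 0xE0-0xFC, ranges that never overlap ASCII; the function and variable names are my own.

```python
def is_sjis_lead(b: int) -> bool:
    # Shift-JIS double-byte characters begin with a byte in
    # 0x81-0x9F or 0xE0-0xFC; ASCII bytes never fall in either range.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def split_chars(data: bytes):
    """Yield one bytes object per character in a Shift-JIS stream."""
    i = 0
    while i < len(data):
        if is_sjis_lead(data[i]) and i + 1 < len(data):
            yield data[i:i + 2]   # double-byte character (kanji, etc.)
            i += 2
        else:
            yield data[i:i + 1]   # ASCII tag byte or halfwidth kana
            i += 1

# An ASCII tag followed by the double-byte bytes from the example
# discussed below (0x88EA 0x9553 0x897E).
sample = b"<b>" + bytes.fromhex("88EA9553897E")
chars = list(split_chars(sample))
```

Because `<`, `b`, and `>` are ASCII and can never be Shift-JIS lead bytes, the tag comes out as three single-byte characters and the Japanese text as three two-byte characters; no character-set shift marker is needed.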
> I've got a question about HTML that I hope someone can help me with.
> When html contains native encoding of
> text, the <tags> are in ascii, it appears. How does a program
> (that wants to translate the native stuff to another encoding) find
> the native encoding stuff amidst the tags? I.e. it looked like
> the HTML was like:
^3E 88EA 9553 897E (">ippyakuen") in Japanese
> where XX, YY, and ZZ were 2-byte quantities, ie. some wide character
> encoding. Now the real question - how does it know where the encoding
> starts? E.g. if the HTML contained:
> <b><anothertag> XXYYZZ
^3E 20 88EA 9553 897E ("> ippyakuen") in Japanese
How you find and interpret the 0x20 as a SPACE character, and not
as part of the following two-byte characters, is exactly how you would
do so when interpreting any Shift-JIS text (in this case), or any other
native character set, SBCS or DBCS, or UTF-8 or Unicode, for that matter.
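Assuming the stream really is Shift-JIS, this can be checked with Python's built-in `shift_jis` codec, which applies the same lead-byte rules: 0x20 is never a lead byte, so it always decodes as an ordinary SPACE regardless of what follows it.

```python
# The byte sequence from the example above: ">", a space, then
# three double-byte characters (0x88EA 0x9553 0x897E).
raw = bytes.fromhex("3E2088EA9553897E")
text = raw.decode("shift_jis")
# Two single-byte ASCII characters plus three double-byte ones:
# five characters in all, with the 0x20 intact as a SPACE.
assert text.startswith("> ")
assert len(text) == 5
```

So a conversion program never confuses " X" for a wide character: 0x20 can only ever be a single-byte SPACE in Shift-JIS.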
> would the extra space after the end of the tag mean that the first
> "wide" character in that input would be " X" (and the second "XY")?
> It's not clear how a conversion program (that needs to skip the tags,
> which appear always to be in ASCII, and convert the text) can find
> the text.
> Chuck Wrobel
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT