Re: Japanese EUC and Shift-JIS text samples?

From: Yung-Fong Tang (ftang@netscape.com)
Date: Mon Oct 04 1999 - 19:35:20 EDT


Frank da Cruz wrote:

> > HTML could also be treated as plain text from the converter's point of
> > view, right?
> >
> In a way, but it has a lot of ASCII characters that would not normally
> be found in Japanese text. Plain Japanese text without markup would be
> better.

Plain text is better if you care about character frequency and performance.
It makes no difference if you only care about the correctness of the
conversion.

> Maybe some kind of "full text" archive of literature such as
> we have in the USA at university libraries?

Another thing you can do (it takes a lot of manual work): view the page with
Netscape, select "Save As", and when the "Save As" dialog box shows up,
select "Plain Text". It will strip out the HTML markup for you.
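
For anyone who would rather script this than save pages by hand, here is a minimal sketch using only the Python standard library. It is not how Netscape's "Plain Text" export works. The names TextExtractor and html_to_plain are made up for illustration, and the sketch assumes you already know the page's charset.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects character data, skipping script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_plain(raw: bytes, encoding: str) -> bytes:
    """Decode raw HTML, drop the markup, re-encode in the same charset."""
    parser = TextExtractor()
    parser.feed(raw.decode(encoding))
    parser.close()
    return "".join(parser.chunks).encode(encoding)


# Assumed sample input; a real test would read the bytes of a saved page.
sample = "<html><body><p>日本語の<b>テキスト</b></p></body></html>".encode("shift_jis")
plain = html_to_plain(sample, "shift_jis")
```

The result is still Shift_JIS bytes, so it can be fed straight to the converter under test. Pass "euc_jp" instead for an EUC-JP page.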

>
>
> How about newsgroup archives? (I think JIS-7 is used for newsgroups?
> Or ISO 2022-JP?)

ISO-2022-JP IS the JIS-7.
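
A small sketch of the 7-bit property, assuming Python's built-in "iso2022_jp" and "shift_jis" codecs. It only shows that ISO-2022-JP output stays below 0x80 and switches character sets with escape sequences; it says nothing about which name people use for the encoding.

```python
text = "日本語 text"

iso = text.encode("iso2022_jp")
sjis = text.encode("shift_jis")

# Every ISO-2022-JP byte fits in 7 bits, so it survives 7-bit mail/news paths.
iso_is_seven_bit = all(byte < 0x80 for byte in iso)

# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
iso_starts_with_kanji_escape = iso.startswith(b"\x1b$B")
iso_returns_to_ascii = b"\x1b(B" in iso

# Shift_JIS, by contrast, uses bytes with the high bit set for the same text.
sjis_has_high_bytes = any(byte >= 0x80 for byte in sjis)
```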

>
> > http://home.netscape.com/ja for Shift_JIS
> > http://www.yahoo.co.jp/ for EUC-JP
> >
> Either my Shift-JIS parser is wrong, or none of these web pages has
> any halfwidth Katakana.

I don't think halfwidth Katakana are used frequently on web sites. In
theory, any Katakana could be written in its halfwidth form, but usually only
GUI menus use them, to save some screen space.
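
To check whether a Shift_JIS page really contains no halfwidth Katakana, a rough scanner like the one below may help. It is a sketch, not a full parser: it assumes well-formed input and the usual Shift_JIS byte layout (lead bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character; single bytes 0xA1-0xDF are halfwidth Katakana). The function name is made up.

```python
def count_halfwidth_katakana_sjis(data: bytes) -> int:
    """Count single-byte halfwidth Katakana (0xA1-0xDF) in Shift_JIS bytes."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            # Lead byte of a two-byte character: skip it and its trail byte,
            # so a trail byte in 0xA1-0xDF is not miscounted.
            i += 2
        else:
            if 0xA1 <= b <= 0xDF:
                count += 1
            i += 1
    return count


halfwidth = count_halfwidth_katakana_sjis("ｶﾀｶﾅ".encode("shift_jis"))
fullwidth = count_halfwidth_katakana_sjis("カタカナ".encode("shift_jis"))
mixed = count_halfwidth_katakana_sjis("日本語ｶﾅ abc".encode("shift_jis"))
```

In EUC-JP the same characters are two bytes each: 0x8E followed by a byte in 0xA1-0xDF, so a scanner for that encoding would look for the 0x8E prefix instead.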

> But since I have only a USA Windows-95 PC
> with Netscape for viewing, all I see is little boxes anyway, so I have
> no way of knowing what I'm looking at (even if I *could* read
> Japanese :-)

Come on. You can read Japanese pages on your US Windows with Netscape. That
support has been there for years. Read:
How to View Chinese/Japanese/Korean HTML with Netscape Communicator on US
Win95 and NT (NT3.51 included)
http://people.netscape.com/ftang/communicatorfont.html

>
>
> Thanks!
>
> - Frank




