The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Fri Feb 15, 2013 4:32 pm 

Joined: Fri Feb 15, 2013 3:13 pm
Posts: 3
I have generated a mapping for CP870 (aka IBM870) that I'd like to donate to Unicode.org so it can be put here:

ftp://ftp.unicode.org/Public/MAPPINGS/

How do I go about doing that? I actually may have a whole bunch of mapping files to donate.

(Note: I tried to attach the file but .txt files are not allowed, apparently! LOL)


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Fri Feb 15, 2013 5:53 pm 
Site Admin

Joined: Mon Nov 30, 2009 5:16 pm
Posts: 137
Location: root@unicode.org
IBM maintains their own official mappings for their codepages, and we don't host any of them here. Please see:
http://www.unicode.org/Public/MAPPINGS/ ... sions.html


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Fri Feb 15, 2013 7:41 pm 

Joined: Fri Feb 15, 2013 3:13 pm
Posts: 3
Yeah, when I was browsing the FTP server I found that document. It has links to a number of places that are completely useless. "IBM has C/C++ and Java libraries for every character encoding known to man." Gee whiz! Where are they? I searched all over the IBM website and couldn't find squat. Of course, a C, C++, or Java library would be completely useless to me anyway.

Here's why I want to give unicode.org some (FREE) encoding .TXT files: The Python language uses the encoding map files at ftp://ftp.unicode.org/Public/MAPPINGS/ to automatically generate its built-in character encoding files. So by placing a handful of additional .TXT files on ftp.unicode.org we can (widely) expand support for all kinds of character encodings for hundreds of thousands of software programs.
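
For readers unfamiliar with the files in question, here is a minimal sketch of how a single-byte table in the MAPPINGS .TXT layout could be turned into a Python decoding table. It is not the generator Python itself uses. The sample rows and the names `parse_mapping_txt`, `decode`, and `encode` are hypothetical, made up for illustration only.

```python
import codecs

# Hypothetical excerpt in the MAPPINGS .TXT layout (not a real published table):
# column 1 = byte value (hex), column 2 = Unicode code point (hex), then a comment.
SAMPLE = """\
# Illustrative rows only
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x42\t0x0042\t# LATIN CAPITAL LETTER B
0x80\t\t# UNDEFINED
"""

def parse_mapping_txt(text):
    """Return a 256-character decoding table for a single-byte mapping."""
    table = ["\ufffe"] * 256  # U+FFFE marks an undefined byte for charmap_decode
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a Unicode assignment
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return "".join(table)

decoding_table = parse_mapping_txt(SAMPLE)
# Reverse map (code point -> byte), skipping undefined slots.
encoding_map = {ord(ch): b for b, ch in enumerate(decoding_table) if ch != "\ufffe"}

def decode(data):
    return codecs.charmap_decode(data, "strict", decoding_table)[0]

def encode(text):
    return codecs.charmap_encode(text, "strict", encoding_map)[0]
```

With a complete table, the same two tables are roughly what a generated encoding module would wrap in a codec class.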

A single, simple action of copying a few .TXT files could make the world a better place. So how about it? Is there someone I can email the files to? I can put them up on the web myself but ftp.unicode.org makes the most sense.


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Fri Feb 15, 2013 9:15 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
If the Python language picks up those mappings from that location then I can see a discussion centered around that as being useful.

If the Python language community were to express a desire to have a place where they can publicly deposit additional mappings expressed as TXT files, then I think the Unicode Consortium should engage them.

I do think your point about libraries versus flat text tables is well taken. These are not directly substitutable one for another, even if the use of libraries (for applications written in a relevant programming language) would of course be preferable. Unicode's move originally was aimed at undercutting the tendency for every programmer to write their own mapping, but since in many cases the mapping information also informs the user about the source and nature of the Unicode character, I was never that comfortable with losing the documentation that way.

However, many tables were poorly maintained and had no known user community that could guarantee that they were correct. IBM sensibly didn't want to get into the middle of helping maintain files on a Unicode server.

So yes, just because something is free doesn't mean it doesn't come with issues. To move this forward would require support from the wider Python language community.


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Sat Feb 16, 2013 1:20 am 

Joined: Mon Feb 21, 2011 5:35 pm
Posts: 6
Hello!

liftoff wrote:
Yeah, when I was browsing the FTP server I found that document. It has links to a number of places that are completely useless. "IBM has C/C++ and Java libraries for every character encoding known to man." Gee whiz! Where are they? I searched all over the IBM website and couldn't find squat. Of course, a C, C++, or Java library would be completely useless to me anyway.


Did you click the ICU link? It will direct you to http://icu-project.org. I'm not sure why it would be useless to you, within the ICU site you can find:

Perhaps the FTP site link should link to the ICU converter explorer. I did have trouble finding the CDRA downloads; I will find out what happened to them.

liftoff wrote:
Here's why I want to give unicode.org some (FREE) encoding .TXT files: The Python language uses the encoding map files at ftp://ftp.unicode.org/Public/MAPPINGS/ to automatically generate its built-in character encoding files. So by placing a handful of additional .TXT files on ftp.unicode.org we can (widely) expand support for all kinds of character encodings for hundreds of thousands of software programs.

A single, simple action of copying a few .TXT files could make the world a better place. So how about it? Is there someone I can email the files to? I can put them up on the web myself but ftp.unicode.org makes the most sense.


I would suggest instead of reading those .TXT files, either using TR#22, or else ICU UCM format. TR#22 was designed to provide an XML interchange format for mapping tables.

Hope this helps,
Steven (IBM and ICU project)


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Tue Feb 19, 2013 10:35 am 

Joined: Fri Feb 15, 2013 3:13 pm
Posts: 3
srloomis wrote:
Hello!

liftoff wrote:
Yeah, when I was browsing the FTP server I found that document. It has links to a number of places that are completely useless. "IBM has C/C++ and Java libraries for every character encoding known to man." Gee whiz! Where are they? I searched all over the IBM website and couldn't find squat. Of course, a C, C++, or Java library would be completely useless to me anyway.


Did you click the ICU link? It will direct you to http://icu-project.org. I'm not sure why it would be useless to you, within the ICU site you can find:

Perhaps the FTP site link should link to the ICU converter explorer. I did have trouble finding the CDRA downloads; I will find out what happened to them.

liftoff wrote:
Here's why I want to give unicode.org some (FREE) encoding .TXT files: The Python language uses the encoding map files at ftp://ftp.unicode.org/Public/MAPPINGS/ to automatically generate its built-in character encoding files. So by placing a handful of additional .TXT files on ftp.unicode.org we can (widely) expand support for all kinds of character encodings for hundreds of thousands of software programs.

A single, simple action of copying a few .TXT files could make the world a better place. So how about it? Is there someone I can email the files to? I can put them up on the web myself but ftp.unicode.org makes the most sense.


I would suggest instead of reading those .TXT files, either using TR#22, or else ICU UCM format. TR#22 was designed to provide an XML interchange format for mapping tables.

Hope this helps,
Steven (IBM and ICU project)


This was very helpful. I was able to (eventually) find the Subversion URL and check out the 'icu' repository. That information was surprisingly difficult to find. Clicking the "Source Code Repository" link on the front page of http://site.icu-project.org/ just takes you to the web-based (Trac) browser. The actual Subversion URLs should be easier to find.

Also, it took a very, very long time to check out the code. It would be wise to modernize and switch to using git, not just for performance reasons but also to gain all the benefits that come with distributed source control.

Lastly, the link to icu-project.org should be placed directly inside this:

http://www.unicode.org/Public/MAPPINGS/ ... sions.html

The path from that document to icu-project.org is a bit convoluted: You must first load this page:

http://www-01.ibm.com/software/globaliz ... index.html
(That link is titled, "International Components for Unicode (ICU)")

Way down at the bottom of that page is a tiny, three-character link to site.icu-project.org. There's really no point in sending people to that ibm.com page when all of the same information can be found on the direct link to site.icu-project.org. It is an unnecessary (and confusing) link-in-the-middle.

For reference, I'm almost done writing a script that converts .ucm files to .py encoding modules. To my amazement such a script hasn't already been written (couldn't find anything in Google). When I'm done with it I'll upload it to GitHub along with all the .ucm files pre-converted and ready-to-go. It will be a lot simpler to use than having to deal with the binary/compiled/not-compatible-with-Python3 PyICU module.
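
To give a rough idea of what such a conversion involves (this is only a sketch, not the finished script), the core step is reading the CHARMAP section of a .ucm file into a byte-to-character table. The sample data and the names `parse_ucm` and `ENTRY` are hypothetical. The sketch assumes a single-byte codepage and keeps only entries flagged `|0` (round-trip), ignoring fallback entries.

```python
import codecs
import re

# Hypothetical sample in ICU .ucm layout; the rows are illustrative only.
SAMPLE_UCM = r"""<code_set_name>  "hypothetical-sample"
<mb_cur_max>     1
CHARMAP
<U0041> \xC1 |0
<U0042> \xC2 |0
<U00A0> \x41 |0
<U0391> \xC1 |1
END CHARMAP
"""

# One CHARMAP row: <UXXXX> \xNN[\xNN...] |flag
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")

def parse_ucm(text):
    """Return {byte: character} for round-trip (|0) single-byte entries."""
    mapping = {}
    in_charmap = False
    for line in text.splitlines():
        line = line.strip()
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            break
        if not in_charmap:
            continue  # header lines such as <code_set_name> are skipped
        match = ENTRY.match(line)
        if not match or match.group(3) != "0":
            continue  # assumption: ignore fallback and other non-round-trip rows
        raw = bytes.fromhex(match.group(2).replace("\\x", ""))
        if len(raw) == 1:
            mapping[raw[0]] = chr(int(match.group(1), 16))
    return mapping

mapping = parse_ucm(SAMPLE_UCM)
# U+FFFE marks bytes with no assignment, as charmap_decode expects.
decoding_table = "".join(mapping.get(b, "\ufffe") for b in range(256))
decoded = codecs.charmap_decode(b"\xc1\xc2", "strict", decoding_table)[0]
```

Writing `decoding_table` out as a Python string literal inside a module template would be the remaining step.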


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Tue Feb 19, 2013 10:55 am 

Joined: Mon Feb 21, 2011 5:35 pm
Posts: 6
liftoff wrote:
Yeah, when I was browsing the FTP server I found that document. It has links to a number of places that are completely useless. "IBM has C/C++ and Java libraries for every character encoding known to man." Gee whiz! Where are they? I searched all over the IBM website and couldn't find squat. Of course, a C, C++, or Java library would be completely useless to me anyway.


So following the original link brings you to http://www.ibm.com/developerworks/views ... nloads.jsp and if you scroll down enough you get: "Character Data Conversion Tables... Character Data Conversion Tables provide code point mappings from a specified source code to a specified target code....". I had missed it the first time because I was looking for CDRA. If you unzip the nested zip files, you will get to IBM-870.zip which has mapping data for this codepage in IBM format.


 Post subject: Re: How to add a new encoding to ftp://ftp.unicode.org/?
PostPosted: Tue Feb 19, 2013 12:11 pm 

Joined: Mon Feb 21, 2011 5:35 pm
Posts: 6
liftoff wrote:
srloomis wrote:
Hello!


Did you click the ICU link? It will direct you to http://icu-project.org

...

I would suggest instead of reading those .TXT files, either using TR#22, or else ICU UCM format. TR#22 was designed to provide an XML interchange format for mapping tables.

Hope this helps,
Steven (IBM and ICU project)


This was very helpful. I was able to (eventually) find the Subversion URL and check out the 'icu' repository. That information was surprisingly difficult to find. Clicking the "Source Code Repository" link on the front page of http://site.icu-project.org/ just takes you to the web-based (Trac) browser. The actual Subversion URLs should be easier to find.


So I filed bug http://bugs.icu-project.org/trac/ticket/9949 and fixed that link so that it points to a repository help page (as the sidebar actually does) instead of directly to Trac. That whole section duplicates the sidebar, and should probably go. Feel free to reply on the bug if you have further comments. By the way, in the Trac browser, there's a "Subversion Location" link at the top that takes you to SVN. I filed another ticket, http://bugs.icu-project.org/trac/ticket/9950, about making commits against a ticket easier to find.

liftoff wrote:

Also, it took a very, very long time to check out the code.


Well, sure, if you check out the entire repo, including all subprojects, including each tagged version, which someone did a few hours ago (perhaps you). What SVN URL did you check out?

liftoff wrote:
It would be wise to modernize and switch to using git. Not just for performance reasons but also to gain all the benefits that come with distributed source control.


Yeah, I've come to not hate DVCS. But from where I sit, my challenge is migrating a pretty large and complicated set of codebases, including tools supporting all phases of release and change control management, automated build, client support, website updates, etc. It's nontrivial. I wrote the tooling which integrates commits with bugs and vice versa, and I ported it from previous code and bug systems (CVS/Jitterbug).

git-svn does seem promising; one possibility would be having a read-only mirror on the website that could be pulled from, maintained on the server side.

Sorry Sarasvati, I know this is way off topic for a Unicode forum. But all of the above (including the CVS/Jitterbug part) applies to CLDR and other Unicode tooling.

liftoff wrote:
Lastly, the link to icu-project.org should be placed directly inside this:

http://www.unicode.org/Public/MAPPINGS/ ... sions.html

The path from that document to icu-project.org is a bit convoluted: You must first load this page:

http://www-01.ibm.com/software/globaliz ... index.html
(That link is titled, "International Components for Unicode (ICU)")

Way down at the bottom of that page is a tiny, three-character link to site.icu-project.org. There's really no point in sending people to that ibm.com page when all of the same information can be found on the direct link to site.icu-project.org. It is an unnecessary (and confusing) link-in-the-middle.


The previous link is a page about IBM vendor mappings. The www-01 page is an IBM page. I don't see a problem with the page, but I will try to get that link made more obvious.

liftoff wrote:
For reference, I'm almost done writing a script that converts .ucm files to .py encoding modules. To my amazement such a script hasn't already been written (couldn't find anything in Google).


Had you considered taking UTR#22 format? That's at least a more standardized format, XML parseable.
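
As a rough illustration of what "XML parseable" buys you, here is a sketch that reads a UTR#22-style document with the standard library. The element and attribute names (`characterMapping`, `assignments`, `a` with `b` and `u`, `fub`) are written from memory of the report and should be checked against UTR#22 itself. The sample document and `parse_charmapml` are hypothetical, and a real file that declares an XML namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element/attribute names are an assumption to be
# verified against UTR#22 before relying on them.
SAMPLE_XML = """\
<characterMapping id="hypothetical-sample" version="1">
  <assignments sub="3F">
    <a b="C1" u="0041"/>
    <a b="C2" u="0042"/>
    <fub b="C1" u="0391"/>
  </assignments>
</characterMapping>
"""

def parse_charmapml(text):
    """Return {byte: character} for single-byte, single-code-point <a> rows."""
    root = ET.fromstring(text)
    table = {}
    for a in root.iter("a"):  # round-trip assignments only; fallbacks are skipped
        raw = bytes.fromhex(a.get("b"))
        code_points = a.get("u").split()
        if len(raw) == 1 and len(code_points) == 1:
            table[raw[0]] = chr(int(code_points[0], 16))
    return table

xml_table = parse_charmapml(SAMPLE_XML)
```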

Would be interesting if you can pass the ICU converter tests through your encoding modules, both from performance and compatibility.
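
The ICU converter tests themselves are not reproduced here, but a much simpler compatibility check along the same lines can be sketched: for every defined byte, decode it and encode it back, and report the bytes that do not survive the round trip. `roundtrip_failures` and `SAMPLE_TABLE` are hypothetical names, and the "first byte wins" rule for duplicate characters is an assumption of this sketch.

```python
import codecs

def roundtrip_failures(decoding_table):
    """Return the byte values that do not survive decode followed by encode."""
    encoding_map = {}
    for byte, ch in enumerate(decoding_table):
        if ch != "\ufffe":  # U+FFFE marks an undefined byte
            encoding_map.setdefault(ord(ch), byte)  # assumption: first byte wins
    failures = []
    for byte, ch in enumerate(decoding_table):
        if ch == "\ufffe":
            continue
        source = bytes([byte])
        decoded = codecs.charmap_decode(source, "strict", decoding_table)[0]
        encoded = codecs.charmap_encode(decoded, "strict", encoding_map)[0]
        if encoded != source:
            failures.append(byte)
    return failures

# Hypothetical table: 0x41 and 0x43 both decode to "A", so 0x43 cannot round-trip.
_slots = ["\ufffe"] * 256
_slots[0x41] = "A"
_slots[0x42] = "B"
_slots[0x43] = "A"
SAMPLE_TABLE = "".join(_slots)
failures = roundtrip_failures(SAMPLE_TABLE)
```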

liftoff wrote:
When I'm done with it I'll upload it to GitHub along with all the .ucm files pre-converted and ready-to-go. It will be a lot simpler to use than having to deal with the binary/compiled/not-compatible-with-Python3 PyICU module.


Please file an ICU bug so we can at least link to your program. Have you contacted the PyICU module maintainers (Andi)? I don't use it heavily, but I haven't had trouble getting PyICU to work on v2.x.

Thanks again
Steven

