Normalization Implementation Tricks

From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Wed Feb 11 2009 - 20:33:20 CST

    How do people efficiently implement the (re-)composition table used by
    the normalization algorithm for NFC and NFKC? (I am writing a
    library for a project.)

    The most naive implementation would be a table indexed by a starter
    character and a combining character. Of course that is completely
    unreasonable as it would require 0x110000 * 0x110000 entries (a few
    terabytes).
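    A back-of-the-envelope check of that claim (assuming one 4-byte
    result code point per table entry):

```python
# Size of the naive two-dimensional composition table:
# one entry per (starter, combiner) pair over the full code space.
CODE_SPACE = 0x110000              # number of Unicode code points
entries = CODE_SPACE * CODE_SPACE  # ~1.24e12 pairs
size_bytes = entries * 4           # 4 bytes per composed result
size_tib = size_bytes / 2**40
print(round(size_tib, 1))          # roughly 4.5 TiB
```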

    If I understand right, the ICU library uses shared tries (as the
    Unicode spec suggests) indexed by the starter character, pointing to
    lists of (combining character, result) pairs (an association list in
    Lisp/Scheme terminology). This should reduce the size requirements,
    but now there is a list we have to scan through, which could increase
    the run-time access cost.
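    A toy sketch of that pair-list idea (the two entries below are real
    canonical compositions, but the table itself is hypothetical; a real
    one would be built from UnicodeData.txt):

```python
# Map each starter code point to an association list of
# (combining code point, composed code point) pairs.
compose_table = {
    0x0041: [(0x0300, 0x00C0), (0x0301, 0x00C1)],  # A + grave -> À, A + acute -> Á
    0x004F: [(0x0302, 0x00D4)],                    # O + circumflex -> Ô
}

def compose(starter, combiner):
    # Linear scan of the starter's pair list -- the run-time cost
    # in question.  Returns None if the pair does not compose.
    for c, result in compose_table.get(starter, ()):
        if c == combiner:
            return result
    return None

print(hex(compose(0x0041, 0x0300)))  # 0xc0, i.e. 'À'
```

    Sorting each pair list by combining code point and binary-searching
    it would bound the scan; in practice the lists are short.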

    Are there any other implementation methods that have a small memory
    footprint (~10-20 kB) and quick access (~10-20 instructions)? Any
    guidance in this regard would be appreciated.

    Michael D. Adams
    mdmkolbe@gmail.com



    This archive was generated by hypermail 2.1.5 : Wed Feb 11 2009 - 20:37:42 CST