The following improvements could be made to the in-place compression:
(a) Avoid storing sp(n) for the trailing padding of the last field. Simplest would be to use bytesPerOffset=0 to indicate all paths have the same continuation? Slightly more verbose.
(b) Optimize the storage of option('0'..'9') or 'A'..'Z'. Use one of the spare bits to mean the rest of the values are calculated from the lower comparison value. (Can optimize the search to index...)
(b2) Could use the other spare bit to allow a single null option, i.e. ' ', or '0'..'9'
(c) Use the two spare opcodes to match common strings. 1..31 could be used for very common strings and 32..287 for less common ones. Examples could be common names (JOHN, SMITH, ...) or common parts of addresses (PARK, BRIDGE, PO BOX, ...). They do not need to be complete words. It would be worth analysing some indexes to auto-generate a candidate list.
(a) and (b) are particularly good for leaves (and for branches in the case of phone numbers); (c) is good for trailing fields in branches.
I estimate these could remove another 1-2% depending on the index, possibly more.
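As a rough illustration of how the candidate list for (c) might be auto-generated, the sketch below ranks substrings of key values by an estimate of the bytes they would save. Everything in it is an assumption made for illustration: the function name, the length limits, and the cost model (each match is assumed to cost one byte, whereas the 32..287 range would presumably need a second byte). It also makes no attempt to remove overlapping candidates such as SMIT and SMITH.

```python
from collections import Counter

def candidate_strings(keys, min_len=3, max_len=8, top=287):
    """Rank substrings of the keys by a rough estimate of bytes saved.

    Hypothetical helper: the scoring assumes each match is replaced by a
    single byte, which is a simplification of the opcode scheme in (c).
    """
    counts = Counter()
    for key in keys:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(key) - length + 1):
                sub = key[start:start + length]
                if sub.strip():          # skip runs that are pure padding
                    seen.add(sub)
        counts.update(seen)              # count each substring once per key

    def score(item):
        sub, count = item
        saved = count * (len(sub) - 1)   # bytes saved under the 1-byte assumption
        return (-saved, sub)             # best saving first, ties alphabetical

    return sorted(counts.items(), key=score)[:top]
```

Running this over a sample of real keys and reviewing the top entries by hand would give a first cut of the 1..31 and 32..287 tables; a second pass would be needed to drop candidates that are substrings of better ones.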
(d) Compress space and zero repeats more aggressively. Currently runs are only compressed if > 3 long, but the output would be smaller with a threshold of >= 2. Would save < 0.01% on codesv3.
(e) Instead of space repeat and zero repeat, use null repeat. It would allow compression over text/binary boundaries and free up another opcode for compression. (Saves > 0.1% on search index)
(f) I'm fairly sure the option indicating duplicates at the end of the keyed portion is never processed, so there is no point in generating it!
(g) Divide the firstPosition in a branch by the nodeSize (could save 1-2 bytes per branch node?)
(h) Compress payload lengths if all < 255
(i) Add a flag to indicate file positions for leaves are linearly ascending (by 1) - for TLKS. Would save ~800 bytes.
(j) Common up options where all the following fields are identical
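The arithmetic behind (g) and the check needed for (i) can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes firstPosition is always an exact multiple of nodeSize and that positions are stored in the smallest whole number of bytes.

```python
def bytes_needed(value):
    """Smallest number of whole bytes that can hold an unsigned value."""
    return max(1, (value.bit_length() + 7) // 8)

def scaled_first_position(first_position, node_size):
    """(g) Store first_position / node_size instead of the raw position.

    Assumes the position is node aligned; rejects it otherwise.
    """
    if first_position % node_size != 0:
        raise ValueError("position is not a multiple of the node size")
    return first_position // node_size

def is_linear_ascending(positions):
    """(i) True when each position is exactly one more than the previous.

    In that case a flag plus the first position would be enough to
    reconstruct the whole list.
    """
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))
```

For example, with a node size of 8192 a position of 8192 * 1000 needs 3 bytes raw but only 2 once divided, and 8192 * 200 drops from 3 bytes to 1, which is in line with the 1-2 byte estimate in (g).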