New Implementation of Tag Handling

The most severe peformance problem is the way how the library is handling the tagging of characters. The old implementation of this tag handling has been completely eliminated, and a new tag handling has been implemented. The new implementation is not using toggle segments anymore, alternatively all content segments (chars, hyphens, images, windows) now are attributed with a tag set. Furthermore all lines are providing tag information for the whole line. This method has some advantages:

  1. Lookup of tags now will be done in (quasi) constant time – in fact the worst case is O(log(# of tags)), but in practice this is irrelevant – , the old implementation needs polynomial time.

  2. The tag summaries are not needed anymore – and with the elimination of the summaries a memory leak has also been eliminated. But the B-Tree will still be used for the acceleration of search operations.

Implementation Details

The tag information will be stored in a set, in general it's a bit field. Normally an application will only use a quite low number of tags, but if an application needs a large number of tags it may happen that the memory usage will explode with the use of bit fields, therefore I've implemented a virtual set struct. As soon as a specific threshold (currently 512 tags) will be exceeded, an integer array will be used. Both approaches (bit field, integer set) are approximately using the same amount of memory as the old implementation with toggles. The bit field operations are super-fast, and the operations on integer arrays are moderate in time; consider that in practice it is rather seldom that an application needs more than 512 tags. The conversions between bit fields and integer arrays, depending on the number of tags, will be computed dynamically on demand. Also the size of the bit field (the integer array don't need a size adjustment), depending on the number of tags, will be adjusted on demand.

With the new implementation of internal tag handling one important thing has been changed: in prior implementation insertion and removal was quite cheap, but the lookup of tagging information was too expensive. In new implementation this has been reversed, now the lookup is super-fast, but insertion and removal now is more expensive, especially if the tag ranges are large.

One performance problem remains: tagging large ranges of text is very expensive. This performance problem is solvable, but not without a loss of performance about 50-70% in display stuff. Due to my experience it is quite seldom that large ranges of text will be tagged, so I think that living with this performance issue is acceptable.