The Hacks of Life: Compressed Vectors

Monday, December 22, 2008

Compressed Vectors - Part 1

Normally for data that has a lot of repeating values, I use run-length encoding - it works particularly well when the data is accessed from start to end in forward order. For example, the spot lighting masks for X-Plane's 2-d panels are RGBA image that are RLE compressed. Continuous scan lines of transparent window become a single RLE line segment. This actually makes processing faster, because the decision to skip transparent areas is made once per scan line, not once per pixel.

Things get a little bit trickier if we need random access to the data. The structure I am playing with now provides something similar to that. It is a vector where continuous values are stored once in runs. It has two parts:

A run index (stored as a sorted vector of key-value pairs) maps the beginning of a run (in the virtual data) to the start of storage (in a real data vector).
A real data vector contains each value as few times as possible.

We don't need to store "how long is this run" in the data vector - the index gives it to us. We just look at the next index key-value pair and see (1) how long is our run and (2) how much data is behind it. If only one value backs our 5-value run, we are run-length compressed.

Complexity properties: call N the number of items, U the number of unique items, and R the number of runs (that is, the number of times we change from RLE to non-RLE data.

In the worst case (totally non-sequential data), the storage is the same as vector: O(N). In the best case (all the same) we are O(1). In both cases, R=1.
Generally our storage is O(R+U), which should be less than O(N).
Random access is O(log(R)) - not quite constant time, but pretty fast. For the worst case random data, R is 1, so we can be fast that way.
Pushing data onto the end is constant time plus any cost of memory reallocation; since the vector is compressed, hopefully that is less.
Deleting data from the front is O(R+U) instead of O(N), which is a bit of an improvement.

At this point the unit tests work, but I don't have stats on real-world compression.

The Hacks of Life

Monday, December 22, 2008

Compressed Vectors - Part 1

No comments:

Post a Comment