The project can be found here: <a href="https://github.com/mxmlnkn/indexed_bzip2" rel="nofollow">https://github.com/mxmlnkn/indexed_bzip2</a><p>The readme shows how to use it from Python, via ratarmount, and via its command-line tool ibzip2.<p>There are really a lot of bzip2 implementations out there but few that have been parallelized, even fewer with random access, and, to my knowledge, none of which are both parallel and offer random seeking in constant time.
This is why I started indexed_bzip2. It is a header-only C++ library with Python bindings so that I can use it as a backend from ratarmount. For that, it also supports working with Python file-like objects by using the Python API.<p>For this project, I started with the super-permissively 0-BSD licensed and small bzip2 decoder from the toybox project and extracted what I needed from it.
The architecture for the parallel decoding plus random seeking took me some iterations and contemplation to get right.
It uses block-based caching and prefetching to achieve both of these things.
"Right" means that it is modular enough that I can reason about it and be convinced that it is correct.
Of course, there are also a lot of automated tests with all kinds of sanitizers to check for correctness because I am wary when it comes to reading user data.<p>All in all, this project was a nice brainteaser with which I could grow and I think it is largely finished for now.
The single-core bzip2 decoder might profit from further optimizations but it already outperforms the internal Python bz2 module for 2 cores and more.
And it also outperforms the internal Python bz2 module when it comes to random access because it is in constant time with indexed_bzip2.
All this comes at the cost of higher memory usage (for the block cache), though.
Currently, I'm trying to reuse the infrastructure I built for this to implement parallel random access to gzip files for usage in ratarmount but I'm not sure when that will be finished.