Building data structures that are smaller than an array and faster (in C)

134 points by siganakis about 13 years ago

9 comments

gorset about 13 years ago
Wavelet trees are awesome. There are several ways to speed up the rank/select operations, which the article doesn't mention. For example, you can create a simple (popcount, offset) index for each bitmap. Then when you need to find the nth bit (the select operation), you can simply do a binary search to find the offset with the nearest count. Likewise, for access (or the rank operation) you can find the popcount for the nearest offset in the index.

I use wavelet trees heavily in my job. They are very useful as compact representations of order-preserving collections of strings that support fast lookups.

Update: Sux4j has some interesting implementations of fast select/rank:

http://sux.dsi.unimi.it/

http://sux4j.dsi.unimi.it/docs/it/unimi/dsi/sux4j/bits/Select.html
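For anyone who wants to see what such a (popcount, offset) index looks like, here is a minimal sketch in C. It is my own illustration, not code from the article or from Sux4j; the struct and function names are invented, and the index here keeps one cumulative count per 64-bit word (real implementations use coarser blocks to save space).

    #include <stdint.h>
    #include <stdlib.h>

    /* Bitmap plus a cumulative popcount index: cum[i] = ones in words[0..i-1]. */
    typedef struct {
        uint64_t *words;
        uint64_t *cum;
        size_t    nwords;
    } rank_bitmap;

    static void build_index(rank_bitmap *b) {
        b->cum = malloc((b->nwords + 1) * sizeof *b->cum);
        b->cum[0] = 0;
        for (size_t i = 0; i < b->nwords; i++)
            b->cum[i + 1] = b->cum[i] + (uint64_t)__builtin_popcountll(b->words[i]);
    }

    /* rank1(i): number of 1-bits in positions [0, i) */
    static uint64_t rank1(const rank_bitmap *b, size_t i) {
        size_t w = i / 64, r = i % 64;
        uint64_t partial = r ? (uint64_t)__builtin_popcountll(b->words[w] << (64 - r)) : 0;
        return b->cum[w] + partial;
    }

    /* select1(k): position of the k-th 1-bit, k >= 1; binary search on cum,
       then scan a single word for the remaining bits. */
    static size_t select1(const rank_bitmap *b, uint64_t k) {
        size_t lo = 0, hi = b->nwords;              /* invariant: cum[lo] < k <= cum[hi] */
        while (lo + 1 < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (b->cum[mid] < k) lo = mid; else hi = mid;
        }
        uint64_t word = b->words[lo], need = k - b->cum[lo];
        for (size_t bit = 0; bit < 64; bit++)
            if (((word >> bit) & 1) && --need == 0)
                return lo * 64 + bit;
        return (size_t)-1;                          /* k larger than total popcount */
    }

With this layout, rank is one table lookup plus one popcount, and select is a binary search over the table followed by a scan of one word, which is the trade-off the comment describes.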
nadam about 13 years ago
What kind of compressed data structure I would use depends so heavily on so many things (the probability distribution of the data, the probability distribution of the different access patterns) that it is very hard to design a data structure for the generic case. Designing for a specific use case is a lot of fun, though!

For example, this: "Use less memory than an array in all circumstances" is just plain impossible if your data distribution is uniform. You simply cannot store 1000 32-bit random integers (preserving their order) in less space than 1000*32 bits; it follows from simple combinatorics.
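For readers who want that counting argument spelled out, here is a minimal version (my restatement, not part of the original comment):

$$
\#\{\text{ordered sequences}\} \;=\; (2^{32})^{1000} \;=\; 2^{32000}
\quad\Longrightarrow\quad
\log_2\!\bigl(2^{32000}\bigr) \;=\; 32000 \;=\; 1000 \times 32 \ \text{bits}.
$$

No lossless encoding can give all of these equally likely sequences codewords averaging fewer than 32000 bits, so for uniform data a plain array is already optimal.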
gregschlom about 13 years ago
So wait... You have 64-bit keys and 32-bit values. Then why not store the values in the keys? Use the lower 32 bits for the value, and the upper 32 bits to count how many duplicates you have.

    #include <stdint.h>

    static int32_t big_array[1ULL << 32] = {0};

    /* returns the key */
    int64_t write(int32_t value) {
        uint32_t v = (uint32_t)value;
        big_array[v]++;
        return ((int64_t)big_array[v] << 32) | v;
    }

    /* returns the value */
    int32_t read(int64_t key) {
        return (int32_t)key;
    }

    /* all indexes between 1 and the return value are valid upper-half
       values for the key; the lower half is the value itself */
    int32_t seek(int32_t value) {
        return big_array[(uint32_t)value];
    }

memory: 16GB

access: O(1)

seek: O(1)

Problem solved!

:)

edit: I'm missing the "Use less memory than an array in all circumstances" requirement, but hey, you talk about "billions of records" in your post.
tabbyjabby about 13 years ago
Bit confused here --

The author rewrites the initial values as:

1 0 2 0 2 3

He defines the keys array as:

2 3 4 6

Using the method defined in the article to find the values of each of the elements of the rewritten array, wouldn't the values be:

3 2 4 2 4 6

which does not equal the original values of:

3 2 4 6 2 6
nkurz about 13 years ago
Depending on your data set, there might be a way to use Compressed Bit Vectors: http://sdm.lbl.gov/fastbit/

General intro: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-2164E.pdf

"Word Aligned Hybrid" compression: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-49627.pdf

Bit maps as alternative to inverted index: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf

Bit maps for range queries: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-60891.pdf
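For readers who have not seen the word-aligned idea before, here is a much-simplified sketch in its spirit. This is my own illustration, not the actual FastBit/WAH format; the word layout, names, and granularity are invented for the example.

    #include <stdint.h>
    #include <stddef.h>

    /* Simplified word-aligned compression: the bitmap is cut into 31-bit
     * groups; runs of all-zero or all-one groups collapse into one "fill"
     * word, everything else is stored as a literal word.
     *
     *   literal: bit 31 = 0, bits 0..30 = payload
     *   fill:    bit 31 = 1, bit 30 = fill value, bits 0..29 = run length
     *
     * `out` must have room for at least n words (worst case: all literals). */
    static size_t wah_encode(const uint32_t *groups, size_t n, uint32_t *out) {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            uint32_t g = groups[i] & 0x7FFFFFFFu;
            if (g == 0 || g == 0x7FFFFFFFu) {        /* start of a fill run */
                uint32_t fill_bit = (g != 0);
                uint32_t len = 0;
                while (i < n && (groups[i] & 0x7FFFFFFFu) == g && len < 0x3FFFFFFFu) {
                    len++;
                    i++;
                }
                out[o++] = 0x80000000u | (fill_bit << 30) | len;
            } else {                                 /* mixed group: store as literal */
                out[o++] = g;
                i++;
            }
        }
        return o;   /* number of compressed words written */
    }

Very sparse or very dense bitmaps collapse into a handful of fill words, which is what makes bitmap indexes competitive for the range queries discussed in the linked papers.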
anemitz about 13 years ago
I've seen slab allocation schemes (not sure if that's a standard term for it) work in scenarios like inverted indices where you want to keep your datum key size small (32 vs. 64 bit).

The intuition is that you don't need granular memory addressing for storing large chunks. You only need addressability at the min(sizeof(doc)) level. So you can usually store the key in a 32-bit int or less.

More detail can be found here: https://gist.github.com/1947190
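A tiny sketch of how I read that (my own names and granule size, not taken from the comment or the gist): if every document occupies at least some granule of G bytes, offsets can be stored in units of G, so a 32-bit slot number addresses far more than 4 GB of raw storage.

    #include <stdint.h>

    /* Address storage in fixed-size slots so that a 32-bit "key" covers much
     * more than 4 GB of raw bytes. GRANULE stands in for min(sizeof(doc)),
     * rounded to a power of two -- an assumption for illustration only. */
    #define GRANULE 256u    /* bytes per slot */

    static inline uint64_t slot_to_offset(uint32_t slot) {
        return (uint64_t)slot * GRANULE;        /* byte offset into the arena */
    }

    static inline uint32_t offset_to_slot(uint64_t offset) {
        return (uint32_t)(offset / GRANULE);    /* documents start on slot boundaries */
    }

With a 256-byte granule, 2^32 slots span 1 TiB, which is why the datum key can stay a 32-bit int even for very large indexes.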
wollw about 13 years ago
A very interesting data structure. Reading this inspired me to give an implementation a try. It was a lot harder than it seemed at first and it's certainly a naive attempt, but if anyone is interested you can take a look at it here:

https://gist.github.com/1947371
lsb about 13 years ago
So you have 64-bit integer keys and 64-bit integer values, and you need to store these key-value pairs such that you get good lookups from key to list of values and from value to list of keys?

How often are you updating? What is reading from this data structure? What's the query structure?

If you're adding lots of records often, I'd go for a B+tree. There's also a trie (and a B-trie), if your keys share lots of common prefixes.

If you are making consecutive queries, then you'd do well to take advantage of what pages are in L1 and L2 cache, but if you're randomly picking things, that's less important.
helveticaman about 13 years ago
(particularly in the "refine" and "access" methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.)
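For context, that instruction is exposed through compiler intrinsics/builtins; a minimal example of how such a call typically looks (my own snippet, not code from the article):

    #include <stddef.h>
    #include <stdint.h>
    #include <nmmintrin.h>   /* SSE4.2 intrinsics; compile with -msse4.2 */

    /* Count the set bits in the first n 64-bit words of a bitmap. */
    static uint64_t popcount_words(const uint64_t *words, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += (uint64_t)_mm_popcnt_u64(words[i]);   /* or __builtin_popcountll */
        return total;
    }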