Their solution seems over-engineered for the scale of data they're concerned with.

A 10-digit phone number can be encoded in 34 bits. Supposing they used 64-bit-wide fields to store the key and target data, that leaves 30 bits (more than their 2 bytes) to play with for flags or whatever target data. To store the whole dataset in a trie or tree-like structure on disk in this model, they would need 8 bytes × 400 million phone numbers ≈ 3.2 GB of data (plus structural overhead). So a single 3 TB or 4 TB HDD ($130 USD each) would be far more than enough; in a trie- or tree-like on-disk format this would give them an access/read time of single-digit or low-double-digit milliseconds on an HDD, or below 1 millisecond on three striped 1 TB SSDs ($300 USD each), for example. The disk controller's cache and the host's file buffers/cache would work in their favor as well, since there are probably plenty of hotspots in the data. mmap could make the job of reading the file/disk even easier (and faster).

The data just isn't that big by today's standards.

Since false negatives are not possible in a properly implemented Bloom filter, putting a Bloom filter in front of this on-disk approach would make it even faster for the negative case: the disk would only be hit for positives or false positives, the latter of which should be relatively rare.

EDIT: It might be even simpler/better/faster than a trie/tree to just store the data indexed by some hash function (even just prefix bucketing) and then sequentially scan the collision list for that prefix hash for matches to the target number. That would take advantage of the much better sequential read throughput of most commodity HDDs/SSDs, so only one random read is necessary.
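A minimal sketch of that bit-packing layout in Python (the flag names and field widths are my own illustration, not the article's actual format):

```python
# Sketch only: pack a 10-digit number into the low 34 bits of a 64-bit word,
# leaving the upper 30 bits for flags/target data. Flag names are hypothetical.

NUMBER_BITS = 34                 # 2**34 > 10**10, so 34 bits fit any 10-digit number
FLAG_DND     = 0b01              # example flags, not from the article
FLAG_BLOCKED = 0b10

def pack(number: int, flags: int = 0) -> int:
    assert 0 <= number < 10**10 and 0 <= flags < (1 << 30)
    return number | (flags << NUMBER_BITS)

def unpack(word: int) -> tuple[int, int]:
    return word & ((1 << NUMBER_BITS) - 1), word >> NUMBER_BITS

record = pack(9876543210, FLAG_DND)
assert unpack(record) == (9876543210, FLAG_DND)
```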
Similar issue in doing LNP lookups in the US. There are about 500M entries, each mapping to a different LRN. Naively, that's 16 bytes per entry (8 for the number, 8 for the LRN), so 8 GB.

The first optimization is turning the LRNs into a lookup table, since there are only ~30K uniques. That cuts the data to 5 GB. That's OK, but it'd be better if it were smaller and quicker to load.

I implemented a delta-encoding system with multi-level indexing, ISAM style. Data is compressed on a "page" basis, about 4 KB of data at a time. That way all data stays compressed while in memory and lookups can be done on the compressed data directly. A header on each page gives the range it covers, so you can quickly skip to the next page if the index wasn't precise enough.

Delta encoding is neat here because, on average, there will only be a gap of ~18 numbers from one entry to the next (~9B possible numbers / 500M entries). I used base-128 encoding, but Simple16 or another format (there are SIMD-optimized ones) would have been even better.

Using this method, the entire dataset only requires about 600-800 MB of RAM.
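For illustration, a minimal Python sketch of the delta + base-128 (varint) step; the paging, page headers, multi-level index, and LRN table are omitted, and this is my own rendering of the technique rather than the commenter's code:

```python
# Delta-encode a sorted list of numbers, storing each gap as a LEB128-style varint.

def encode(sorted_numbers):
    out, prev = bytearray(), 0
    for n in sorted_numbers:
        gap = n - prev
        prev = n
        while True:
            if gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)   # 7 data bits, high bit = "more follows"
                gap >>= 7
            else:
                out.append(gap)
                break
    return bytes(out)

def decode(data):
    numbers, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                        # last byte of this varint
            prev += gap
            numbers.append(prev)
            gap, shift = 0, 0
    return numbers

nums = [9_100_000_001, 9_100_000_019, 9_100_000_042]
assert decode(encode(nums)) == nums                # small gaps cost one byte each
```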
The thing that strikes me is that while the data being searched for has been optimized down to 2 bytes of bitfields, the same hasn't been done with the phone numbers. Area codes in India appear to be variable-length (2, 3, ? digits), with that length being part of the 10 digits, and presumably mobile numbers are handled similarly, either with their own area codes or overlapping. Breaking the data up by area code would allow the elimination of significant chunks of the total possible address space, and further separating (as they do) into dense and sparse ranges would reduce the requirements even more (since tries, skip lists, etc. have their own overhead).

It comes down to this: while the Indian phone system could in theory allow 10 billion phone numbers, there are not in fact 10 billion legally-allocatable phone numbers, and the rules in place likely reduce the legally-allocatable count to somewhere between 1-3 billion. Dividing based on those rules may both reduce the problem to something requiring less engineering time and simplify possible marketing-related factors (e.g. selling access/systems targeted only to specific area codes).
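A rough sketch of what that prefix bucketing might look like in Python; the prefix table is invented for illustration, since the real Indian numbering plan would have to supply the actual allocation rules:

```python
# Hypothetical prefix table: real area-code lengths and values differ.
AREA_PREFIXES = {"080", "079", "11", "22"}

def split(number: str):
    """Longest-prefix match: return (area_code, subscriber_suffix)."""
    for length in (3, 2):
        if number[:length] in AREA_PREFIXES:
            return number[:length], number[length:]
    return None, number            # falls outside the allocatable ranges

buckets: dict[str, set[int]] = {}
for n in ("1123456789", "0801234567"):
    code, suffix = split(n)
    buckets.setdefault(code, set()).add(int(suffix))   # store only the suffix per bucket
```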
We are using a constant database (cdb) [1] for a similar use case (an MNP database), albeit much smaller, around 15 million phone numbers. It requires almost no RAM, and you update the database as a whole with no downtime.

[1] https://en.wikipedia.org/wiki/Cdb_(software)
This scale is pretty small and can be done on my laptop. If you need to do such operations at a truly large scale, this is an interesting article:
<a href="http://alexbowe.com/rrr/" rel="nofollow">http://alexbowe.com/rrr/</a>
C++ implementation:
<a href="https://github.com/simongog/sdsl-lite/blob/master/README.md" rel="nofollow">https://github.com/simongog/sdsl-lite/blob/master/README.md</a>
I interned at a similar company; they had this sorted out way back then using Bloom filters. This is the GitHub link if anyone is interested: https://github.com/mobmewireless/fast-filter
Article seems down. Here is a Google cached version:

http://webcache.googleusercontent.com/search?q=cache:http://sparktg.com/engineering/
Earlier submission tanked on HN and the link didn't work, resubmitting with lookup timings added. Hopefully, readers will find it interesting this time.
Idea: use a hash table with perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function; e.g. library: https://github.com/wahern/phf)
This is what tries [1] are for, as well as skip lists [2]. A skip list over a suitably sized prefix would likely be more space-efficient than a trie, but some trie variants, such as radix trees [3] with a sufficiently high radix, could also be quite efficient.

The described approach doesn't seem *bad*, though it's a bit amusing to see this described as if it's not a very well-trodden area of computer science.

[1] https://en.wikipedia.org/wiki/Trie

[2] https://en.wikipedia.org/wiki/Skip_list

[3] https://en.wikipedia.org/wiki/Radix_tree
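For concreteness, a minimal digit-trie sketch in Python (a dict-per-node toy, not a space-optimized radix tree; the small flags value stands in for the article's 2 bytes of target data):

```python
class DigitTrie:
    """Toy trie keyed on decimal digits; terminal nodes hold a small flags value."""
    __slots__ = ("children", "flags")

    def __init__(self):
        self.children = {}       # digit character -> child node
        self.flags = None

    def insert(self, number: str, flags: int) -> None:
        node = self
        for d in number:
            node = node.children.setdefault(d, DigitTrie())
        node.flags = flags

    def lookup(self, number: str):
        node = self
        for d in number:
            node = node.children.get(d)
            if node is None:
                return None      # number not in the set
        return node.flags

t = DigitTrie()
t.insert("9876543210", 0b01)
assert t.lookup("9876543210") == 0b01
assert t.lookup("9876543211") is None
```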
Is it even worth mentioning? The straightforward approach of storing the numbers in an in-memory hash map solves the problem. This might require, say, 40 GB of RAM without any optimizations.
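A back-of-envelope check on that figure (the 400 million count comes from an earlier comment; the ~100 bytes per entry is my assumption for an unoptimized, object-based hash map):

```python
entries = 400_000_000        # active numbers, per the thread
bytes_per_entry = 100        # assumed key + value + bucket/object overhead
print(entries * bytes_per_entry / 10**9, "GB")   # 40.0 GB, in line with the estimate
```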
Why not OpenLDAP with MDB? It's the fastest in terms of lookups: http://www.slideshare.net/ldapcon/whats-new-in-openldap