Their solution seems over-engineered for the scale of data they're concerned with.

A 10-digit phone number can be encoded in 34 bits. Supposing they used 64-bit-wide fields to store the key and target data, that leaves 30 bits (more than their 2 bytes) to play with for flags or whatever target data. To store the whole dataset in a trie or tree-like structure on disk in this model, they would need 8 bytes × 400 million phone numbers ≈ 3.2 GB of data (plus structural overhead). So a single 3 TB or 4 TB HDD ($130 USD each) would be far more than enough; in a trie- or tree-like on-disk format this would give them an access/read time of single-digit or low-double-digit milliseconds on an HDD, or below 1 millisecond on three striped 1 TB SSDs ($300 USD each), for example. The disk controller's cache and the host's file buffers/cache would work in their favor as well, since there are probably plenty of hotspots in the data. mmap could make the job of reading the file/disk even easier (and faster).

The data just isn't that big by today's standards.

Since false negatives are not possible in a properly implemented Bloom filter, putting a Bloom filter in front of this on-disk approach would make it even faster for the negative case: the disk would only be hit for positives or false positives, the latter of which should be relatively rare.

EDIT: It might be even simpler/better/faster than a trie/tree to just store the data indexed by some hash function (even just prefix bucketing) and then sequentially scan the collision list for that prefix hash for matches to the target number. That would take advantage of the much better sequential read throughput of most commodity HDDs/SSDs, so only one random read is necessary.
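A minimal sketch of that bit-packing layout in Python (the flag names and field widths are my own illustration, not the article's actual format):

```python
# Sketch only: pack a 10-digit number into the low 34 bits of a 64-bit word,
# leaving the upper 30 bits for flags/target data. Flag names are hypothetical.

NUMBER_BITS = 34                 # 2**34 > 10**10, so 34 bits fit any 10-digit number
FLAG_DND     = 0b01              # example flags, not from the article
FLAG_BLOCKED = 0b10

def pack(number: int, flags: int = 0) -> int:
    assert 0 <= number < 10**10 and 0 <= flags < (1 << 30)
    return number | (flags << NUMBER_BITS)

def unpack(word: int) -> tuple[int, int]:
    return word & ((1 << NUMBER_BITS) - 1), word >> NUMBER_BITS

record = pack(9876543210, FLAG_DND)
assert unpack(record) == (9876543210, FLAG_DND)
```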
Similar issue in doing LNP lookups in the US. There are about 500M entries, each mapping to a different LRN. Naively, that's 16 bytes per entry (8 for the number, 8 for the LRN), so 8 GB.

The first optimization is turning the LRNs into a lookup table, since there are only ~30K uniques. That cuts the data to 5 GB. That's OK, but it'd be better if it were smaller and quicker to load.

I implemented a delta-encoding system with multi-level indexing, ISAM style. Data is compressed on a "page" basis, about 4 KB of data at a time. That way all data stays compressed while in memory and lookups can be done on the compressed data directly. A header on each page gives the range it covers, so you can quickly skip to the next page if the index wasn't precise enough.

Delta encoding is neat here because, on average, there will only be a gap of ~18 numbers from one entry to the next (~9B possible numbers / 500M entries). I used base-128 encoding, but Simple16 or another format (there are SIMD-optimized ones) would have been even better.

Using this method, the entire dataset only requires about 600-800 MB of RAM.
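For illustration, a minimal Python sketch of the delta + base-128 (varint) step; the paging, page headers, multi-level index, and LRN table are omitted, and this is my own rendering of the technique rather than the commenter's code:

```python
# Delta-encode a sorted list of numbers, storing each gap as a LEB128-style varint.

def encode(sorted_numbers):
    out, prev = bytearray(), 0
    for n in sorted_numbers:
        gap = n - prev
        prev = n
        while True:
            if gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)   # 7 data bits, high bit = "more follows"
                gap >>= 7
            else:
                out.append(gap)
                break
    return bytes(out)

def decode(data):
    numbers, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                        # last byte of this varint
            prev += gap
            numbers.append(prev)
            gap, shift = 0, 0
    return numbers

nums = [9_100_000_001, 9_100_000_019, 9_100_000_042]
assert decode(encode(nums)) == nums                # small gaps cost one byte each
```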
The thing that strikes me is that while the data being searched for has been optimized down to 2 bytes of bitfields, the same hasn't been done with the phone numbers. Area codes in India appear to be variable-length (2, 3, ? digits), with that length being part of the 10 digits, and presumably mobile numbers are handled similarly, either with their own area codes or overlapping. Breaking the data up by area code would allow the elimination of significant chunks of the total possible address space, and further separating (as they do) into dense and sparse ranges would reduce the requirements even more (since tries, skip lists, etc. have their own overhead).

It comes down to this: while the Indian phone system could in theory allow 10 billion phone numbers, there are not in fact 10 billion legally-allocatable phone numbers, and the rules in place likely reduce the legally-allocatable count to somewhere between 1-3 billion. Dividing based on those rules may both reduce the problem to something requiring less engineering time and simplify possible marketing-related factors (e.g. selling access/systems targeted only to specific area codes).
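A rough sketch of what that prefix bucketing might look like in Python; the prefix table is invented for illustration, since the real Indian numbering plan would have to supply the actual allocation rules:

```python
# Hypothetical prefix table: real area-code lengths and values differ.
AREA_PREFIXES = {"080", "079", "11", "22"}

def split(number: str):
    """Longest-prefix match: return (area_code, subscriber_suffix)."""
    for length in (3, 2):
        if number[:length] in AREA_PREFIXES:
            return number[:length], number[length:]
    return None, number            # falls outside the allocatable ranges

buckets: dict[str, set[int]] = {}
for n in ("1123456789", "0801234567"):
    code, suffix = split(n)
    buckets.setdefault(code, set()).add(int(suffix))   # store only the suffix per bucket
```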
We are using a constant database (cdb) [1] for a similar use case (an MNP database), albeit much smaller, around 15 million phone numbers. It requires almost no RAM, and you update the database as a whole with no downtime.

[1] https://en.wikipedia.org/wiki/Cdb_(software)
This scale is pretty small and can be done on my laptop. If you need to do such operations at a truly large scale, this is an interesting article:
<a href="http://alexbowe.com/rrr/" rel="nofollow">http://alexbowe.com/rrr/</a>
C++ implementation:
<a href="https://github.com/simongog/sdsl-lite/blob/master/README.md" rel="nofollow">https://github.com/simongog/sdsl-lite/blob/master/README.md</a>
I interned at a similar company; they had this sorted out way back then using Bloom filters. This is the GitHub link if anyone is interested: https://github.com/mobmewireless/fast-filter
Article seems down. Here is a Google cached version:

http://webcache.googleusercontent.com/search?q=cache:http://sparktg.com/engineering/
Earlier submission tanked on HN and the link didn't work, resubmitting with lookup timings added. Hopefully, readers will find it interesting this time.
Idea: use a hash table with perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function; e.g. library: https://github.com/wahern/phf)
This is what tries [1] are for, as well as skip lists [2]. A skip list over a suitably sized prefix would likely be more space-efficient than a trie, but some trie variants, such as radix trees [3] with a sufficiently high radix, could also be quite efficient.

The described approach doesn't seem *bad*, though it's a bit amusing to see this described as if it's not a very well-trodden area of computer science.

[1] https://en.wikipedia.org/wiki/Trie

[2] https://en.wikipedia.org/wiki/Skip_list

[3] https://en.wikipedia.org/wiki/Radix_tree
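For concreteness, a minimal digit-trie sketch in Python (a dict-per-node toy, not a space-optimized radix tree; the small flags value stands in for the article's 2 bytes of target data):

```python
class DigitTrie:
    """Toy trie keyed on decimal digits; terminal nodes hold a small flags value."""
    __slots__ = ("children", "flags")

    def __init__(self):
        self.children = {}       # digit character -> child node
        self.flags = None

    def insert(self, number: str, flags: int) -> None:
        node = self
        for d in number:
            node = node.children.setdefault(d, DigitTrie())
        node.flags = flags

    def lookup(self, number: str):
        node = self
        for d in number:
            node = node.children.get(d)
            if node is None:
                return None      # number not in the set
        return node.flags

t = DigitTrie()
t.insert("9876543210", 0b01)
assert t.lookup("9876543210") == 0b01
assert t.lookup("9876543211") is None
```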
Is it even worth mentioning? The straightforward approach of storing the numbers in an in-memory hash map solves the problem. This might require, say, 40 GB of RAM without any optimizations.
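A back-of-envelope check on that figure (the 400 million count comes from an earlier comment; the ~100 bytes per entry is my assumption for an unoptimized, object-based hash map):

```python
entries = 400_000_000        # active numbers, per the thread
bytes_per_entry = 100        # assumed key + value + bucket/object overhead
print(entries * bytes_per_entry / 10**9, "GB")   # 40.0 GB, in line with the estimate
```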
Why not OpenLDAP with MDB? It's the fastest in terms of lookups: http://www.slideshare.net/ldapcon/whats-new-in-openldap