TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Show HN: Making Go Minimal Perfect Hashing Builds Faster and More Robust

1 point by aelaguiz, about 1 month ago
Hey HN,

If you're working with large static datasets in Go and need fast lookups, you might be familiar with Minimal Perfect Hashing (MPH). The `mph` library (implementing the Compress, Hash, and Displace algorithm, CHD) is a great tool for this. It lets you build immutable key-value stores with guaranteed no hash collisions, perfect for things like indexing, feature lookups, or game data where keys are known upfront.

We've been using it heavily, but ran into challenges when building very large tables (millions of complex keys) for a project analyzing complex *poker* decision spaces (mapping millions of possible game states to optimal strategies). The original build process, while generally solid, could be slow, and occasionally failed if it couldn't find a collision-free hash function within its default retry limits for a specific "bucket" of keys. Getting unlucky with the initial random seed meant manual restarts, which wasn't ideal for automation or large datasets.

To address this, we've pushed some significant enhancements to a fork of the library (https://github.com/aelaguiz/mph):

1. *Parallel build attempts (`.ParallelAttempts(n)`):* This is the biggest change. Instead of trying just one random seed, `Build()` can now launch `n` build attempts concurrently using different seeds (the first uses the specified/time seed; the others are random). The *first* goroutine to complete the build successfully returns its result, and a `context.Context` is used to signal cancellation to the other attempts. This drastically increases the probability of a successful build for tricky datasets, and is often faster in wall-clock time than sequential retries.

2. *Build progress reporting (`.ProgressChan(...)`):* For long builds, you can now provide an optional `chan mph.BuildProgress`. The builder sends updates on the build stage ("Hashing Keys", "Assigning Hashes", etc.), buckets processed/total, and even the collision count for the bucket currently being worked on. This is very useful for visibility, especially when running parallel attempts, to see which ones are progressing and which are stuck. *Pro tip: use a buffered channel!*

3. *Tunable parameters (`.BucketRatio()`, `.RetryLimit()`):* You can now configure the `bucketRatio` (the memory vs. collision trade-off for the initial hash table) and the `retryLimit` (how many hash functions to try per bucket before giving up). This lets you tune the build to your data's characteristics or resource constraints.

We found these changes made the MPH build process much more reliable and predictable for our demanding use case. The parallel attempts, in particular, turned frustrating random failures into successes just by throwing a bit more compute at the problem concurrently.

The core MPH lookup performance is unchanged (it's still the fast CHD algorithm), but the build experience for large or complex key sets is significantly improved.

Check out the [README](https://github.com/aelaguiz/mph/blob/master/README.md) for examples of the new features.

Happy Hashing!
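The parallel-attempt idea from point 1 can be sketched in plain Go with just the standard library. This is a stand-in, not the fork's actual code: `tryBuild`, `parallelBuild`, and the simulated per-seed failure are all illustrative names, showing only the "first success wins, context cancels the rest" pattern the post describes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// buildResult carries the outcome of one seeded build attempt.
type buildResult struct {
	seed int64
	err  error
}

// tryBuild is a stand-in for one seeded MPH build attempt: it checks the
// context between buckets so a cancelled attempt stops early, and (purely
// for illustration) even seeds deterministically exhaust the retry limit.
func tryBuild(ctx context.Context, seed int64) error {
	for bucket := 0; bucket < 1000; bucket++ {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		if seed%2 == 0 && bucket == 500 {
			return errors.New("retry limit exceeded")
		}
	}
	return nil
}

// parallelBuild launches n seeded attempts concurrently; the first success
// cancels the remaining attempts via the shared context.
func parallelBuild(n int) (int64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan buildResult, n) // buffered so losing attempts never block
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		seed := int64(i) // first attempt: fixed seed; others could be random
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- buildResult{seed: seed, err: tryBuild(ctx, seed)}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	var lastErr error
	for res := range results {
		if res.err == nil {
			cancel() // signal the other attempts to stop
			return res.seed, nil
		}
		lastErr = res.err
	}
	return 0, fmt.Errorf("all %d attempts failed: %w", n, lastErr)
}

func main() {
	seed, err := parallelBuild(4)
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println("build succeeded with seed", seed)
}
```

Buffering `results` to capacity `n` is what lets the winner return immediately: attempts that finish after cancellation can still deposit their result without anyone listening, so no goroutine leaks.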
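The buffered-channel tip in point 2 matters because a build should never stall on a slow progress consumer. A minimal sketch of that pattern, assuming a `BuildProgress`-like struct (the field names and `buildWithProgress` are illustrative, not necessarily the fork's API):

```go
package main

import "fmt"

// BuildProgress mirrors the kind of update the fork's ProgressChan sends;
// the field names here are illustrative.
type BuildProgress struct {
	Stage        string
	BucketsDone  int
	BucketsTotal int
	Collisions   int
}

// buildWithProgress simulates a build that reports per-bucket progress.
// A nil channel disables reporting, and sends are non-blocking, so a slow
// (or absent) consumer can never stall the build itself.
func buildWithProgress(progress chan<- BuildProgress) {
	const total = 3
	for i := 1; i <= total; i++ {
		if progress != nil {
			select {
			case progress <- BuildProgress{Stage: "Assigning Hashes", BucketsDone: i, BucketsTotal: total}:
			default: // drop the update rather than block the build
			}
		}
	}
	if progress != nil {
		close(progress)
	}
}

func main() {
	progress := make(chan BuildProgress, 16) // buffered, as the post advises
	go buildWithProgress(progress)
	for p := range progress {
		fmt.Printf("%s: %d/%d\n", p.Stage, p.BucketsDone, p.BucketsTotal)
	}
}
```

With a buffered channel the non-blocking sends almost always succeed; with an unbuffered one, any update arriving while the consumer is busy would simply be dropped.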

no comments
