Ummm... because it’s an O(1) array lookup not a search at all? Infuriating.<p>It’s read-only static data. Spending even 60ms on the response is ridiculous. Reading from files in blob storage... WTF?<p>Ctrl-F Redis - was disappointed.<p>Actually, even forget Redis. Pre-generate each of the 1 million possible HTTP responses and store in a string array. The 5 character hex is the index into the array. Write < 100 lines of Go to load the data structure and serve it. What am I missing?<p>This is like “Hello World” in those HTTP Framework Benchmarks that used to make the rounds every few months.
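Roughly what I have in mind, as an untested sketch: assume the ~1M pre-generated, pre-gzipped response bodies already sit on disk in a ./responses/ directory, one file per five-character hex prefix (the directory layout, port, and 30-day cache header are all made up for illustration).

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Load all 16^5 = 1,048,576 pre-generated (already gzipped) response
	// bodies into RAM, indexed by the integer value of the 5-char hex prefix.
	responses := make([][]byte, 1<<20)
	for i := range responses {
		body, err := os.ReadFile(fmt.Sprintf("responses/%05X", i))
		if err != nil {
			panic(err)
		}
		responses[i] = body
	}

	http.HandleFunc("/range/", func(w http.ResponseWriter, r *http.Request) {
		prefix := strings.ToUpper(strings.TrimPrefix(r.URL.Path, "/range/"))
		idx, err := strconv.ParseUint(prefix, 16, 64)
		if len(prefix) != 5 || err != nil {
			http.Error(w, "prefix must be 5 hex characters", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		w.Header().Set("Content-Encoding", "gzip")                  // bodies are stored gzipped
		w.Header().Set("Cache-Control", "public, max-age=2592000")  // ~30 days for the edge
		w.Write(responses[idx]) // O(1) array lookup, no search
	})
	http.ListenAndServe(":8080", nil)
}
```

At ~10KB per gzipped body that's on the order of 10GB resident, which still fits comfortably on a modest dedicated box.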
Just putting this on a VPS or cheap dedicated server with 16GB of RAM would've led to sub-ms response times at much lower cost (if you don't get Azure and Cloudflare for free like he does). At those response speeds, scalability is also not really an issue if you cache aggressively at the edge.<p>Argo is then nice to have but not really necessary. If the server responds in <1ms, the 30% saved RTT is probably not noticeable to the user.
Couldn't he just pregenerate all 1,048,576 responses, load them into RAM (or just serve them as a bunch of static files from an nginx with caching on) and be done with it? He writes that a single response, gzipped, averages 10KB, so that's only about 10GB in total.<p>Even better: host this on a service like Netlify and not even have the 30-day cache timeout Troy has here (which means up to 30-day-old info in case of new breaches). Just regenerate the entire set on the dev box whenever there's a new breach (should be fast enough, it's a linear scan & split) and push it to Netlify, which will invalidate all edge caches automatically.
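A rough sketch of that pre-generation step, assuming the downloadable dump is one "SHA1HASH:count" line per row, sorted by hash, and that a site/range/ output directory already exists (both file names here are made up):

```go
package main

import (
	"bufio"
	"os"
)

// Split a sorted "SHA1HASH:count" dump into one static file per
// five-hex-character prefix, each holding "SUFFIX:count" lines, ready
// to be pushed to Netlify or dropped behind nginx.
func main() {
	in, err := os.Open("pwned-passwords-sha1.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	var (
		current string
		out     *os.File
		w       *bufio.Writer
	)
	flush := func() {
		if w != nil {
			w.Flush()
			out.Close()
		}
	}

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		if len(line) < 6 {
			continue // skip blank/short lines
		}
		prefix, rest := line[:5], line[5:] // rest = 35-char suffix + ":count"
		if prefix != current {
			flush()
			out, err = os.Create("site/range/" + prefix)
			if err != nil {
				panic(err)
			}
			w = bufio.NewWriter(out)
			current = prefix
		}
		w.WriteString(rest + "\n")
	}
	flush()
}
```

Because the input is sorted by hash, a single pass is enough; after that, push the site/ directory to Netlify (or rsync it to the nginx box) whenever a new breach lands.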
When you don't need transactions, don't have a cache invalidation problem, and are querying read-only data, this architecture makes sense - or really any architecture that takes advantage of what makes HTTP scalable: mostly idempotent, cacheable responses to GETs.
I love the k-Anonymity model. It makes it actually feasible to check passwords against HIBP when carrying out password audits for clients. Shameless plug, but I've added it to my Active Directory password auditing tool: <a href="https://github.com/eth0izzle/cracke-dit" rel="nofollow">https://github.com/eth0izzle/cracke-dit</a>
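For anyone who hasn't looked at it, the client side is tiny. A minimal sketch of the range lookup (the api.pwnedpasswords.com/range/ endpoint is the one from the post; error handling kept to a minimum): only the first five hex characters of the SHA-1 ever leave the machine, and the suffix comparison happens locally.

```go
package main

import (
	"bufio"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// pwnedCount checks a password against the Pwned Passwords range API.
// Only the first five hex characters of the SHA-1 are sent; matching
// the suffix locally is the whole point of the k-anonymity model.
func pwnedCount(password string) (string, error) {
	sum := sha1.Sum([]byte(password))
	digest := strings.ToUpper(hex.EncodeToString(sum[:]))
	prefix, suffix := digest[:5], digest[5:]

	resp, err := http.Get("https://api.pwnedpasswords.com/range/" + prefix)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// Each line is "SUFFIX:count" for every breached hash in this range.
		parts := strings.SplitN(scanner.Text(), ":", 2)
		if len(parts) == 2 && strings.EqualFold(strings.TrimSpace(parts[0]), suffix) {
			return strings.TrimSpace(parts[1]), nil
		}
	}
	return "", scanner.Err()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: pwncheck <password>")
		os.Exit(1)
	}
	count, err := pwnedCount(os.Args[1])
	if err != nil {
		panic(err)
	}
	if count == "" {
		fmt.Println("not found in any known breach")
	} else {
		fmt.Println("seen", count, "times")
	}
}
```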
Lots of suggestions in this thread about better architectures, but they all seem to forget that this is designed to be minimal in cost, complexity and maintenance while delivering 100% availability and great performance.<p>While Redis or a VM would be faster, that's way more overhead compared to a few cloud functions and table storage. This whole thing is event-driven and easy to build with just your browser, along with cheap and granular billing. Cloudflare also already caches the responses, so there's really no need for the origin to be perfect.
Somewhat related (and nitpicky), but there are some spelling errors (derrivation, seperate...) in the Cloudflare post that explains k-anonymity:<p><a href="https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/" rel="nofollow">https://blog.cloudflare.com/validating-leaked-passwords-with...</a><p>P.S.: As a non-native speaker, I had to look those words up to check them, as I trusted the spelling in an official blog post.
Here's a slightly more easily auditable version of the checker, in Python:<p><a href="https://www.pastery.net/wwzqua/" rel="nofollow">https://www.pastery.net/wwzqua/</a><p>The bash one was fine; I just prefer the readability of Python so I can be sure that only the truncated version of my hash is ever sent.
Does anyone know: if we enumerate all alphanumeric strings of length 6-16, how many would have a SHA-1 hash matching a given five-character prefix?<p>I’m certainly not saying I think this is an issue! I’m just academically curious about the number and how to go about calculating it.
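Back-of-envelope, assuming SHA-1 behaves like a uniform random function and "alphanumeric" means the 62 characters a-z, A-Z, 0-9: there are sum(n=6..16) 62^n ≈ 4.8 × 10^28 such strings, spread over 16^5 = 2^20 possible prefixes, so roughly 4.6 × 10^22 of them should hash to any given five-character prefix. A quick exact check:

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Total strings of length 6..16 over a 62-character alphabet
	// (a-z, A-Z, 0-9); "alphanumeric" is taken as case-sensitive here.
	total := new(big.Int)
	for n := int64(6); n <= 16; n++ {
		total.Add(total, new(big.Int).Exp(big.NewInt(62), big.NewInt(n), nil))
	}
	// 16^5 = 2^20 possible five-hex-character prefixes; a uniform hash
	// spreads the strings evenly across them in expectation.
	perPrefix := new(big.Int).Div(total, big.NewInt(1<<20))
	fmt.Println("total candidate strings:", total)     // ~4.8e28
	fmt.Println("expected per prefix:    ", perPrefix) // ~4.6e22
}
```

So a single prefix still corresponds to an astronomically large set of candidate strings, which is why sending the truncated hash reveals essentially nothing about the password.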
...or CloudFlare/CloudFront plus a DynamoDB table with a primary key of the first/last n characters of the hash, and potentially a secondary index for filtering.<p>Btw, I think the indexing can be done more cheaply on Google Cloud.<p>There is also another (probably better) way, and that is to use S3: 1TB can be stored for as little as $20, and the rest is endpoint caching.<p>Luckily for all of us, it is easier than ever to single-handedly scale to millions of users at minimal cost.
Starting to get a little uncomfortable with how hard this is being pushed on HN right now.<p><a href="https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefix=false&page=0&dateRange=all&type=story" rel="nofollow">https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...</a>
CloudFlare can read every password submitted through their service and here is why that's so great...<p>It's beautifully elegant, because...<p>What? This is also the same company that spilled memory all over every cache everywhere.
This is slightly OT and probably not a popular opinion, but does anyone else feel that Troy having this massive dataset of emails is unethical?<p>I definitely believe it is illegal, and I was surprised that, during his recent visit to the US, the FBI did not arrest him.