Ummm... because it’s an O(1) array lookup not a search at all? Infuriating.<p>It’s read-only static data. Spending even 60ms on the response is ridiculous. Reading from files in blob storage... WTF?<p>Ctrl-F Redis - was disappointed.<p>Actually, even forget Redis. Pre-generate each of the 1 million possible HTTP responses and store in a string array. The 5 character hex is the index into the array. Write < 100 lines of Go to load the data structure and serve it. What am I missing?<p>This is like “Hello World” in those HTTP Framework Benchmarks that used to make the rounds every few months.
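Roughly what I have in mind, as an untested sketch: assume the ~1M pre-generated, pre-gzipped response bodies already sit on disk in a ./responses/ directory, one file per five-character hex prefix (the directory layout, port, and 30-day cache header are all made up for illustration).

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Load all 16^5 = 1,048,576 pre-generated (already gzipped) response
	// bodies into RAM, indexed by the integer value of the 5-char hex prefix.
	responses := make([][]byte, 1<<20)
	for i := range responses {
		body, err := os.ReadFile(fmt.Sprintf("responses/%05X", i))
		if err != nil {
			panic(err)
		}
		responses[i] = body
	}

	http.HandleFunc("/range/", func(w http.ResponseWriter, r *http.Request) {
		prefix := strings.ToUpper(strings.TrimPrefix(r.URL.Path, "/range/"))
		idx, err := strconv.ParseUint(prefix, 16, 64)
		if len(prefix) != 5 || err != nil {
			http.Error(w, "prefix must be 5 hex characters", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "text/plain")
		w.Header().Set("Content-Encoding", "gzip")                  // bodies are stored gzipped
		w.Header().Set("Cache-Control", "public, max-age=2592000")  // ~30 days for the edge
		w.Write(responses[idx]) // O(1) array lookup, no search
	})
	http.ListenAndServe(":8080", nil)
}
```

At ~10KB per gzipped body that's on the order of 10GB resident, which still fits comfortably on a modest dedicated box.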
Just putting this on a VPS or cheap dedicated server with 16GB of RAM would've led to sub-ms response times at much lower cost (if you don't get Azure and Cloudflare for free like he does). At those response speeds, scalability is also not really an issue if you cache aggressively at the edge.<p>Argo is then nice to have but not really necessary. If the server responds in <1ms, the 30% saved RTT is probably not noticeable to the user.
Couldn't he just pregenerate all 1,048,576 responses, load them into RAM (or just serve them as a bunch of static files from an nginx with caching on) and be done with it? He writes that a single response, gzipped, averages 10KB, so that's only about 10GB in total.<p>Even better: host this on a service like Netlify and not even have the 30-day cache timeout Troy has here (which means up to 30-day-old info in case of new breaches). Just regenerate the entire set on the dev box whenever there's a new breach (should be fast enough, it's a linear scan & split) and push it to Netlify, which will invalidate all edge caches automatically.
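A rough sketch of that pre-generation step, assuming the downloadable dump is one "SHA1HASH:count" line per row, sorted by hash, and that a site/range/ output directory already exists (both file names here are made up):

```go
package main

import (
	"bufio"
	"os"
)

// Split a sorted "SHA1HASH:count" dump into one static file per
// five-hex-character prefix, each holding "SUFFIX:count" lines, ready
// to be pushed to Netlify or dropped behind nginx.
func main() {
	in, err := os.Open("pwned-passwords-sha1.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	var (
		current string
		out     *os.File
		w       *bufio.Writer
	)
	flush := func() {
		if w != nil {
			w.Flush()
			out.Close()
		}
	}

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		if len(line) < 6 {
			continue // skip blank/short lines
		}
		prefix, rest := line[:5], line[5:] // rest = 35-char suffix + ":count"
		if prefix != current {
			flush()
			out, err = os.Create("site/range/" + prefix)
			if err != nil {
				panic(err)
			}
			w = bufio.NewWriter(out)
			current = prefix
		}
		w.WriteString(rest + "\n")
	}
	flush()
}
```

Because the input is sorted by hash, a single pass is enough; after that, push the site/ directory to Netlify (or rsync it to the nginx box) whenever a new breach lands.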
When you don't need transactions, don't have a cache invalidation problem, and are querying read-only data, this architecture makes sense - or really any architecture that takes advantage of what makes HTTP scalable: mostly idempotent, cacheable responses to GETs.
I love the k-Anonymity model. It makes it actually feasible to check passwords against HIBP when carrying out password audits for clients. Shameless plug, but I've added it to my Active Directory password auditing tool: <a href="https://github.com/eth0izzle/cracke-dit" rel="nofollow">https://github.com/eth0izzle/cracke-dit</a>
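For anyone who hasn't looked at it, the client side is tiny. A minimal sketch of the range lookup (the api.pwnedpasswords.com/range/ endpoint is the one from the post; error handling kept to a minimum): only the first five hex characters of the SHA-1 ever leave the machine, and the suffix comparison happens locally.

```go
package main

import (
	"bufio"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// pwnedCount checks a password against the Pwned Passwords range API.
// Only the first five hex characters of the SHA-1 are sent; matching
// the suffix locally is the whole point of the k-anonymity model.
func pwnedCount(password string) (string, error) {
	sum := sha1.Sum([]byte(password))
	digest := strings.ToUpper(hex.EncodeToString(sum[:]))
	prefix, suffix := digest[:5], digest[5:]

	resp, err := http.Get("https://api.pwnedpasswords.com/range/" + prefix)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// Each line is "SUFFIX:count" for every breached hash in this range.
		parts := strings.SplitN(scanner.Text(), ":", 2)
		if len(parts) == 2 && strings.EqualFold(strings.TrimSpace(parts[0]), suffix) {
			return strings.TrimSpace(parts[1]), nil
		}
	}
	return "", scanner.Err()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: pwncheck <password>")
		os.Exit(1)
	}
	count, err := pwnedCount(os.Args[1])
	if err != nil {
		panic(err)
	}
	if count == "" {
		fmt.Println("not found in any known breach")
	} else {
		fmt.Println("seen", count, "times")
	}
}
```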
Lots of suggestions in this thread about better architectures, but they all seem to forget that this is designed to be minimal in cost, complexity and maintenance while delivering 100% availability and great performance.<p>While Redis or a VM would be faster, that's way more overhead compared to a few cloud functions and table storage. This whole thing is event-driven and easy to build with just your browser, along with cheap and granular billing. Cloudflare also already caches the responses, so there's really no need for the origin to be perfect.
Somewhat related (and nitpicky), but there are some spelling errors (derrivation, seperate...) in the Cloudflare post that explains k-anonymity:<p><a href="https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/" rel="nofollow">https://blog.cloudflare.com/validating-leaked-passwords-with...</a><p>P.S.: As a non-native speaker, I had to look those words up to check them, as I trusted the spelling in an official blog post.
Here's a slightly more easily auditable version of the checker, in Python:<p><a href="https://www.pastery.net/wwzqua/" rel="nofollow">https://www.pastery.net/wwzqua/</a><p>The bash one was fine; I just prefer the readability of Python so I can be sure that only the truncated version of my hash is ever sent.
Does anyone know: if we enumerate all alphanumeric strings of length 6-16, how many would have a SHA-1 hash matching a given five-character prefix?<p>I’m certainly not saying I think this is an issue! I’m just academically curious about the number and how to go about calculating it.
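Back-of-envelope, assuming SHA-1 behaves like a uniform random function and "alphanumeric" means the 62 characters a-z, A-Z, 0-9: there are sum(n=6..16) 62^n ≈ 4.8 × 10^28 such strings, spread over 16^5 = 2^20 possible prefixes, so roughly 4.6 × 10^22 of them should hash to any given five-character prefix. A quick exact check:

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Total strings of length 6..16 over a 62-character alphabet
	// (a-z, A-Z, 0-9); "alphanumeric" is taken as case-sensitive here.
	total := new(big.Int)
	for n := int64(6); n <= 16; n++ {
		total.Add(total, new(big.Int).Exp(big.NewInt(62), big.NewInt(n), nil))
	}
	// 16^5 = 2^20 possible five-hex-character prefixes; a uniform hash
	// spreads the strings evenly across them in expectation.
	perPrefix := new(big.Int).Div(total, big.NewInt(1<<20))
	fmt.Println("total candidate strings:", total)     // ~4.8e28
	fmt.Println("expected per prefix:    ", perPrefix) // ~4.6e22
}
```

So a single prefix still corresponds to an astronomically large set of candidate strings, which is why sending the truncated hash reveals essentially nothing about the password.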
...or CloudFlare/CloudFront plus a DynamoDB table with a primary key of the first/last n characters of the hash, and potentially a secondary index for filtering.<p>Btw, I think the indexing can be done more cheaply on Google Cloud.<p>There is also another (probably better) way, and that is to use S3: 1TB can be stored for as little as $20, and the rest is endpoint caching.<p>Luckily for all of us, it is easier than ever to single-handedly scale to millions of users at minimal cost.
Starting to get a little uncomfortable with how hard this is being pushed on HN right now.<p><a href="https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefix=false&page=0&dateRange=all&type=story" rel="nofollow">https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...</a>
CloudFlare can read every password submitted through their service and here is why that's so great...<p>It's beautifully elegant, because...<p>What? This is also the same company that spilled memory all over every cache everywhere.
This is slightly OT and probably not a popular opinion, but does anyone else feel that Troy having this massive dataset of emails is unethical?<p>I definitely believe it is illegal, and I was surprised that, during his recent visit to the US, the FBI did not arrest him.