For anyone out there still using MD5 for any reason, check out this PDF file: <a href="https://www.alchemistowl.org/pocorgtfo/pocorgtfo14.pdf" rel="nofollow">https://www.alchemistowl.org/pocorgtfo/pocorgtfo14.pdf</a> (42MB). You can also rename it to a .NES file and run it in a NES emulator.<p>It's a PDF file which is also a NES ROM that displays its own MD5 sum. The PDF also shows its own MD5 sum a few times. (The MD5 sum also happens to begin with 5EAF00D.)<p>When an arbitrary MD5 can be created that easily, it's useless for any cryptographic application, or even for data integrity checks.
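If you want to check the claim yourself, here is a minimal sketch using Python's standard `hashlib`. The helper name `md5_of_file` and the local filename in the usage note are my own assumptions, not anything from the PDF.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for a 42MB download.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming you saved the download as `pocorgtfo14.pdf`, `md5_of_file("pocorgtfo14.pdf").upper().startswith("5EAF00D")` is one way to compare against the prefix mentioned above.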
Disclaimer: This is a fun thought experiment. I'm not looking for actionable results, or advocating for relying on any of this comment for actual security. I'm clearly not a cryptographer; I just think it would be interesting to talk about here, and maybe more educated people could comment on how well these approaches might mitigate the exploits in the article. Play with me in this space.<p>I'm curious if people have any interesting ideas on how to add some seasoning to MD5 to make it more secure. That is, simple, intuitive things you can do in combination with MD5 such that all the pieces in your scheme are still easily understood and don't amount to a new hash algorithm that can only be understood as a black box. Pretend MD5 is the only hash algorithm that has ever been found. Or that you're the Gilligan's Island Professor and MD5 hashes are your coconuts. What are the most potentially useful things you can build out of the most primitive, dumb components?<p>For example:<p>- Output the length of the input (or a hash of the length if you must have a constant-length output)<p>- Hash the input forwards and backwards and produce two hashes. (Remembering that, though the output is 256 bits now, you still only have coconuts to work with.)<p>- Include more complicated variations on the input in the hashes. e.g. start in the middle and oscillate forward and backward over the input, or move the second half of the input in front of the first before hashing, or use the input/hash of the input to seed a pseudorandom re-ordering of the input before hashing, etc.<p>- Format-aware hashing - whatever program will interpret the content of the file can also produce a hash, or some [canonical] interpretation of the content that can be hashed. e.g., for an image format, we could ask the renderer how many iterations of some operation it had to perform to render the output, or in the worst case, hash the bitmap it produced.
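In the spirit of the thought experiment, the first two ideas in the list above could be sketched like this in Python. The name `seasoned_md5` and the choice to concatenate the three digests are my assumptions, and nothing here is claimed to resist the known collision attacks.

```python
import hashlib


def seasoned_md5(data: bytes) -> str:
    """Combine three MD5 "coconuts" into one 96-hex-character tag.

    forward:  MD5 of the input as given
    backward: MD5 of the input with its bytes reversed
    length:   MD5 of the decimal input length (keeps the output fixed-size)
    """
    forward = hashlib.md5(data).hexdigest()
    backward = hashlib.md5(data[::-1]).hexdigest()
    length = hashlib.md5(str(len(data)).encode("ascii")).hexdigest()
    return forward + backward + length
```

Two inputs only get the same tag if they collide forwards, collide backwards, and have the same length. Whether that is meaningfully harder to arrange than a single collision is exactly the open question.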
See also: "Lifetimes of cryptographic hash functions" - <a href="https://valerieaurora.org/hash.html" rel="nofollow">https://valerieaurora.org/hash.html</a><p>MD5 appears to be firmly in the "fun party trick" stage.
Question for people into cryptography + data archiving....<p>If I want to store data for 500 years, I want future people to be reasonably sure of the integrity of the data, against both 'bit rot' and deliberate tampering.<p>Is the best available approach to hash the data with a bunch of hash algorithms and publish all the hashes?<p>Then if <i>any</i> hash algorithm remains unbroken, the integrity of my data is certainly still good. To break the scheme, an attacker would have to mount a simultaneous preimage attack against <i>every</i> hash algorithm I chose, which historically has never happened to my knowledge.
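The multi-hash idea above could be sketched roughly as follows. The algorithm list and the names `multi_digest` and `verify` are my assumptions. The key point is that the verifier has to require that every published digest matches, since a single unbroken algorithm only helps if it is actually checked.

```python
import hashlib
from typing import Dict

# Hypothetical selection; all of these names are accepted by hashlib.new().
ALGORITHMS = ("md5", "sha1", "sha256", "sha3_256", "blake2b")


def multi_digest(data: bytes) -> Dict[str, str]:
    """Hash the same data with several unrelated algorithms."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


def verify(data: bytes, published: Dict[str, str]) -> bool:
    """Accept the data only if every published digest still matches."""
    fresh = multi_digest(data)
    return all(fresh.get(name) == digest for name, digest in published.items())
```

Usage would be something like `manifest = multi_digest(archive_bytes)` at archiving time, publishing `manifest` somewhere durable, and calling `verify(archive_bytes, manifest)` centuries later.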
I asked a while ago whether it’s feasible to get another file to generate a given hash.<p>The answer is no. Not even with MD5.<p>Just be very sure that this is the guarantee you are looking for. Often, for Merkle Trees etc. that is EXACTLY what is needed.<p>Can someone craft input files (e.g. images) to fool your system? Yes, but only at their own expense.<p>Sometimes if you want the system to be resilient even in the face of malicious inputs then yes, you should use SHA256 and higher.
in the era of high fidelity generative models, i suspect that the future of media formats will be security forward with built-in protections against length extension attacks.<p>i'm having a hard time imagining any other future than one where people only trust signed media, and media is possibly even signed in hardware by actual physical sensors/compressors.
From the README:<p>"""<p>Colliding any pair of files has been possible for many years, but it takes several hours each time, with no shortcut. This page provide tricks specific to file formats and precomputed collision prefixes to make collision instant. git clone. Run Script. Done.<p>"""<p>Could anyone weigh in on whether these ideas can be generalized to speed up MD5 collisions in general?
Regarding exploits:<p>I've added some inverses of hash functions here:
<a href="https://github.com/rurban/smhasher/tree/master/inverse" rel="nofollow">https://github.com/rurban/smhasher/tree/master/inverse</a>
Side note: I recently saw some example code using TensorFlow to determine the private key of some cryptosystem. I can't find it. It was literally operating on the bits of the key and somehow had a loss function. Any ideas?