Multihash – self-describing hashes

163 points · by _prometheus · about 8 years ago

17 comments

_prometheus · about 8 years ago
Hey everyone. The SHAttered break made us release a new website for Multiformats: http://multiformats.io which includes a page explaining Multihash: http://multiformats.io/multihash

The page goes through a bunch of examples of how a multihash would work in practice, explains why it is useful to do this, and covers some FAQs.

I'm writing a post about all this that I'll push out soon, but wanted to drop the site here now.

Please, as you build new systems or start to upgrade your systems off of SHA1 (or MD5!!!), please use Multihash so that your upgrades will be much, much easier in the future. It makes rolling over to a new hash a matter of recompiling some tools, and avoids the horrible breakages that occur when tools and scripts assume certain things (like that the hash digest of git will always be 160 bits... now git faces breaking a lot of things and may have to stick with 160 bits of a truncated SHA3 or BLAKE2...).

Oh also: definitely check out BLAKE2, it's a fantastic hash function from Zooko, JP Aumasson, Samuel Neves, and Christian Winnerlein: https://blake2.net. Use it for all your hashing needs! So much faster than SHA2, and it has the same "unbrokenness" that SHA3 enjoys. (And I believe deep cryptanalysis has gone into BLAKE; Keccak is not particularly safer.)

The Multiformats site also has a Multiaddr description, but that's far from complete and doesn't have the examples. The other multiformats aren't well explained on the website yet. Sorry, coming soon :) We wanted to get this out ASAP. PRs accepted at https://github.com/multiformats/website (oh, and yes, TLS certs coming too.)
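
A minimal sketch of the layout that page describes: a varint function code, a varint digest length, then the digest bytes. The 0x12 code for sha2-256 matches the published multihash table; the helper names are illustrative, not a reference implementation.

    import hashlib

    def to_varint(n: int) -> bytes:
        # Unsigned LEB128: 7 payload bits per byte, high bit = "more follows".
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    def multihash_sha2_256(data: bytes) -> bytes:
        digest = hashlib.sha256(data).digest()
        return to_varint(0x12) + to_varint(len(digest)) + digest

    print(multihash_sha2_256(b"hello").hex()[:4])  # "1220": code 0x12, length 0x20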
DannyBee · about 8 years ago
IMHO: Using varint for this is silly, and just wastes decoding time. (Doing it because proto etc. did it, when they have *wildly* different use cases, is a bad idea.)

The example they purport to give even proves it: "For example, a 100 TB file in IPFS may have as many as 400 million subobjects, which would mean 400 million hashes. 400,000,000 hashes * (7 - 2) bytes = 2 GB"

Right, so uh, that's 0.002% of that 100 TB. (IE who cares.)

(This already appears to be the difference between what it would normally take, at 7 bytes, and what it takes with varint, which is ~17 bits per int.)

But assume we go crazy and use 8 bytes per hash. We have 3.2 gigabytes there, vs. 800 megabytes for your varint encoding. Or, for our 100 terabyte file, approximately "who cares" vs. "who cares".

If you could transfer at a gigabyte per second, the version with varint would take ~28 hours. The version with fixed-size 8 bytes would take... 2 seconds longer.

Remind me again how this is showing it's a good idea?

Also, if you really wanted to use variable-size integers, variable byte encoding is pretty much the slowest way to do it. See, e.g., https://arxiv.org/pdf/1209.2137.pdf

Using variable-sized integers here doesn't save meaningful space, but it does make it more annoying to splat these in and out of memory.
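
The trade-off being described, sketched in Python (illustrative only): a LEB128-style varint needs a data-dependent per-byte loop, while a fixed-width field is a single slice.

    def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
        """Decode one LEB128-style varint; return (value, next_pos)."""
        result = shift = 0
        while True:
            byte = buf[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not (byte & 0x80):
                return result, pos
            shift += 7

    # Fixed-width alternative: one slice, no data-dependent branching.
    def read_u64(buf: bytes, pos: int = 0) -> tuple[int, int]:
        return int.from_bytes(buf[pos:pos + 8], "little"), pos + 8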
advisedwang · about 8 years ago
Looks awesome, so long as it's easy to ensure attackers can't quietly make you use a weak hash. Some JWT implementations quietly allowed the "none" algorithm, which allowed an attacker to forge creds [1]. Any similar "pluggable" system needs to learn from this and be careful to avoid letting attackers pick weak algorithms.

[1] https://auth0.com/blog/critical-vulnerabilities-in-json-web-...
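
A sketch of the mitigation this comment implies: verify only against an explicit allowlist of strong functions, so an attacker-supplied code can't downgrade verification. The codes follow the multihash table; the check itself is hypothetical.

    # Only functions on the allowlist verify; unknown or weak codes fail closed.
    ALLOWED = {0x12: "sha2-256", 0x13: "sha2-512"}  # deliberately no md5/sha1

    def algorithm_for(fn_code: int) -> str:
        algo = ALLOWED.get(fn_code)
        if algo is None:
            raise ValueError(f"hash function 0x{fn_code:02x} not allowed")
        return algo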
smsm42 · about 8 years ago
I'm not sure I like that there's a fixed table purporting to contain IDs for all hashes ever used in the universe. It sounds both inflexible (each time somebody uses a new parameter or a new hash modification, it has to be updated) and hard to use (you have to drag this table around in all code that uses the protocol and remember to update it all the time). And it all seems in service of saving a couple of bytes; maybe it would be better to have a more verbose but more flexible scheme that makes it relatively easy to add and identify hashes and read them with many tools.

I have the same reservations about the variable-int thing: it would sometimes save 2-3 bytes, but it makes processing with simple tools much harder, since many tools make it easy to cleave arbitrary strings into fixed pieces without knowing the string's specific content, while variable-length fields make that significantly harder. Is the gain really worth the complications?

I've seen the FAQ, but I'm not sure I agree with the authors that what they did is actually best...
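
The "drag this table around" point, sketched: every consumer embeds a copy of the code table and must keep it in sync. The entries follow the published multihash codes; the lookup is illustrative.

    CODE_TABLE = {0x11: "sha1", 0x12: "sha2-256", 0x13: "sha2-512"}

    def name_for(code: int) -> str:
        # An unknown code is simply unreadable until the table is updated.
        try:
            return CODE_TABLE[code]
        except KeyError:
            raise ValueError(f"unknown hash code 0x{code:02x}; table out of date?")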
niftich · about 8 years ago
I wrote about Multihash the other day [1], but in that comment I refrained from outright criticism. However, my biggest reservation about Multihash is that it deliberately pollutes the value space of a hash digest, while in good faith claiming to domain-segregate the value space of a digest. It's got no magic number, no magic anything, really; nor can it, since the output of a hash digest is an opaque binary value with any and all outputs equally likely.

Therefore, while before multihash there was sometimes no way to figure out which hash function produced a given digest, after multihash there is no way to figure out whether a particular digest is actually a multihash value or a real raw digest. The fact that it's a multihash value has to be inferred from context or out-of-band, just like before, in which case it brings little additional value that the particular hash function is encoded *inside* the value, when it could just as well have been communicated through the same external context.

Then comes the problem of the lookup table, which was left overridable instead of globally prescriptive, so without additional context you don't even know whether to use the global lookup or someone else's special one.

I don't find their arguments laid out on the website persuasive. Really, this is all information that should have been put somewhere else, not into the digest, without affecting any of the "insecure warning" features that implementations may bring. For IPFS, the first part of the URI would have been an excellent place, function names and parameterizations and all, or, if they're trying to save space like they admit, the values from the global lookup encoded as a path element. Like it or not, each hash function creates its own sub-tree of all of its possible digests, and multihash values on their own are not suitable to globally and unambiguously describe resources, because they're only deterministic if it's known that the context is "just like the defaults in IPFS". And for these reasons, I feel it muddies content-addressing with no real gain, while all of its good ideas could have been implemented differently inside of IPFS to better effect.

[1] https://news.ycombinator.com/item?id=13726589
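
The ambiguity being described can be shown directly: nothing in-band marks a value as a multihash, so a naive decoder will accept a bare digest whose leading bytes happen to look like a header. A hypothetical illustration:

    import hashlib

    # 64 opaque bytes from a real sha2-512 digest; nothing marks them
    # as "not a multihash".
    raw = hashlib.sha512(b"some file").digest()
    code, length = raw[0], raw[1]
    # If `code` happens to match a table entry and `length` is plausible,
    # a naive decoder will "successfully" parse this bare digest as a
    # multihash; there is no magic number to reject it.
    print(f"leading bytes: 0x{code:02x} 0x{length:02x}")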
dwheeler · about 8 years ago
Multihash looks cool. It looks a little tricky to apply directly to git, though. The problem is that git and git users treat the first few bytes specially: git splits object directories on the first few bytes, and users particularly use the first few bytes to distinguish between commit ids.

I wonder if multihash might be useful in git if slightly reordered (so that at least the first several bytes of the hash output come first).

I'm thinking out loud; I'd be interested in hearing others' thoughts.
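
A sketch of the reordering being floated here; this digest-first layout is the commenter's idea, not part of the multihash spec.

    def digest_first(fn_code: int, digest: bytes) -> bytes:
        # Trailer instead of header: git-style directory sharding and
        # prefix abbreviation keep seeing digest bytes first.
        return digest + bytes([fn_code, len(digest)])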
cryptarch · about 8 years ago
Cool to find this on HN!

Just last Thursday I added this to an "Oh fuck, we're still using MD5 for passwords?" issue. Multihash should provide a really easy way of migrating to a new hash once we've "Multihashified" all existing MD5 hashes in the db.
besselheim · about 8 years ago
Seems rather like the ASN.1/DER-based encodings already used in crypto to describe hash outputs, except using an integer rather than an OID to describe the hash type.
jnwatson · about 8 years ago
Crypto has a long history of using ASN.1 for its data types. I'd prefer defining the data types in that and calling it a day. It isn't like DER would be any harder to parse than a new byte-oriented custom format like this.
runeks · about 8 years ago
> It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist.

Is the suggestion that applications, which have now switched to SHA-256, should serialize it using Multihash, in order to be able to easily switch to a new hash function in 15 years, when SHA-256 is broken?

I think, for most cases, it makes more sense for a protocol/application to just define that v1 uses SHA-256 and, if this ever breaks, update the protocol/application version (rather than change the protocol's hash serialization format), unless you explicitly need to support many different hash functions.

For example, let's say your app needs to securely hash many files and store all these hashes somewhere. Would you rather serialize thousands of hashes, all of the same type, each with a prepended identifier and length? Or would it make more sense to put the hash function identifier in a single place and use it for all the hashes? So perhaps Multihash would benefit from separating the versioning scheme from the serialization format of the hash itself.
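
A sketch of the alternative this comment proposes, with the hash function named once for the whole collection; the format and names are hypothetical.

    import hashlib

    def pack_digests(files: list[bytes]) -> bytes:
        header = b"sha2-256\n"  # the identifier appears once, not per hash
        body = b"".join(hashlib.sha256(f).digest() for f in files)
        return header + body    # each digest is a fixed 32 bytes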
jvehent · about 8 years ago
ASN.1 just called and wants its encoding back...
nickcw · about 8 years ago
MD5 doesn't seem to be in the table...

That is going to make it harder to change code to use multihash and then migrate away from MD5 gradually.

MD5 may be broken as a crypto hash, but there are probably more MD5 hashes in the wild than any of the others in the table (e.g. one on every S3 object!), so not having a representation for it seems short-sighted.
philsnow · about 8 years ago
I like the idea of multihash. Using a registry mapping hash names to numbers seems like it might hinder adoption, though.

If I want to use a hash that isn't yet in multihash's registry (or never will be, for whatever reason, but I nonetheless want to use it), what do I do?

I could make a forked copy of the registry where I allocate my own number to the hash, but then I have to guess very well which numbers will not be used by future allocations in multihash. If I guess wrong, I might have to go through every hash I've encoded with my forked registry and re-encode them, which could be very onerous if they're spread across S3, Glacier, cold storage, etc.

If there is a well-defined set of numbers set aside for personal / non-sanctioned use, I have the same problem when examining hashes from other parties, because they may have chosen a different number for a particular hash, or we may have conflicts. Likewise, when a hash "graduates" from the personal space to the official mapping, all the old hashes are now encoded "wrong": they'll keep working, as long as you ensure the hash keeps a number in the personal space, but now you have to choose, when making new hashes, whether to use the official number "for consistency (with the standard)" or the personal one, also "for consistency (with your existing hashes)".

It looks like every hash in multihash has a number (like 0x13), a name (like "sha-512"), and a description ("The SHA512 Cryptographic Hash Function") [0]. I suggest letting people use e.g. "{name}:{len}:{hash}" if they choose to make the tradeoff in favor of readability / portability (with lower performance and marginally higher storage cost).

Then, when those hashes get read by some other person's copy of multihash, it will either map the {name} to the local number (if it's different from the number of the multihash library that encoded the hash), or, if the {name} doesn't show up, it can throw an appropriate error. This is forwards compatible, and I think it will allow more people to be comfortable adopting multihash, because they won't have to think so much up-front about how to encode hashes that aren't in the registry.

[0] https://github.com/multiformats/multihash/pull/41/files#diff...
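
A sketch of the proposed "{name}:{len}:{hash}" text form, with the name resolved through a local registry; the registry contents and function names are illustrative.

    LOCAL_REGISTRY = {"sha-512": 0x13, "sha2-256": 0x12}

    def parse_text_hash(s: str) -> tuple[int, bytes]:
        name, length, hexdigest = s.split(":")
        if name not in LOCAL_REGISTRY:
            raise ValueError(f"unknown hash name {name!r}")  # graceful failure
        digest = bytes.fromhex(hexdigest)
        if len(digest) != int(length):
            raise ValueError("length field does not match digest")
        return LOCAL_REGISTRY[name], digest

    code, digest = parse_text_hash("sha2-256:32:" + "ab" * 32)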
y7 · about 8 years ago
I wonder if storing the hashing parameters next to the hash is the right approach for systems to take. Your system isn't going to support all hashes (that adds a lot of library bloat, and you don't want to support weak hashes), so you'll probably want one specific algorithm/parameter set, with room for upgrades in case weaknesses get discovered (not likely to occur often). A single byte/varint, starting at 0 and incrementing for every new parameter set, sounds reasonable. A downside would be interoperability, but if that's your main concern, then perhaps a human-readable string representation of the hash algorithm/parameters would be preferable.

What are the benefits of using Multihash?
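
A sketch of the per-application scheme described here: a single version byte, starting at 0 and incremented for each new parameter set. The table is hypothetical.

    import hashlib

    HASHERS = {0: hashlib.sha256}  # version 0: sha2-256; add v1 only if needed
    LATEST = 0

    def versioned_hash(data: bytes) -> bytes:
        # One leading version byte instead of a global registry code.
        return bytes([LATEST]) + HASHERS[LATEST](data).digest()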
barrystaes · about 8 years ago
I don't get the trend of security hashes / encryption trying to describe themselves with a header; it just makes breaking the hashed / encrypted data faster... no humans needed, fully automated hacking at scale!
mtdewcmu · about 8 years ago
This seems kind of like a cross between a UUID and a plain hash. Perhaps it would be better to incorporate the hash into a UUID. UUIDs already have metadata.
tehlike · about 8 years ago
Why not just encode a proto:

    message Hash {
      optional HashFunction hash_function = 1;
      optional bytes digest = 2;
    }