Hello Hacker News,

I would really like to get some input from the HN community on a topic I've been researching.

It's a bit of a legend/vague story, but I'm trying to find out whether there's any truth to it.

In short: Dutch electronics technician Jan Sloot (1945-1999) supposedly invented an incredible compression technique that could potentially change the digital landscape. Investors like Tom Perkins and Dutch billionaire Marcel Boekhoorn were interested in signing a deal.

There's a lot of mystery around the actual technology, but what I could find was this:

Link: jansloot.telcomsoft.nl/Sources-1/More/CaptainCosmos/Not_Compression.htm

It's an article by a fan website that discusses the technology a little; at least, their interpretation of what the technology was.

What the article claims, in summary, is this: if a shared 'magic dictionary' of, say, 4 GB exists on both clients and the only data transmitted is a set of references into that dictionary, a massive reduction in transmitted file size can be achieved when the goal is transmitting/distributing the data to many parties (e.g. video streaming services like YouTube).

The analogy given is a PDF document that references fonts instead of embedding them, thereby reducing the size of the file that's transmitted.

What I would like to know from you is whether there's any potential for file size reduction in the above system, and/or whether something like this already exists today, perhaps executed in a similar fashion by other companies.

Perhaps some of these concepts are used in modern compression formats like 7z; I'm not familiar enough with those to tell.

Long story short, I'm not experienced enough to judge the technology described, but I'm curious to hear your thoughts on the matter.

Kind regards,
Rick Lamers
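
P.S. To make the claimed scheme concrete, here is my own toy sketch of how I read the article (this is just my interpretation, not Sloot's actual method; the chunk size and the hashing are made up): both sides hold the same dictionary blob, and the sender transmits a small reference for every chunk it can find in it, falling back to sending the raw bytes otherwise.

    import hashlib

    CHUNK = 4096  # made-up chunk size; Sloot's real parameters are unknown

    def build_index(dictionary_blob):
        """Map the hash of every CHUNK-sized block of the shared blob to its offset."""
        index = {}
        for off in range(0, len(dictionary_blob) - CHUNK + 1, CHUNK):
            index[hashlib.sha256(dictionary_blob[off:off + CHUNK]).digest()] = off
        return index

    def encode(data, index):
        """Emit a tiny ('ref', offset) token for chunks found in the shared
        dictionary; anything not found has to be sent literally, with no saving."""
        out = []
        for off in range(0, len(data), CHUNK):
            chunk = data[off:off + CHUNK]
            key = hashlib.sha256(chunk).digest()
            out.append(("ref", index[key]) if key in index else ("lit", chunk))
        return out

    def decode(tokens, dictionary_blob):
        return b"".join(dictionary_blob[v:v + CHUNK] if kind == "ref" else v
                        for kind, v in tokens)

The question I can't answer myself is how often a chunk of an arbitrary video file would actually be found in a 4 GB dictionary.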
Shared information at each endpoint can be used to communicate efficiently, but it doesn't qualify as "data compression" as that is currently defined. For example, if you have a library of books on your end, then I can send you a few-byte number (the ISBN) to identify a specific book instead of sending a normally compressed copy of the whole book. If you compare the few-byte ISBN to the byte size of the book, the compression ratio is of course phenomenal, but this normally isn't considered data compression; it is only related to it.

A real-life example is the Binary Delta Compression that Microsoft uses in its Windows Update protocol. If you make minor changes to a file in going from version 1.0 to 1.1, say, then instead of transmitting the whole file WU just sends a file identifier plus the bits that need to be modified. This is a trade-off between sending data and doing computation at the sending and receiving ends.

Every so often I read about someone who has the idea to generalize this: use/index the standard Windows installation (or something else common to most computers) as a "library" to reference. Then, to "compress" a file, you send the sequence of pointers into the library needed to recreate the file at the other end. The problem with this is twofold. First, it takes a lot of computing power to analyze a file and break it down into library references. Second, you are on the wrong side of 2^n: as the file size increases, the probability that it, or even large pieces of it, is contained in the library quickly goes to zero.

The only way Sloot's idea would work is if he had found a way to generalize this for random (i.e. already compressed or high-entropy) data, plus a method for defining an easily created library on the endpoints that is comprehensive and independent (i.e. a basis) for random data. That is a tall order.
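
To put rough numbers on the "wrong side of 2^n" point, here's a back-of-envelope estimate (mine, and an over-estimate, since it uses a union bound) of the chance that a specific random n-byte chunk shows up anywhere in a 4 GB library:

    # Probability that a given random n-byte chunk appears at any of the
    # ~L possible starting positions in an L-byte library. Each position
    # matches with probability 256**-n; summing them (union bound) gives
    # an upper bound on the true probability.
    def p_chunk_in_library(n_bytes, library_bytes):
        return min(1.0, library_bytes * 256.0 ** -n_bytes)

    for n in (2, 4, 8, 16, 32):
        p = p_chunk_in_library(n, 4 * 2**30)  # a 4 GB shared library
        print(f"{n:2d}-byte chunk: P(found) <= {p:.3g}")

A reference into a 4 GB library already costs about 4 bytes, and by the time a chunk is long enough for a reference to save anything, the probability of finding it in the library is effectively zero for random data.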
The idea of using shared compression dictionaries like this is not new. Common compression libraries (e.g. zlib) let you supply a preset dictionary. On the web, there's a Google spec called SDCH (Shared Dictionary Compression over HTTP) that does the same thing.
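
For instance, Python exposes zlib's preset-dictionary support directly. Both sides have to agree on the dictionary out of band, and the saving is only as big as the overlap between the dictionary and the data (the dictionary and message below are toy examples):

    import zlib

    # The shared dictionary: both endpoints must already hold this exact byte string.
    shared_dict = b"the quick brown fox jumps over the lazy dog. " * 100

    message = b"a lazy dog and the quick brown fox jumps over the fence."

    plain = zlib.compress(message, 9)

    co = zlib.compressobj(level=9, zdict=shared_dict)
    with_dict = co.compress(message) + co.flush()

    do = zlib.decompressobj(zdict=shared_dict)
    assert do.decompress(with_dict) + do.flush() == message

    print(len(message), len(plain), len(with_dict))  # with_dict comes out smallest

Brotli takes the same idea further with a built-in static dictionary (roughly 120 KB) of common web text, which is part of why it beats gzip on small HTML/CSS/JS responses.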