
The $5000 Compression Challenge (2001)

181 points by ekiauhce 6 months ago

31 comments

omoikane 6 months ago
The original email thread was from 2001, and it gets posted to HN periodically:

https://news.ycombinator.com/from?site=patrickcraig.co.uk

For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):

http://prize.hutter1.net/

https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)
teo_zero 6 months ago
Mike supports $SPORT team the Compressors. He's so sure they are unbeatable that he accepts 50:1 bets against them. Patrick bets $100 that they won't win this year's championship; Mike accepts. Later, the Compressors announce financial troubles and can't pay the fee to enter the championship, which is then won by another team. Patrick claims his $5000. Mike refuses to pay, saying that the Compressors have not been beaten, and that the championship result does not disprove his claim that the Compressors are unbeatable, which was the spirit of the bet from the beginning. Mike offers to refund the $100. Patrick holds that the terms of the bet were met and he deserves the $5000.
stavros 6 months ago
Sorry, but if you're trying to hustle people by charging $100 per try, and you don't catch the sleight of hand in the "multiple files" question and accept, then you were beaten at your own game, fair and square.
fxtentacle 6 months ago
This guy clearly failed because he didn't actually do any compression; he just abused the filesystem to store parts of the data and then tried to argue that metadata was not data...

But FYI, someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml
ball_of_lint 6 months ago
This strategy, or something like it, legitimately wins the challenge.

The challenge (based on expected value) is not to find some compression scheme that will compress 100% or even 50% of files to less than their original size. Instead it's to find any (set of) compression schemes that will compress more than 1/50 = 2% of arbitrary random files to less than their original size. You can construct such a program essentially as they described, by asking for a large file size and then searching for a valid decompression program within that file. With standard tools you can make some valid decompression program quite short, which gives a reasonable probability that it will appear within a large randomly generated file.

The cost you pay for using a compression scheme like this is that in the cases where it works you'll win some bytes, but in the more common case where it doesn't you'll lose some, even for the simple `cat $1 > original` default decompressor.

Now, a clever contest creator might arbitrarily choose files that win against any such decompressor they can think of, but Mike appears not to have done so.
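As a back-of-the-envelope illustration of that argument (my own sketch, not anything from the thread; the stub sizes are invented for the example), the snippet below estimates how likely a fixed k-byte decompressor stub is to appear verbatim inside an n-byte random file, next to the 1/50 break-even rate mentioned above:

```python
# Rough estimate: probability that a fixed k-byte decompressor stub occurs
# somewhere in n uniformly random bytes, vs. the 2% break-even win rate for
# a $100-per-attempt, $5000-payout bet. Stub sizes are illustrative only.

def p_stub_appears(k: int, n: int) -> float:
    """Approximate P(a fixed k-byte string occurs in n random bytes).

    Treats the n - k + 1 window positions as independent, a standard
    (slightly optimistic) approximation.
    """
    p_window = 256.0 ** -k
    return 1.0 - (1.0 - p_window) ** (n - k + 1)

break_even = 1 / 50   # need to win more than ~2% of attempts, per the comment
print(f"break-even win rate: {break_even:.2%}")

for k in (3, 4, 5):
    for n in (10**6, 10**9, 10**12):
        print(f"{k}-byte stub in {n:>14,}-byte file: {p_stub_appears(k, n):.3e}")
```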
piannucci 6 months ago
:shocked_pikachu:

Renegadry aside, for those who are more interested in the Information Theory perspective on this:

Kolmogorov complexity is a good teaching tool, but hardly used in engineering practice because it contains serious foot-guns.

One example of defining K complexity(S, M) is the length of the shortest initial tape contents P for a given abstract machine M such that, when M is started on this tape, the machine halts with final tape contents P+S. Obviously, one must be very careful to define things like "initial state", "input", "halt", and "length", since not all universal machines look like Turing machines at first glance, and the alphabet size must either be consistent for all strings or else appear as an explicit log factor.

Mike's intuitive understanding was incorrect in two subtle ways:

1. Without specifying the abstract machine M, the K complexity of a string S is not meaningful. For instance, given any S, one may define an abstract machine with a single instruction that prints S, plus other instructions to make M Turing complete. That is, for any string S, there is an M_S such that complexity(S, M_S) = 1 bit. Alternatively, it would be possible to define an abstract machine M_FS that supports filesystem operations. Then the complexity using Patrick's solution could be made well-defined by measuring the length of the concatenation of the decompressor P with a string describing the initial filesystem state.

2. Even without adversarial examples, and with a particular M specified, uniform random strings' K complexity is only *tightly concentrated around* the strings' length plus a machine-dependent constant. As Patrick points out, for any given string length, some individual string exemplars may have much smaller K complexity; for instance, due to repetition.
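For reference, the two caveats can be written out in standard notation (these are textbook statements, not something from the original exchange):

```latex
% Machine-relative Kolmogorov complexity of a string S:
\[
  K_M(S) \;=\; \min \{\, |P| \;:\; M \text{ started on } P \text{ halts with output } S \,\}
\]
% Invariance: for universal machines U and V, the choice of machine only matters
% up to an additive constant c_{U,V} (point 1 above: without fixing M, that
% constant can be made arbitrarily favorable).
\[
  \forall x \quad \bigl| K_U(x) - K_V(x) \bigr| \;\le\; c_{U,V}
\]
% Concentration (point 2 above): among uniform random strings of length n, at most
% a 2^{-k} fraction can be compressed by more than k bits, while every string obeys
% the machine-dependent upper bound.
\[
  \Pr_{x \in \{0,1\}^n} \bigl[ K_U(x) < n - k \bigr] \;<\; 2^{-k},
  \qquad
  K_U(x) \;\le\; n + c_U \;\; \text{for all } x \in \{0,1\}^n.
\]
```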
its-summertime 6 months ago
> I think that I did not make a mistake except to believe that you truly intended to attempt to compress the data I was sending you.

Weird phrase to hear from a digital carnival scam artist.
ccleve 6 months ago
I wonder if it would have been possible to win the challenge legitimately?

If a randomly generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.

It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.

The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.
teaearlgraycold 6 months ago
Don’t offer up a game you aren’t willing to lose. This should have had a mathematical definition and required a formal proof if Mike wanted to play it this way. By allowing for word play he made this a game of interpretation and not a true mathematical game.
echoangle 6 months ago
I didn't quite get the method used to „compress" the data from the article, so maybe this rephrasing helps someone:

You basically split the file every time you encounter a specific character, and your decompressor just combines all the files it finds, re-inserting the character you split by. If you split at every „X" character, which might occur 1000 times in the file, the decompressor only needs a small script which joins all the files and puts an „X" between them, which is less than 1000 bytes. The „trick" is storing the locations of the Xs you removed in the file sizes of the individual files.
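A minimal sketch of that trick (my own illustration of the mechanism described above, not Patrick's actual scripts; the filenames and split byte are assumptions): the "compressor" drops every occurrence of one byte by splitting into numbered chunk files, and the "decompressor" re-inserts it when joining, so the removed positions live in the chunk boundaries, i.e. in filesystem metadata.

```python
import glob

SPLIT_BYTE = b"X"  # assumed marker; in practice you pick a byte that occurs often in the data

def compress(src: str, prefix: str = "chunk") -> None:
    """Split src at every occurrence of SPLIT_BYTE into numbered chunk files.

    The total bytes written are len(data) - count(SPLIT_BYTE): every removed
    byte is encoded in where one chunk ends and the next begins.
    """
    data = open(src, "rb").read()
    for i, part in enumerate(data.split(SPLIT_BYTE)):
        with open(f"{prefix}{i:06d}", "wb") as f:
            f.write(part)

def decompress(dst: str, prefix: str = "chunk") -> None:
    """Rejoin the chunk files, re-inserting SPLIT_BYTE between them."""
    parts = [open(p, "rb").read() for p in sorted(glob.glob(prefix + "*"))]
    with open(dst, "wb") as f:
        f.write(SPLIT_BYTE.join(parts))

if __name__ == "__main__":
    compress("original.dat")          # hypothetical input file
    decompress("restored.dat")
    assert open("original.dat", "rb").read() == open("restored.dat", "rb").read()
```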
johnfn 6 months ago
This is hilarious, but Mike's behavior boils my blood. To switch the rules when fairly beaten is just so scummy!

Of course Patrick's solution occurred to me immediately once he started talking about splitting up the file. :)

Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you'll find some such that the result ends with n 0s (say), so then your result could be something like encrypted + key + n. Typically you'd expect the length of the encryption key to be roughly the number of 0s, so you wouldn't actually net any savings. But I think if you searched long enough, you might "get lucky" and find a key that generated more 0s than the length of the key, so you could theoretically win the challenge. Of course you'd have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn't seem impossible like Mike seems to suggest. Perhaps it just requires an intractable amount of computing power. :)

I'm not sure if any of that made sense. I think I'm using the wrong terms for things.
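The bookkeeping in that idea can be made concrete with a toy search (my own sketch, not anything Patrick or Mike proposed; the SHA-256 "keystream" is an arbitrary stand-in cipher, and as the comment itself suspects, the expected savings come out to zero):

```python
import hashlib
import os

def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream for the sketch: concatenated SHA-256(key || counter) blocks."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def trailing_zero_bytes(b: bytes) -> int:
    count = 0
    for byte in reversed(b):
        if byte != 0:
            break
        count += 1
    return count

data = os.urandom(1024)               # stand-in for (a slice of) the random file

best_key, best_zeros = b"", -1
for key_int in range(2 ** 16):        # toy search over all 2-byte keys (slow but bounded)
    key = key_int.to_bytes(2, "big")
    ct = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
    zeros = trailing_zero_bytes(ct)
    if zeros > best_zeros:
        best_key, best_zeros = key, zeros

# "Savings" would require more droppable trailing zero bytes than the key costs
# to store; typically best_zeros is 1 or 2 against a 2-byte key, i.e. no net gain.
print(f"best key {best_key.hex()}: {best_zeros} trailing zero bytes, key is {len(best_key)} bytes")
```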
Xcelerate 6 months ago
There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM is fixed, you're out of luck for random data. But if the UTM is unspecified, then there exist an infinite number of UTMs that will compress any specific string.

To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the *choice of language* that encodes the missing data. For example, suppose we have ~17k programming languages to choose from: the language selection itself encodes about 14 bits (log2(17000)) of information.

If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.

There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x, |K_U(x) - K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.

Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.

Nevertheless, it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.
avmich 6 months ago
I particularly like this:

> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.
PaulKeeble 6 months ago
I would not have accepted multiple files nor scripting. Scripting offers up the entire contents of the Linux operating system, and this could allow you to use operating system files as partial data. To really test compression/decompression the rules need to be a lot tighter. The challenge was met and beaten because the criteria were bad. It leaves open a number of dishonest approaches.

The decompression program needs to be constrained to itself and the singular file to decompress, and cannot utilise anything other than basic operating system functions for the purpose of reading and writing the constrained files. Any bytes the program accesses otherwise should be counted as part of the total.

It's very hard to constrain this without some form of cheating being possible, especially if you have some years to set it up.
dang 6 months ago
Related. Others?

*The $5000 Compression Challenge* - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)

*The $5000 Compression Challenge* - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)

*The $5000 Compression Challenge* - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)
theyapper 6 months ago
This compression challenge discussion raises fascinating questions about how we approach data representation. While I agree with the core information theory principles being cited, I wanted to share something relevant - US Patent 12,136,933 B2 was just granted for a fascinating approach called "Kinetic Data Primers" (KDP).

Let's step back and consider a different perspective that builds on some key mathematical principles:

Every binary file, regardless of its content or randomness, can be viewed as representing one specific large number. This isn't controversial - it's just a different way of looking at the data.

Many large numbers can be expressed more efficiently using mathematical notation (e.g., 10^100 is far more compact than writing out 1 followed by 100 zeros). Furthermore, this efficiency advantage tends to increase with larger numbers.

These numerical conversions are perfectly lossless and reversible. We can go from binary to decimal and back without losing any information.

This naturally leads to some interesting questions: What happens to our usual compression impossibility proofs when we consider perfect numerical transformations rather than traditional pattern matching? Could mathematical expressions capture patterns that aren't obvious in binary form? As numbers get larger, does the potential for more efficient mathematical representation increase?

The KDP patent explores some of these ideas in depth. I'm not claiming this solves all compression challenges - but I think it highlights how fresh perspectives and mathematical approaches might reveal new ways of thinking about data representation.

Would be curious to hear others' thoughts, especially from those with expertise in number theory or information theory. How do these mathematical properties interact with our traditional understanding of compression limits?
Panzer04 6 months ago
I would have expected it to be possible to compress a single arbitrary random file with a small program. I would have thought an RNG could generate a file with some weakness that allows you to compress it, although said compressor would be worse at other inputs.

Likewise, this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P
vunderba 6 months ago
Story time. In our Data Structures and Algorithms class during my undergrad, we had a similar challenge posed to us by an associate professor who was quite proud of the fact that it had never been beaten.

The sophistication of my compression understanding at the time extended to Lempel-Ziv and Huffman and, well... that's about it. So I was like, "Okay, I'm not going to create the world's best compression method on a whim." So I started thinking: if I can't compress the file, maybe I can just fetch it from somewhere else.

The first thing I tried was writing a program that searches online by uploading the file onto a public FTP. From there, I could fetch it. Simple enough, right?

Well, here's the issue with that: there's no guarantee I'd even have internet connectivity. That's probably the first thing he'd block.

So, he gives us the file. I think it was something like 10 MB, and when I looked through it, the entropy was *high*. Like, really high. Standard compression tools like 7z or RAR wanted *NOTHING* to do with it.

Then I started thinking again. Maybe this guy's running all the submissions back-to-back. He's probably processing them in sequence because he's got tons of entries coming in. So I thought: if I can't download the file, maybe there's a chance it's already on the machine from somebody's previous simulation - I can find it *somewhere on the hard drive*. My program might not be a great decompression system, but you know what it *is*? A fantastic file search.

So, I threw together a program that tree-searched the entire disk with a simple checksum to verify I'd found the exact right file. If it matched, I'd just copy the file into the current directory and say, "I'm done." - though if it found the file too quickly it would continue idling for a while so as to avoid suspicion. (I was also banking on the fact that someone else's program might've run before mine, leaving the file lying around somewhere for me to grab.)

Of course, I couldn't make this look *too* obvious. I mean, imagine submitting a tiny C++ program claiming it could compress a 10 MB file into just a few kilobytes. He'd immediately know it was fake.

So, I padded the compressed binary, adding a bunch of random hexadecimal and binary junk to bulk it up to around 5 MB. It was still phenomenal - better than any other method on the market for this type of file - but at least it seemed *somewhat* plausible. I mean, maybe I'd "discovered" some revolutionary new compression algorithm or something and smashed the Weissman score.

The contest went on for nearly a month, and the guy had a leaderboard tracking scores. Most people were hovering just slightly above standard compression rates - around 9 MB. Nothing groundbreaking, though.

So, it's a Friday night. I finish the program, pack it all up, and send it off. Then I go to bed. Needless to say, I woke up with quite an excited response from my professor.
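In the spirit of that story, a stripped-down sketch of what such a "decompressor" might look like (my own reconstruction for illustration, not the actual submission; the digest value, target size, and paths are placeholders):

```python
import hashlib
import os
import shutil

# Hard-coded digest and size of the 10 MB target file (placeholder values for the sketch).
TARGET_SHA256 = "0123456789abcdef" * 4
TARGET_SIZE = 10_000_000            # cheap pre-filter: skip files of the wrong size

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def find_and_copy(root: str = "/", out: str = "decompressed.dat") -> bool:
    """Walk the disk, hash candidate files, and copy the one matching the checksum."""
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) != TARGET_SIZE:
                    continue
                if sha256_of(path) == TARGET_SHA256:
                    shutil.copy(path, out)
                    return True
            except OSError:
                continue
    return False

if __name__ == "__main__":
    print("found" if find_and_copy() else "not found")
```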
samatman 6 months ago
I think Mike is correct. My reasoning by analogy follows.

There's a game called Skedaddle, and a Usenet group, rec.players.skedaddle. A well-known player, Mike, offers a $5000 challenge for a feat, called the four-froot-hoot, which he believes is simply impossible to achieve.

A fellow skedaddling enthusiast, Patrick, takes him up on the challenge, some baseline terms are negotiated, and Patrick presents his putative four-froot-hoot. Some might say it meets the terms of the challenge.

Mike objects. He says "you're a rat, Pat! Look at the FAQ, Jack, that's *not even skedaddle*", and indeed, the FAQ of rec.players.skedaddle does state in plain language that it ain't even skedaddle, so it can't be a four-froot-hoot.

You make a bet in the club, you play by the club rules.
moonchild 6 months ago
Suppose the data were generated by a CSPRNG with 256 bits of state. Then in fact there are just slightly more than 256 bits of entropy there (PRNG state plus the length of the generated file). Only, you'd have a snail's chance in hell of actually finding that seed.
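At toy scale, the seed-recovery idea looks like the sketch below (illustration only: Python's Mersenne Twister stands in for a real CSPRNG, and a 20-bit seed stands in for the 256-bit state, precisely because searching 2^256 is hopeless):

```python
import random

SEED_BITS = 20          # toy stand-in for the 256 bits of CSPRNG state
FILE_LEN = 1 << 16      # length of the "random" file
PREFIX = 16             # cheap first-pass comparison length

def generate(seed: int, n: int) -> bytes:
    return random.Random(seed).randbytes(n)

# The challenger generates the file from a secret seed...
secret = random.randrange(1 << SEED_BITS)
data = generate(secret, FILE_LEN)

# ...and the "compressor" brute-forces the seed. The compressed form would then
# be just (seed, length): a few bytes, exactly as the comment says.
for seed in range(1 << SEED_BITS):
    if generate(seed, PREFIX) == data[:PREFIX] and generate(seed, FILE_LEN) == data:
        print(f"recovered seed {seed}: the file collapses to ~{SEED_BITS // 8 + 3} bytes")
        break
```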
andrei-akopian 6 months ago
I thought he would try to net a profit through probability, by asking for the file to be regenerated and hoping to get a compressible one at least once in 5000/100 = 50 tries. (Though I don't think that would work.)
nojvek 6 months ago
I'd take this bet.

> With this offer, you can tune your algorithm to my data.

One can generate a 1 GB or 10 GB file. It is highly likely that there is some form of repeatable pattern in there to shave off 50-100 bytes via a sliding-window search.

Then the decompressor is essentially: at this index, expand this pattern. The compressed file excludes the range of the pattern.

One may not always get such a pattern, but over multiple tries it's very feasible. The payout just needs one win in 50 tries.

You could generate a 100 GB file. The bigger the file, the higher the chance of a repeating pattern.

The challenge is won if compressed_file + decompressor is at least one byte smaller than the original file.

One could have a self-executing decompressor to save some file-overhead bytes.
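A sketch of that repeated-pattern hunt (my own illustration; the window size and file size are arbitrary, and the catch is that the longest repeat one can expect in an n-byte random file is only about 2·log256(n) bytes, which is roughly what the back-reference itself costs to store):

```python
import os

WINDOW = 4                 # length of repeat to look for, in bytes (illustrative)
FILE_SIZE = 1 << 20        # 1 MiB stand-in; the comment imagines 1-100 GB files

def find_repeat(data: bytes, window: int):
    """Return (i, j) offsets of two occurrences of the same window, or None."""
    seen: dict[bytes, int] = {}
    for i in range(len(data) - window + 1):
        chunk = data[i:i + window]
        if chunk in seen:
            return seen[chunk], i
        seen[chunk] = i
    return None

data = os.urandom(FILE_SIZE)
hit = find_repeat(data, WINDOW)
if hit:
    i, j = hit
    # A "compressed" file could drop the copy at j and record (j, i, WINDOW).
    # The catch: those offsets plus the decompressor itself tend to cost at
    # least as many bytes as the WINDOW bytes saved.
    print(f"{WINDOW}-byte repeat at offsets {i} and {j}")
else:
    print("no repeat of that length found")
```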
Uptrenda 6 months ago
Yeah, I'd strongly, strongly, in fact plead for people not to try this challenge. When it comes to 'compressing' random data: every single bit in random information is 'significant.' I am quite sure that the problem is logically meaningless, mathematically impossible, and a complete waste of effort. But more so: if it weren't a waste of effort (it is), I would still highly doubt any algorithm existed that was more efficient than brute-force search. In this case, the search for a function that solves the constraints is going to be like trying to brute-force a cryptographic private key. Simply because every 'bit' of entropy is valuable information.

Now let's look at how modern compression actually works. Images, movies, text documents, audio files... there is fundamental --structure-- in all this data. In text it might be using only a certain character range, a certain lexical word list, and so on. There may be geometric information in images and movies that could be turned into code to drastically cut down file size. With audio, you can sample it and discard what you don't need. And so on and so forth. Structure. But how do you apply any of these methods to random data? There is no structure; you can't sample it, reduce it, simplify it... Every bit has meaning.

So please, dear human thinking about compression of random data: it is an insane waste of intellectual effort. It's impossible. Don't do it. I made the same mistake (and set up vast amounts of HPC to work on algorithms). COMPLETE waste of time.
hyperpape 6 months ago
It’s amazing that this commentary on the value of prediction markets was written in 2001.
GrumpyNl 6 months ago
Is there a link to the file that he had to compress?
trod1234 6 months ago
This gets reposted quite often.

The crux of the ongoing debate over the post has to do with the ambiguity of the stated requirements.

For there to be any rational discussion, you need to have proper unique definitions for what is being communicated. Without a definition (identity), no system of truth seeking will provide a definitive truth.

Most notably, "compression" has many different meanings currently in practical use. For example, many people use the word compression and simply mean that the uncompressed size is more than the compressed size; lossy vs lossless doesn't matter when it's good enough, all that matters is that it's been reduced, and an example of that might be H.264 video.

Other more rigorous individuals will use it to mean entropy encoding, which has a strict definition, part of which is lossless, and in these cases they often view Shannon's theorems as gospel, and at that point they often stop thinking and may not try to sharpen their teeth on a seemingly impossible task.

These theorems are often very nuanced, with the bounds very theoretical or still undiscovered, though still seemingly correct. When someone is told something is impossible, they don't fully apply themselves to try to find a solution. It's a worn path that leads nowhere, or so they imagine.

This mindset of learned helplessness prevents a nuanced intuitive understanding of such problems, and any advances stagnate. People take the impossibility at face value and overgeneralize it (without that nuanced understanding); this is a fallacy, and it can be very difficult to recognize when there is no simple way to refute it on its face.

Here is something to consider. Shannon's source coding theorem relies on the average uncertainty of probabilities as a measure of entropy (information). Some exploration is needed to understand what makes this true, or where it fails. It has been my experience that only a very few exceptional people ever attempt this process.

The main question that jumps out at me, having been interested in compression for years, is: can an average entropy in statistics tell us accurately whether this is true in systems with probabilities and distributions that are a mix of mathematical chaos but also contain regularity within their domains? Statistics can be unwieldy beasts, and act as regular sources of misunderstanding and falsehood without a very rigorous approach.

For a similar problem, people may find this article particularly relevant with regards to the chaotic three-body problem in astrophysics:

https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-24/aa49862-24.html
Supermancho 6 months ago
> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.
permo-w 6 months ago
this sounds like a Nigerian Prince scammer set-up
evilotto 6 months ago
This is why we can't have nice things.

Some volunteer puts in time maintaining something and makes a claim that is obviously correct and - most likely in jest - promises cash to anyone who shows it to be wrong. Then some rules lawyer decides that he's better than the unpaid volunteer and does his best to humiliate him, just for the lulz.

Is it any surprise that volunteers rarely stick around for long?

Nowadays lots of companies run bug bounty programs where if you find a bug and report it responsibly you can get real cash. This is a valuable service. And I'm positive that these programs get gobs of submissions from people trying to see if they can find a bug in the rules, rather than in the software. If the people vetting the submissions to such programs weren't getting paid for it, I'm sure they would quit in disgust.
benchmarkist 6 months ago
Take random bytes from /dev/urandom, encrypt them. There is no way any compressor can reduce the size of that file. I'm pretty sure there is no way the second player can win this game. Actually, the encryption part isn't even necessary: there is no compressor that can reduce the size of a random stream of bytes.
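For reference, the counting argument behind that claim (a textbook pigeonhole statement, not something from the email exchange):

```latex
% There are 2^n bit strings of length n, but only 2^n - 1 strings of length
% strictly less than n, so no injective (lossless) compressor can map every
% length-n string to a shorter one:
\[
  \left| \{0,1\}^n \right| = 2^n
  \;>\;
  \sum_{k=0}^{n-1} 2^k = 2^n - 1
  = \Bigl| \bigcup_{k<n} \{0,1\}^k \Bigr|.
\]
% At best, a lossless scheme shortens some strings while necessarily lengthening
% (or leaving unchanged) others; a uniformly random file is therefore a losing
% target on average.
```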
misiek08 6 months ago
I read „decompressor and compressed file”, not „fileS”. There is only one person correct here.