Just for fun, I decided to pass the output.js file through Google Closure Compiler's advanced optimizations. It does a surprisingly good job at reconstructing part of the strings.<p><pre><code> % npx google-closure-compiler -O ADVANCED --js output.js
(()=>{})["co"+(1/0+[])[4]+"structor"]("co"+(1/0+[])[4]+...
</code></pre>
Not pasting the full thing, but it reduces the output.js file from ~118 KiB to ~9.92 KiB, which is pretty good!<p>There is technically not much stopping the compiler from inferring that 1/0 === Infinity, recognizing that (1/0+[])[4] is free of side effects, and eventually concluding it's safe to substitute the whole expression with "n". Google Closure already has optimizations for string concatenation, so if it were able to perform an optimization pass over Infinity, then it would also be able to emit the string "constructor" instead of "co"+(1/0+[])[4]+"structor".
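<p>For what it's worth, the evaluation such a pass would have to prove safe is short (a sketch; every step is a pure, side-effect-free expression under standard JS semantics):<p><pre><code>const step1 = 1 / 0;       // Infinity
const step2 = step1 + [];  // "Infinity" ([] coerces to "", then string concatenation)
const step3 = step2[4];    // "n" (the fifth character of "Infinity")

console.log("co" + step3 + "structor" === "constructor"); // true
</code></pre>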
I'm actually surprised at how insanely terrible the "compression" is.<p>Looking at the table at the end, I'm not surprised at all that the "weird" obfuscated code is ~2000x the size of the original source.<p>But I <i>am</i> surprised that the gzipped weird code is still ~25x the size of the original source, as opposed to ~0.25x for gzipping the original source.<p>After all, the amount of information in the weird code should still ultimately be approximately the same as in the original source code, right? Or maybe double or something like that. I'm very surprised it's <i>twenty-five times as much</i>.<p>The only reason I can guess is that the "weird" process results in information structures that are represented in an extremely <i>hierarchical</i> way, and gzip is built for <i>stream</i> compression, so it's unable to find/represent/compress hierarchical structures?<p>And if that's the case, it makes me wonder if there <i>are</i> any compression algorithms which are able to handle that better? Ones that might not be based on "dictionary words/sequences" as much, but rather attempt to find "nestable/repeatable syntax patterns"?
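<p>One cheap experiment (not the article's setup; just a sketch with Node's built-in zlib, using a made-up stand-in for the weird encoder): compare gzip against Brotli, whose much larger window and context modelling should cope better with long-range repetition:<p><pre><code>const zlib = require('node:zlib');

// Made-up stand-in for the weird encoder: blow each source character up into
// a long, repetitive bracket expression (NOT the article's real transform).
const expand = ch => '(![]+[])[+!+[]]'.repeat((ch.charCodeAt(0) % 7) + 1);
const source = 'function add(a, b) { return a + b; }\n'.repeat(200);
const weird  = [...source].map(expand).join('+');

for (const [name, text] of [['source', source], ['weird', weird]]) {
  console.log(name,
    'raw:',    Buffer.byteLength(text),
    'gzip:',   zlib.gzipSync(text).length,
    'brotli:', zlib.brotliCompressSync(text).length);
}
</code></pre>
A grammar-based compressor (something Sequitur-like) would be the more direct test of the "nestable patterns" idea, but Brotli is the low-effort comparison.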
1) The title is clickbait, but<p>2) Thanks for leading with an example of a negative result! That's what any researcher faces every day, unlike what gets published, after all.
It's not a huge surprise that gzip, as a general-purpose compression algorithm, didn't compress this down any further. I do wonder about a compressor trained specifically on these characters, though, and on the patterns that tend to emerge from the weird compiler. Maybe the chunks at a certain scale would be predictable and thus compressible.<p>Of course, at that point you're probably more interested in a common binary format, and should start thinking about wasm instead.
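<p>A crude approximation of "training" that you can try today (a sketch, assuming Node.js and the output.js from the article): prime DEFLATE with a preset dictionary of fragments that recur in the weird output. zlib's deflate/inflate support this, plain gzip does not, and the dictionary only helps within the first 32 KiB window, so don't expect miracles:<p><pre><code>const fs = require('node:fs');
const zlib = require('node:zlib');

// Hypothetical dictionary, seeded with fragments visible in the compiled output;
// a real one would be mined from many samples of the weird compiler's output.
const dictionary = Buffer.from('"co"+(1/0+[])[4]+"structor"(()=>{})');

const weird = fs.readFileSync('output.js');

console.log('plain deflate: ', zlib.deflateSync(weird).length);
console.log('primed deflate:', zlib.deflateSync(weird, { dictionary }).length);
</code></pre>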
Maybe OT, but does anyone know if there is runtime analysis of JS and/or wasm?<p>Obfuscation is a serious threat to the open web, and things like fingerprinting can be incredibly invasive.<p>Web browsers typically only support static “prettifying” (i.e. auto-indent). I’ve seen websites probe for Chrome extensions, canvas and all kinds of APIs. Deobfuscators are often not enough to restore legibility. (I assume disassemblers are similar, but I’ve never tried.)<p>I would love to have trace/intercept/breakpoint support for any external APIs a page calls, in order to restore a sense of control over what code websites run. Ideally, integrated in the browser. With WASM gaining popularity, this will become (much) more important, imo.
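<p>In the meantime you can get a rough version of this from the console or a userscript: wrap the APIs you care about so every call logs a stack trace (a sketch; the APIs below are just examples, and a determined page can detect or bypass wrappers like this, which is why a real in-browser facility would be better):<p><pre><code>function trace(obj, name) {
  const original = obj[name];
  obj[name] = function (...args) {
    console.trace(`[trace] ${name}`, args); // logs call site + arguments
    return original.apply(this, args);
  };
}

trace(HTMLCanvasElement.prototype, 'toDataURL');           // canvas fingerprinting
trace(CanvasRenderingContext2D.prototype, 'getImageData');
trace(Navigator.prototype, 'sendBeacon');                  // beacon-style exfiltration
trace(window, 'fetch');
</code></pre>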
It might be that GZIP isn’t actually a good format with which to try to compress this data. I would think a compression algorithm that expects a rather large 8-bit character space wouldn’t be very suitable for a 4-bit space.
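<p>That's testable with a few lines of Node (a sketch, assuming the weird output really uses at most 16 distinct characters): pack each symbol into 4 bits before gzipping and compare. DEFLATE's Huffman stage already adapts to the symbol distribution, so this may change little, but it's a quick way to check:<p><pre><code>const fs = require('node:fs');
const zlib = require('node:zlib');

const weird = fs.readFileSync('output.js', 'utf8');
const alphabet = [...new Set(weird)];
if (alphabet.length > 16) throw new Error('more than 16 distinct characters');

const code = new Map(alphabet.map((c, i) => [c, i]));
const packed = Buffer.alloc(Math.ceil(weird.length / 2));
for (let i = 0; i < weird.length; i++) {
  packed[i >> 1] |= code.get(weird[i]) << (i % 2 ? 0 : 4); // two symbols per byte
}

console.log('gzip(text):  ', zlib.gzipSync(weird).length);
console.log('gzip(packed):', zlib.gzipSync(packed).length);
</code></pre>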
> So, yeah - this isn't a good idea. If the Weird transpiler only changes the encoding of each character with a really weird equivalent, it makes a lot of sense that it doesn't compress better than the source one - the ideal scenario would be to compress the same.<p>If your results disagree with your premise in unsurprising and obvious ways - please start your article with that so I can stop reading it.