YouTube-dl has an interpreter for a subset of JavaScript in 870 lines of Python

473 points by yuuta over 2 years ago

20 comments

lolinder over 2 years ago

To be clear, this is an extremely tiny subset of JS. It looks like they only implemented the features needed to run a very specific function. For example, the only symbol allowed after "new" is "Date"; everything else throws an exception.

It's still fun that it's there, but it's not as big a deal as it sounds from the tweet.
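
To make the "only Date is allowed after new" point concrete, here is a minimal sketch of that kind of allow-list evaluator. It is not the youtube-dl code: the names `eval_new` and `UnsupportedJS`, and the choice to accept only a millisecond timestamp argument, are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern: matches `new Name(args)` and nothing else.
_NEW_RE = re.compile(r'new\s+(?P<name>[a-zA-Z_$][\w$]*)\s*\((?P<args>[^)]*)\)')

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class UnsupportedJS(Exception):
    """Raised for any construct this toy evaluator does not know."""


def eval_new(expr):
    """Evaluate `new Date(<milliseconds>)`; reject every other `new` expression."""
    m = _NEW_RE.fullmatch(expr.strip())
    if m is None:
        raise UnsupportedJS('not a `new` expression: %r' % expr)
    if m.group('name') != 'Date':
        # Allow-list of exactly one constructor, as the comment describes.
        raise UnsupportedJS('unsupported constructor: %s' % m.group('name'))
    args = m.group('args').strip()
    if not args.isdigit():
        # Assumption: only the integer-milliseconds form is handled here.
        raise UnsupportedJS('unsupported Date arguments: %r' % args)
    return _EPOCH + timedelta(milliseconds=int(args))
```

Calling `eval_new("new Date(0)")` gives the Unix epoch as a timezone-aware `datetime`, while `eval_new("new Map()")` raises `UnsupportedJS`. Failing loudly on anything outside the allow-list is what keeps a partial interpreter like this honest about its limits.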

Uptrenda over 2 years ago

Anyone who has ever pulled a website from a script knows the pain that is JavaScript. Normally you want to just get some text and work out the API actions, but a lot of sites use horribly obfuscated JavaScript -- either because that's what modern web development is (lolz) -- or because it's part of their 'security.' That means if you want to write browser-based bots properly, you ought to use a browser. There are special browsers that run 'headlessly' or are designed mostly for bot use, like https://www.selenium.dev/ which plugs into a few different 'browser engines.'

But now you have another problem. Your simple script goes from being a small, simple, self-contained, and elegant gem to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python, you frankly don't have very good packaging, so it's hard to string all that together into a solution and have it magically work for everyone. What YouTube-dl have done is good engineering. Even though it's not a full JS interpreter, they've kept their software lean, self-contained, and easier to use.

delusional over 2 years ago

Can we stop the trend of linking to tweets that just contain another link to the content? What's the point? Wouldn't this be 10x better if it was a link directly to the GitHub?

sylware over 2 years ago

Nowadays "javascript" refers to the scriptable, grotesquely and absurdly complex and massive web engines, aka Google-financed Blink and Gecko, then Apple-financed WebKit, together with their SDKs.

The currently obfuscated javascript media players will try to break yt-dlp by leveraging the complexity and size of those scripted web engines. They will put them out of reach of small teams or individuals, and, even "better", it will force people to use the Apple or Google web engines, killing any attempt to provide a real alternative.

A standalone javascript interpreter is actually some work, but seems to stay in the "reasonable" realm: look at quickjs from M. Bellard and friends (the guy who created qemu, ffmpeg, tinycc, etc): plain and simple C (no need for a C++ compiler), doing the job more than well enough.

That's why noscript/basic (x)html is so important.

esprehn over 2 years ago

This isn't really JS; it's a purpose-built evaluator that's only for evaluating a particular script on YouTube, assuming a huge list of things are true about how YouTube's JS is written.

E.g., it's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.

It's sort of the sandwich categorization problem:

If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P

I say "not a sandwich".
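
A small sketch of the "wrong type semantics" point: Python's `operator` functions follow Python's typing rules, not JavaScript's coercion rules. The helpers `js_like_add` and `js_to_string` below are hypothetical, cover only strings, booleans, and finite numbers, and are not taken from youtube-dl.

```python
import operator


def js_to_string(value):
    """Very rough stand-in for JS ToString (strings, booleans, finite numbers only)."""
    if isinstance(value, bool):
        return 'true' if value else 'false'
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # JS prints 2.0 as "2"
    return str(value)


def js_like_add(a, b):
    """Approximate JS `+`: if either operand is a string, concatenate."""
    if isinstance(a, str) or isinstance(b, str):
        return js_to_string(a) + js_to_string(b)
    return a + b


def python_add_fails(a, b):
    """Report whether Python's operator.add rejects the operands outright."""
    try:
        operator.add(a, b)
    except TypeError:
        return True
    return False
```

In JavaScript, `"1" + 2` evaluates to `"12"`. Here `python_add_fails("1", 2)` is `True` because `operator.add` raises `TypeError`, whereas `js_like_add("1", 2)` returns `"12"`. An evaluator that maps JS operators straight onto `operator` only works as long as the script it runs never relies on coercion like this.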

haunter over 2 years ago

The same in yt-dlp: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/jsinterp.py

Interesting to see the diffcheck between the two: https://www.diffchecker.com/8EJGN27K

kristopolous over 2 years ago

To understand why: I have a far simpler tool that focuses on a subset of sites (adult content video aggregators): https://github.com/kristopolous/tube-get

It too deals with this problem, but does so in a way that'd be easy to maliciously sabotage.

Look right about here: https://github.com/kristopolous/tube-get/blob/master/tube-get.py#L111

As to why this program exists: it was originally written between about 2010 and 2015 or so, and technically predates the yt-* ecosystem.

The tool still works fine, and it's not a strict subset of yt-dlp or YouTube-dl because it takes a different approach. Although its overall site coverage is smaller, I've used it as a "second try" system when yt-* fails, and it succeeds maybe about half the time.

aeyes over 2 years ago

They just don't want to use any external dependencies... There is also an AES implementation: https://github.com/ytdl-org/youtube-dl/blob/master/youtube_dl/aes.py

M30 over 2 years ago

How should a programming noob interpret this? Be impressed at what was achieved here? Be concerned about security implications using the tool? Something else entirely?

lewisl9029 over 2 years ago

Another really cool JS dialect I recently learned about is njs from the nginx team: https://github.com/nginx/njs

This video goes into some of the design and tradeoffs: https://www.youtube.com/watch?v=Jc_L6UffFOs

TL;DW: they optimized for fast creation/destruction of low-footprint VMs with no JIT or garbage collection.

homarp over 2 years ago

The tests for it: https://github.com/ytdl-org/youtube-dl/blob/master/test/test_jsinterp.py

olliej over 2 years ago

This is super cool.

Some of the stuff is kind of questionable to me, in the sense that I believe you could probably make some kind of sufficiently wonky JS that this would do the "wrong" thing.

But it's super cool that they are able to do this, as I think it shows that claims of JS complexity based on the size of JS engines overlook just how much of that size/complexity comes from the "make it fast" drive vs. what the language requires. Here you have a <1000 LoC implementation of the core of the JS language, removed from things like regex engines, GCs, etc.

Mad props to them for even attempting it as well. It simply would never have occurred to me to say "let's just write a small JS engine", and I would have spent stupid amounts of time attempting to use JSC* from Python instead.

[* JSC appears to be the only JS engine with a pure C API, and the API and ABI are stable, so on iOS/macOS at least you can just use the system one, which reduces binary size + build annoyance. The downside is that C is terrible, and C++ (differently terrible? :D) APIs make for much more pleasant interfaces to the VM: constructors + destructors mean that you get automatic lifetime management so handles to objects aren't miserable, and you can have templates that allow your API to provide handles that have real type information. JSC only has JSValueRef and JSObjectRef, and as a JSObjectRef is a JSValueRef it's actually just a typedef to const JSValueRef :D OTOH, I do think JSC's partially conservative GC for stack/temporary variables is superior to Handles for the most part, but it's also absolutely necessary to have an API that isn't absolutely wretched. The real problem with JSC's API is that it has not got any love for many many many .... many years, so it doesn't have any way to handle or interact with many modern features without some kludgy wrappers where you push your API objects into JS and have the JS code wrap them up. The API objects are also super slow, as they basically get treated as "oh ffs" objects that obey no rules. I really do wish it would get updated to something more pleasant and really usable.]

jraph over 2 years ago

I do wonder why YouTube does not try harder to make it difficult to do this computation meant to prove you are a legit YouTube web client. Providing an easy-to-find, simple JS function interpretable with 900 lines of Python is like they don't try at all. They might as well do nothing.

Or is their goal just to make youtube-dl not 100% reliable? Or to be able to say "look, you are running our code in a way we did not intend, you can't do this because you are breaking the EULA"?

mdaniel over 2 years ago

I was expecting this to be about Duktape <https://github.com/svaarala/duktape>, but heh, for sure no. I'd bet $1 there's no way youtube-dl would switch, but I wonder if yt-dlp would?

rcarmo over 2 years ago

Awesome. Even if it's likely incomplete, it might come in really handy for some scraping I need to do...

Too over 2 years ago

They must have been inspired by this PyCon presentation, where David Beazley live codes a fully working WebAssembly interpreter in under one hour. https://youtu.be/VUT386_GKI8

atan2 over 2 years ago

This seems to be a pretty small subset of JavaScript, but I personally love small projects like this for educational purposes. Removing the noise and keeping things minimal helps my brain reason about things.

Earlier this year I enrolled in an online class called "Building a Programming Language" taught by Roberto Ierusalimschy (creator of Lua) and Gustavo Pezzi (creator of pikuma.com). We created a toy language interpreter/VM and the final code was around 1,800 lines of Lua. Keeping things as simple (and sometimes naive) as possible was definitely the right choice for me to really wrap my head around the basic theory and connect the dots.

Thanks for the link.

Tao3300 over 2 years ago

Greenspun's Tenth Rule:

> Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. [1]

And here we have a complicated Python program with a partial JS implementation in it.

[1] https://en.wikipedia.org/wiki/Greenspun's_tenth_rule

anony23 over 2 years ago

What purpose does it serve?

tonetheman over 2 years ago

If this got much bigger, I would switch it to quickjs.