How to use undocumented web APIs

239 points by pingiun, about 3 years ago

26 comments

kall, about 3 years ago

When they have a GraphQL API with introspection enabled, it feels like discovering a pot of gold.

This happens more often than you would expect, even without any auth sometimes. At that point you're basically developing with the same DX as internal developers.

My theory is people just turn off the GraphiQL endpoint on their GraphQL server and think they have hidden the schema, not realizing any external tool can do the introspection. Either that, or it's developers slipping a little something under the radar for other developers (same thing with source maps).

Another tip: if the service in question has a mobile app, sniffing the traffic on that with a MITM proxy can yield more interesting results than a web app.
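For the curious, a minimal sketch of such an introspection probe in Python - the endpoint URL is a placeholder, and the query is the standard GraphQL introspection query trimmed to type names:

    import requests

    # Standard GraphQL introspection query, trimmed to just type names.
    INTROSPECTION_QUERY = """
    query {
      __schema {
        queryType { name }
        types { name kind }
      }
    }
    """

    # Placeholder endpoint; introspection is a plain POST like any other query.
    resp = requests.post(
        "https://example.com/graphql",
        json={"query": INTROSPECTION_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    for t in resp.json()["data"]["__schema"]["types"]:
        print(t["kind"], t["name"])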
dec0dedab0de, about 3 years ago

I said this on a thread complaining about SPAs a little while ago, but I love that the SPA trend has caused all kinds of web apps to open up APIs to their users. It's not as fun as pure screen scraping, but it is very exciting when you figure out whatever weird behavior they're expecting and it starts working.

If you get stuck, look at their JavaScript and see what it is doing. Double-check your network requests in developer tools - some of them might be more important than you think - plus it's so nice that we don't have to use Burp for this anymore. Some sites check referrers and user agents, or expect a field from a specific server-rendered page to be added to a header. More than one expected a JavaScript-style timestamp on every request.

The weirdest behavior comes from older apps that started out purely server-rendered and slowly added a dynamic frontend. I always cringe when it's obvious that different developers were given tasks over the years and completed them without bothering to learn the rest of the system.
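A sketch of what replaying one of those quirky requests can look like; the URL, parameter name, and header values here are invented for illustration:

    import time
    import requests

    headers = {
        # Copy these out of the devtools network tab for the real site.
        "User-Agent": "Mozilla/5.0 ...",
        "Referer": "https://example.com/app",
    }
    # Some backends expect a JavaScript-style epoch-milliseconds timestamp.
    params = {"_": int(time.time() * 1000)}

    r = requests.get("https://example.com/api/items", headers=headers, params=params)
    print(r.status_code)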
benmmurphy, about 3 years ago

> I think there's literally no way for the backend to tell that the request isn't sent by my browser and is actually being sent by a random Python program.

Oh my sweet summer child. Unfortunately, there is a whole industry built around this. This is a great blog discussing different detection methods: https://incolumitas.com/
cameroncairns, about 3 years ago

Really great techniques listed in this thread! I wanted to point out, though, that it's generally nicer to the website owner if you enable `Accept-Encoding: gzip, deflate`. The difference in bandwidth charges for the site owner is quite significant, especially if you want to do comprehensive crawls.

Yes, go ahead and disable that header when piping curl's output into `less`; however, when converting the curl request into Python, just remember to re-add that header. Pretty much every Python library I've used to handle web requests will automatically unzip the response from the server, so you don't need to futz about with the zipping/unzipping logic yourself.
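A sketch with a placeholder URL - requests actually sends `Accept-Encoding: gzip, deflate` by default and transparently decompresses the body, so re-adding the header after trimming costs you nothing:

    import requests

    r = requests.get(
        "https://example.com/api/catalog",
        headers={"Accept-Encoding": "gzip, deflate"},
    )
    print(r.headers.get("Content-Encoding"))  # e.g. "gzip" on the wire
    data = r.json()  # body is already decompressed for you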
captn3m0, about 3 years ago

I love doing this, especially to liberate content that is otherwise locked away in an app-only world. That's one important use case that I'd love more people to work on - it is a great way to start with reverse engineering and building simple websites.

Pro tip: if the undocumented API has a "CORS: *" header, you can call these APIs directly from the browser on your domain, without having to proxy them or use curl.

As an example, I published https://captnemo.in/plugo/ this week, which calls the Plugo.io private API (the ones used by the mobile app) to fetch the data and publishes it using GitHub Pages. The data is just a list of places where Plugo provides powerbanks on rent (500+ locations, mostly concentrated across 3 Indian cities, and 2 places in Germany somehow). I'm running a simple curl command on a scheduled GitHub Action that commits back to itself so the data remains updated.

I similarly did this to make a nocode frontend for another "Clubhouse alternative" which would keep recordings but only provide them in-app. A friend wanted to listen to his prior recordings, but the app was too cumbersome, so I made an alternative frontend that would call the private API and render a simple table with MP4 links for all recordings.

I even use this as a "nocode testing ground" [1] for many of the new nocode apps on the market - seeing if they are feasible enough to build fully functional frontends on top of existing APIs (which would be great for someone like me).

As a bonus, this works as an alternative-data stream for i) Plugo's growth metrics, if you were an investor or interested in the "rent-a-powerbank" space, as well as ii) finding cool new places to visit around you.

[1]: https://news.ycombinator.com/item?id=29243536
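A quick way to check for that wide-open CORS header from Python before building a static frontend on top of an API (the URL is a placeholder):

    import requests

    r = requests.get("https://api.example.com/v1/locations")
    acao = r.headers.get("Access-Control-Allow-Origin")
    if acao == "*":
        print("Wide-open CORS: a page on any domain can fetch this directly.")
    else:
        print("Locked down (or no CORS header):", acao)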
1vuio0pswjnm7, about 3 years ago

"The answer is sort of yes - browsers aren't magic! All the information browsers send to your backend is just HTTP requests. So if I copy all of the HTTP headers that my browser is sending, *I think there's literally no way* for the backend to tell that the request isn't sent by my browser and is actually being sent by a random Python program."

There is a way.^1 One might need to copy the static elements of the TLS Client Hello in addition to certain HTTP headers.

1. https://blog.squarelemon.com/tls-fingerprinting/

See, e.g., https://github.com/refraction-networking/utls

"problem 1: expiring session cookies

One big problem here is that I'm using my Google session cookie for authentication, so this script will stop working whenever my browser session expires.

That means that this approach wouldn't work for a long running program (I'd want to use a real API), but if I just need to quickly grab a little bit of data as a 1-time thing, it can work great!"

Sometimes Google keeps users logged in. For example, session cookies in Gmail will last for months or more. This makes it easy to check Gmail from the command line without a browser. It also means that if someone steals a session cookie and the user never logs out, e.g., she closes the browser without logging out first,^2 then the thief can access the account for months, or longer.

2. Of course, it is also possible to log out and disable specific session cookies from the command line, without a browser.
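On the TLS-fingerprinting point, uTLS is Go; on the Python side, the third-party curl_cffi package can impersonate a real browser's Client Hello. A sketch, assuming its impersonate API as I recall it (check the project's docs):

    # pip install curl_cffi
    from curl_cffi import requests as creq

    # Sends a Chrome-like TLS Client Hello instead of Python's default one.
    r = creq.get("https://example.com/api", impersonate="chrome")
    print(r.status_code)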
isbvhodnvemrwvn, about 3 years ago

I would also add that any search boxes are typically keys to the kingdom if you're scraping shops, job boards, or similar things. They are often not hardened, so you can submit, e.g., an empty query (even if the frontend doesn't allow it), or effectively disable pagination by requesting 1000000 results per page.
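A sketch against a hypothetical, unhardened search endpoint - the endpoint and parameter names are invented for illustration:

    import requests

    # Empty query plus an absurd page size often returns the whole dataset.
    r = requests.get(
        "https://shop.example.com/api/search",
        params={"q": "", "page": 1, "per_page": 1_000_000},
    )
    results = r.json().get("results", [])
    print(len(results), "items in one request")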
01acheru, about 3 years ago

I need to point something out to people doing this kind of thing to other people's websites/webapps/whatever:

Having done this multiple times, be aware that you can break other people's stuff by messing up requests. Most web APIs suck, and some won't behave nicely on unexpected failures.

1. When trying to automate a process on an energy management platform, I ended up creating resources under some kind of master account; some things broke and they had to manually clean the DB.

2. When trying to access an operation I couldn't do via the provided API, I reverse engineered the API of their admin dashboard. It sucked really badly, with a lot of strange sync tokens that felt like going back 20 years. Anyway, my implementation wasn't perfect, and it ground their platform to a halt.

I could go on, so please only do stuff like this if you're in contact with the people on the other side. If you're not, limit yourself to GETs.
octoberfranklin, about 3 years ago

> there's literally no way for the backend to tell that the request isn't sent by my browser and is actually being sent by a random Python program.

This is wrong, and the fact that somebody clearly experienced in web development is totally unaware that it is wrong should be a clear sign of the danger.

For starters: TLS fingerprinting, ETag fingerprinting (including subtle browser-to-browser differences in how ETags are cached and evicted), JS VM fingerprinting, timing side channels - there is a massive list here. And then there's wasm...
cehrlich, about 3 years ago

Undocumented APIs are great when you only need to use them for a short amount of time, but if you try to build anything long term on top of them you should keep in mind that there could be changes that completely break your stuff, unannounced, at any time.

kjgkjhfkjf, about 3 years ago

It's more robust not to remove the extra headers, IMO. Otherwise you give an unnecessary signal to the backend that the traffic's not coming from the expected sources.

It also makes the process of writing your code more mechanical, which is useful since you'll likely have to redo the process when the API changes.
1vuio0pswjnm7, about 3 years ago

"I usually just figure out which headers I can delete with trial and error - I keep removing headers until the request starts failing. In general you probably don't need Accept*, Referer, Sec-*, DNT, User-Agent, and caching headers though."

IME, this "header minimisation" works for almost any website, or "endpoint". IOW, it is useful outside of "APIs". As a matter of practice, I minimise headers automatically with a forward proxy.^1

Thus, one can send less data to "tech" companies and still receive the same results. We know that data received by "tech" companies is used at every opportunity to support surveillance and online advertising. The most well-known example is perhaps "fingerprinting". Given a choice between sending more data or less data to "tech" companies, which is the choice that, in the aggregate,^2 lends itself better to increased surveillance and online advertising?

If the author here can send fewer headers and still get the desired result, then it stands to reason that sending those extra headers benefits someone else besides the user. Send more data, not less, to make surveillance and online advertising easier. "Tech" companies will often defend data collection by suggesting that data supplied in headers is used to "improve the user experience" or some such, and this may well be true in many cases, but the "fingerprinting" example shows how there can also be another purpose. Data can be multi-purpose.

1. An added benefit is that one does not need to fiddle with the browser to copy HTTP headers,^3 as they are all easily accessible in the proxy logs.

2. Here, "in the aggregate" means "if every user makes the same choice".

3. The online advertising company or its business partner (e.g., Mozilla) could change the browser, without notice, at any time.
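The trial-and-error minimisation is easy to automate; a sketch with a placeholder URL and header set:

    import requests

    url = "https://example.com/api/data"
    headers = {
        "User-Agent": "Mozilla/5.0 ...",
        "Accept": "application/json",
        "Referer": "https://example.com/app",
        "DNT": "1",
    }

    # Drop one header at a time; keep the drop if the request still works.
    for name in list(headers):
        trial = {k: v for k, v in headers.items() if k != name}
        if requests.get(url, headers=trial).ok:
            headers = trial

    print("minimal header set:", sorted(headers))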
SahAssar, about 3 years ago

I do a bit of scraping for hobby projects, and much of that comes down to basically this (but I do it in node instead of python). Sometimes you need to use jsdom or puppeteer, but the second step (after checking if there are official data dumps made available or some official API) is always checking the full data flow in devtools if there is some undocumented way to more quickly get the raw data I want.

simonw, about 3 years ago

A trick that works great for me: filter the browser network pane by XHR, then sort by size - this usually ends up with the most interesting JSON responses listed at the top.

gfd, about 3 years ago

I found puppeteer very nice to script against if you need a real headless browser:

https://github.com/puppeteer/puppeteer
helsinki, about 3 years ago

You would be surprised to find out that some web servers are capable of detecting browser emulation through curl or Python’s requests lib. Try programmatically scrolling through Instagram photos. It will work if you use curl, but it will not work using Python’s requests lib. Not sure how they detect it - maybe related to timing of packets.

theblazehen, about 3 years ago

If you still use the website via a browser, I find https://github.com/richardpenman/browsercookie/ is great for working around the expiring-cookie problem.
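A sketch of how that looks, as I understand the library's API (worth verifying against its README):

    # pip install browsercookie
    import browsercookie
    import requests

    # Reads cookies straight out of your browser profile, so the script
    # reuses whatever session the browser already has.
    cookies = browsercookie.firefox()  # or browsercookie.chrome()
    r = requests.get("https://example.com/api/me", cookies=cookies)
    print(r.status_code)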
don-code, about 3 years ago

While I've successfully used this method for public APIs, I ran into an interesting one not long ago: one where authentication is performed _by IP address_.

I have a switch (I think a TP-Link TL-SG1016PE) with PoE - and a finicky PoE device that periodically needs a reboot - so I figured I'd replay turning the port on and off in the web interface. Notably, logging in does not issue me any authentication token, but I can still turn the port on and off - and can still do it via `curl`, too. But as soon as I try it from another machine? Access denied!

(Yes, I could just fake the login process the same way, but that was more work than I had time for.)
burnished, about 3 years ago

>> If I'm using a small website, there's a chance that my little Python script could take down their service because it's doing way more requests than they're able to handle. So when I'm doing this I try to be respectful and not make too many requests too quickly.

What is a reasonable rate to send requests? I've done a little scraping and I wanted to do the same thing, but I realized I had no idea what would be considered acceptable use and what would be unacceptable. If anyone has a heuristic they like to use, I'm all ears.
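One conservative pattern (a judgment call, not an official rule): roughly one request per second, backing off exponentially whenever the server signals trouble. A sketch:

    import time
    import requests

    def polite_get(url, delay=1.0, max_retries=5):
        backoff = delay
        for _ in range(max_retries):
            r = requests.get(url)
            if r.status_code not in (429, 500, 502, 503):
                time.sleep(delay)  # baseline pacing between requests
                return r
            time.sleep(backoff)  # server is struggling; slow down
            backoff *= 2
        r.raise_for_status()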
getcrunk, about 3 years ago

So I just checked the WhatsApp web app. No network activity whatsoever on a fully loaded page that has incoming messages, and then a bunch of error messages in the console about web sockets and source maps. How did they pull that off? Does Chrome not show WebSocket activity or service worker activity in the network tab?
joshstrange, about 3 years ago

It's always a joy when you start to reverse engineer an undocumented API and find out it is cleaner/nicer than some paid APIs you've used. Paprika (cloud sync for recipes/other data) was an example of that for me. Their API is (was - it's been a minute since I last looked at it) super RESTful and really easy to reason about, more or less just simple CRUD.
slaymaker1907, about 3 years ago

The "copy as cURL" trick is a great idea! It makes it easy to get a succinct summary of the components of the request, including how they are doing auth. If the API in question belongs to a desktop app, Fiddler can be a great alternative. Obviously Wireshark can see more, but Fiddler is a lot easier to use and set up, in my experience.
moron4hire, about 3 years ago

Small nitpick on the comments about removing the headers that the browser request had made.

You probably don't want `Accept: */*`. If the value of Accept is anything other than */*, then you probably want to keep it.
jeffrallen, about 3 years ago

Julia is really an excellent teacher.

tkanarsky, about 3 years ago

I used this approach last year to run a Twitter bot that would report when local pharmacies had 'rona vaccine appointments open up. I scraped the APIs of CVS, Rite-Aid, Walgreens, and a few other chains this way. Although I didn't go fancy and try to distill the API down to the bare minimum headers - I just called into cURL from Python with that giant command as a string.
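A sketch of that "giant curl command from Python" approach, with a placeholder URL and headers:

    import json
    import subprocess

    # Paste the command copied from devtools here, split into argv form.
    cmd = [
        "curl", "-s",
        "https://www.example.com/api/appointments",
        "-H", "Accept: application/json",
        "-H", "User-Agent: Mozilla/5.0 ...",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(json.loads(out.stdout))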
ipnon, about 3 years ago

gobuster is an effective way to quickly enumerate subdomains and their directories.

https://github.com/OJ/gobuster