Web scraping with your web browser: Why not?

150 points by 8chanAnon 7 months ago

Includes working code. First article in a planned series.

26 comments

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

Completely agree with this sentiment.

I just spent the last couple of months developing a Chrome extension, but recently also did an unrelated web scraping project where I looked into all the common tools like Beautiful Soup, Selenium, Playwright, Puppeteer, etc.

All of these tools were needlessly complicated and I was having a ton of trouble with sites that required authentication. I then realized it would be way easier to write some JavaScript and paste it in my browser to do the scraping. Worked like a charm!

smallerfish 7 months ago

I wrote a prototype of a browser extension that scraped your bookmarks + 1 degree, and indexed everything into an in-memory search index (which gets persisted in localStorage). I took over the new tab page with a simple search UI, with instant type-ahead search.

Rough aspects:

a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.

b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. Maybe it would be a fit for PouchDB or something else that makes IndexedDB tolerable. (Or WASM SQLite, if it's mature enough.)

c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.

d) Some walled garden sites seem completely unscrapable (even in the browser), e.g. LinkedIn.
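
For readers unfamiliar with the "rate limiting and back-off" mentioned in (c), here is a minimal sketch of one common form, capped exponential back-off. It is not the extension's actual code; the function name `backoff_delays` and all of its parameters are hypothetical, chosen only for illustration.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5, jitter=0.0,
                   rng=random.random):
    """Return the number of seconds to wait before each retry.

    All parameter names and defaults are illustrative assumptions.
    """
    delays = []
    for attempt in range(retries):
        # Grow the wait geometrically, but never beyond `cap`.
        delay = min(cap, base * factor ** attempt)
        # Optional random jitter so requests do not arrive in lockstep.
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A scraper would sleep for the next delay after each failed or throttled request and reset to the first delay after a success.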

gmac 7 months ago

Yes: I find it surprising that this isn't a more widespread approach. It's how I've taught web scraping to my PhD students for some years.

https://github.com/jawj/web-scraping-for-researchers

hildenae 7 months ago

I understand that "with/in your web browser" implies an extension or similar, but I have good experience using Selenium and Python to scrape websites. Some sites are trickier than others, and when you are instrumenting a browser it easily triggers bot prevention, but you are also able to easily scrape pages that build the DOM using JS and similar. I have considered, but not looked into, compiling my own Firefox to disable e.g. navigator.webdriver, but it feels like a bit too much work.

This is my project for extracting my (your) webshop order & item data: https://gitlab.com/Kagee/webshop-order-scraper

simlan 7 months ago

I also did something similar for my spring project. The idea was to buy a used car, and I was frustrated with the BS the listing sites claimed as fair price etc.

I went the browser extension route and used Greasemonkey to inject custom JavaScript. I patched window.fetch, and because it was a React page it did most of the work for me, providing me with a slightly convoluted JSON doc every time I scrolled. Getting the data extracted was only a question of getting a Flask API with correct CORS settings running.

Thanks for posting. Using a local proxy for even more control could be helpful in the future.
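
To illustrate the receiving side of the setup described above (a local API with "correct CORS settings" that accepts JSON posted from the page), here is a minimal sketch. It uses Python's standard-library `http.server` rather than Flask so that it has no dependencies, and it is not the commenter's code. The origin `https://listings.example`, the port, and the names `cors_headers`, `Collector` and `make_server` are all assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical origin of the page the injected script runs on.
ALLOWED_ORIGINS = {"https://listings.example"}
collected = []  # JSON documents received so far


def cors_headers(origin, allowed=ALLOWED_ORIGINS):
    """Return the CORS response headers for a request coming from `origin`."""
    if origin not in allowed:
        # Without these headers the browser rejects the cross-origin request.
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "POST, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type",
        "Vary": "Origin",
    }


class Collector(BaseHTTPRequestHandler):
    def _reply(self, status):
        self.send_response(status)
        for name, value in cors_headers(self.headers.get("Origin")).items():
            self.send_header(name, value)
        self.end_headers()

    def do_OPTIONS(self):
        # Answer the preflight request the browser sends before a JSON POST.
        self._reply(204)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        collected.append(json.loads(self.rfile.read(length)))
        self._reply(204)


def make_server(port=8765):
    """Bind to localhost only; the port number is an arbitrary choice."""
    return HTTPServer(("127.0.0.1", port), Collector)
```

Calling `make_server().serve_forever()` starts the collector. The injected script would then POST each intercepted JSON document to `http://127.0.0.1:8765/` with a `Content-Type: application/json` header.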

linsomniac 7 months ago

There is an extension called "Amazon Order History Reporter" that will scrape Amazon to download your order history. I've used it a couple times and it works brilliantly.

seanwilson 7 months ago

> So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.

I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that does scraping within the browser, I think, and other automation.

ggorlen 7 months ago

I wrote a similar post on in-browser scraping: https://serpapi.com/blog/dynamic-scraping-without-libraries/

My approach is a step or two more automated (optionally using a userscript and a backend) and runs in the console on the site under automation rather than cross-origin, as shown in OP.

In addition to being simple for one-off scripts and avoiding the learning curve of Selenium, Playwright or Puppeteer, scraping in-browser avoids a good deal of potential bot detection issues, and is useful for constantly polling a site to wait for something to happen (for example, a specific message or article to appear).

You can still use a backend and write to file, trigger an email or SMS, etc. Just have your userscript make requests to a server you're running.

gabrielsroka 7 months ago

Why do you need a proxy or to worry about CORS? Why not just point your browser to rumble.com and start from there?

I've posted here about scraping, for example, HN with JavaScript. It's certainly not a new idea.

2020: https://news.ycombinator.com/item?id=22788236

ljw1004 7 months ago

In my web scraping I've gravitated towards the "cheerio" library for JavaScript.

I kind of don't want to use DOMParser because it's browser-only... my web scrapers have to evolve every few years as the underlying web pages change, so I really want CI tests, so it's easiest to have something that works in Node.
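
The point about CI tests comes down to keeping the parsing step independent of the browser so it can be run against a saved copy of the page. Here is a minimal sketch of that idea, written in Python with the standard-library `html.parser` instead of cheerio, so it shows the same pattern with a different tool. The fixture markup and the names `LinkCollector` and `parse_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# A saved snippet of the page acts as the test fixture (made up here).
FIXTURE = (
    '<ul><li><a href="/item/1">First item</a></li>'
    '<li><a href="/item/2">Second <b>item</b></a></li></ul>'
)

print(parse_links(FIXTURE))
```

When the site's markup changes, updating the fixture makes the test fail in CI before the scraper silently returns wrong data.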

datadrivenangel 7 months ago

I've been playing around with this idea lately as well! There are a lot of web interfaces that are hostile to scraping, and I see no reason why we shouldn't be able to use the data we have access to for our own purposes. CUSTOMIZE YOUR INTERFACES

flashgordon 7 months ago

Ah, I remember doing this almost 20 years ago and even rotating through 1500 proxies to not get tripped up by DDoS detectors :). A plugin is one of the ways to scrape, as it also looks like a human (i.e. more JS run, more divs loaded and so on).

turingfeel 7 months ago

If you want to get your personal IP and fingerprint blacklisted across major providers and large ranges, unfortunately this is how you do it. Just keep the rates low.

acheong08 7 months ago

I actually did that with a Firefox extension + containers to scrape ChatGPT a long while back (before the APIs).

https://github.com/acheong08/ChatGPT-API-agent

Worked pretty well, but browsers took up too much memory per tab, so automating thousands of accounts (what I wanted) was infeasible.

pimlottc 7 months ago

When I have to do some really quick ad-hoc webscraping, I often just select all text on the page, copy it, and then switch to a terminal window where I build a pipeline that extracts the part I need (using pbpaste to access the clipboard). Very quick and dirty for when you just need to hit a few pages.
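
As an illustration of the extraction step in such a copy-and-paste pipeline, here is a minimal sketch that pulls dollar prices out of pasted page text. The comment does not say what is being extracted or which tools the pipeline uses, so the pattern, the sample text, and the name `extract_prices` are all assumptions.

```python
import re

# Matches amounts such as $15, $1,299 or $1,299.00 (an illustrative pattern).
PRICE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_prices(text):
    """Return every price-like string found in the pasted text."""
    return PRICE.findall(text)


# Stand-in for text copied from a page; real input would come from the clipboard.
SAMPLE = """Widget A  $1,299.00  In stock
Widget B  $15  Ships in 2 days
Shipping: free"""

for price in extract_prices(SAMPLE):
    print(price)
```

To run it on the clipboard instead, replace `SAMPLE` with `sys.stdin.read()` (after `import sys`) and pipe the clipboard in, for example `pbpaste | python3 extract.py`, where the script name is hypothetical.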

ricardo81 7 months ago

I've found using a local proxy helps when using Puppeteer and a proxy. The way Chrome authenticates to a proxy keeps the connection open, which can sometimes mess up rotating proxy endpoints, and having to close/re-open browsers per page is just too inefficient.

changing1999 7 months ago

> can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?

My guess would be that some companies are doing it (I worked at a major tech company that is/was), just not publicizing this fact as crawling/scraping is such a gray legal area.

chaosharmonic 7 months ago

> You can find plenty of tutorials on the Internet about the art of web scraping... and the first things you will learn about are Python and Beautiful Soup. There is no tutorial on web scraping with Javascript in a web browser...

Um... [0]

[0] https://bhmt.dev/blog/scraping

dewey 7 months ago

I've read through that (hard to read, because of the bad formatting) but I still don't understand why you would do that instead of Playwright, Puppeteer etc. - The only reason seems to be "This technique certainly has its limits.".

spullara 7 months ago

I love it when something like this reminds me of a project from forever ago...

https://github.com/spullara/browsercrawler

nsonha 7 months ago

Sorry, the format of this site is just too annoying for me to bother to read it. If this is about the shocking revelation that you can paste some code into the browser console, aka manually extracting information and then manually putting it into whatever workflow you need that information for, then I don't think that is called web scraping; it's just browsing the web with code.

micahdeath 7 months ago

Excel/Word macro using a WebBrowser object in a Form (old IE did this nicely; I haven't done that since Edge came out).

deisteve 7 months ago

Is there anything that runs on WASM for scraping? The issue is that you need to enable flags and turn off other security features to scrape in your web browser, and this is why it's not popular, but with WASM that might change.

ttshaw1 7 months ago

How is this different from scraping in, say, Selenium in non-headless mode?

welder 7 months ago

Neo already did that in the Matrix: https://www.youtube.com/watch?v=sjoad6gcRzs

squigz 7 months ago

This horrendous color scheme makes it impossible for me to read this.