Before jumping into frameworks: if you're lucky enough to have your data stored in an HTML table, pandas can handle it directly:<p><pre><code> import pandas as pd
dfs = pd.read_html(url)
</code></pre>
Here ‘dfs’ is a list of DataFrames, one for each HTML table on the page.<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html" rel="nofollow">https://pandas.pydata.org/pandas-docs/stable/reference/api/p...</a>
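If the page has several tables, read_html can also narrow things down for you, for example by matching text inside the table (the URL below is just a placeholder):<p><pre><code> import pandas as pd

url = "https://example.com/stats"            # placeholder
# keep only tables whose text matches "Population"
dfs = pd.read_html(url, match="Population")
df = dfs[0]
print(df.head())
</code></pre>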
I've been involved in many web scraper jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used Scrapy, with XML tooling for navigating the DOM.<p>It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an EC2 instance; it checks its tags to see if it should execute the script, and if so, it does. When it's done with its scraping for the day, it turns itself off.<p>This is a tiny snapshot of why it's been so difficult for me to go from Python 2 to Python 3. I'm strongly in the camp of "if it ain't broke, don't fix it".
One tip I would pass on when trying to scrape data from a website, start by using wget in mirror mode to download the useful pages. It's much faster to iterate on scraping the data once you have it locally. Also, less likely to accidentally kill the site or attract the attention of the host.
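Roughly what that workflow looks like, as a sketch (the wget flags are the standard mirroring ones; paths and the parsing step are made up):<p><pre><code> # one-time mirror, politely:
#   wget --mirror --convert-links --adjust-extension --wait=1 https://example.com/
# then iterate on the parser against the local copy:
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path("example.com").rglob("*.html"):
    soup = BeautifulSoup(path.read_text(errors="ignore"), "html.parser")
    title = soup.title.string if soup.title else None
    print(path, title)
</code></pre>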
My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.<p>I ended up building something like 8 versions of it, using just about every PHP and Python library and resource I could find.<p>I tried httrack, php-ultimate-web-scraper (from GitHub), headless Chromium, headless Selenium, and a few others.<p>By far the biggest problem was dealing with JS links... you wouldn't think from the start that it would be such a big deal, but it was.<p>Selenium with Python turned out to be the winning combination, and of course it was the last one I tried. This is also an ideal project for recursion, although you have to be careful about exit conditions.<p>One thing that was VERY important for performance was not visiting any page more than once, because certain links in headers and footers are duplicated sometimes hundreds of times.<p>JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.<p>It was a super fun project, and in the end, considering I only worked on it for 2 months, I shipped some decent code that was getting about 98.6% of the pages perfectly.<p>The final presentation was interesting... for some reason my client had gotten it into his head that I wasn't a very good programmer, and he ran through his list of sample sites expecting my program to error out or mirror them incorrectly. It handled all 10 sites just about perfectly, and he was rather flabbergasted: he told me it would have taken him a week of hand-clicking through a site to mirror it, but the program did them all in under an hour.
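The "never visit a page twice" part is essentially a visited set wrapped around the crawl. A stripped-down sketch with Selenium (the target URL is hypothetical, and real code needs far better exit conditions and error handling):<p><pre><code> from urllib.parse import urldefrag
from selenium import webdriver
from selenium.webdriver.common.by import By

START = "https://example.com/"      # hypothetical site to mirror
driver = webdriver.Chrome()         # assumes a local Chrome/chromedriver setup
visited = set()

def crawl(url, depth=0):
    url = urldefrag(url).url        # treat #fragment variants as the same page
    if url in visited or depth > 3: # the exit conditions are the important part
        return
    visited.add(url)
    driver.get(url)
    # ...save driver.page_source to disk here...
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    for href in links:
        if href and href.startswith(START):
            crawl(href, depth + 1)

crawl(START)
driver.quit()
</code></pre>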
I think this article does an OK job covering how to scrape websites rendered server-side, but I strongly discourage people from scraping SPAs with a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website since you don't need to request a bunch of other resources, reduces the overall bandwidth cost, and greatly speeds up the runtime of your script since you don't spend time running JavaScript in a headless browser. That can be especially slow if your script has to click or otherwise interact with elements on the page to get the results you need.<p>Other than that, I'd strongly caution anyone considering parallel requests. Always keep in mind the sysadmins and engineers behind the site you're targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can cause massive server load for the site you're targeting. If that isn't enough to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if too many bad actors hit it heavily.
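In practice that usually boils down to replaying one request you spotted in the network tab; the endpoint and parameters below are invented purely for illustration:<p><pre><code> import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-research-bot/0.1 (contact@example.com)"})

# hypothetical JSON endpoint discovered in the browser's network tab
resp = session.get(
    "https://example.com/api/v2/listings",
    params={"page": 1, "per_page": 50},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["results"]:   # response shape depends on the actual API
    print(item["id"], item.get("title"))
</code></pre>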
It’s fun to combine Jupyter notebooks and Python scraping. If you are working 15 pages/screens deep, you can “stay at the coal face” and not have to rerun the whole script after making a change to the latest step.
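The trick is simply keeping the expensive fetch in its own cell, so re-running the parsing cell costs nothing; a minimal sketch (URL and selector are made up):<p><pre><code> # cell 1: fetch once and keep the raw HTML in memory
import requests
html = requests.get("https://example.com/page15", timeout=30).text

# cell 2: iterate on the parsing as often as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.results tr")
len(rows)
</code></pre>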
I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools, including Scrapy), I was surprised how hard it was to do at scale. The open source options I could find were hard or impossible to get working, overly complex, badly documented, etc.<p>The hosted services I found were reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting this up yourself, especially when you need to run these jobs every month or so, and even if you have to write some code to make the open source solutions actually work.
One thing I notice with all blog articles, and HN comments, on scraping is that they always omit the actual use case, i.e., the specific website that someone is trying to scrape. Any examples tend to be so trivial as to be practically meaningless. They do not prove anything.<p>If authors did name websites they wanted to scrape, or show tests on actual websites, then we might see others come forward with different solutions. Some of them might beat the ones being put forward by the pre-packaged software libraries/frameworks and commercial scraping services built on them, e.g., less brittle, faster, less code, easier to repair.<p>We will never know.
In my career I found several reasons not to use regular expressions for parsing an HTML response, but the biggest was this: a regex may work for 'properly formed' documents, but you would be surprised how lax browsers are about requiring documents to be well-formed. Your regex, unless it is written to handle them specifically, will not cope with sites like that (and there are a lot of them, at least in my experience). You may be able to work 'edge cases' into your regex, but good luck finding anyone other than the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't working (and there will be a LOT of those cases with HTML/XML documents).<p>It is honestly almost never worth it unless you have constraints on what packages you can use and you MUST use regular expressions. Just do your future self a favor and use BeautifulSoup or some other package designed to parse the tree-like structure of these documents.<p>One way regex can be used appropriately is finding a pattern in the document without caring where it sits w.r.t. the rest of the document. But even then, do you really want to match: <!-- <div> --> ?
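The commented-out markup case is easy to demonstrate: a regex happily "finds" the div inside the comment, while a real parser does not treat it as an element.<p><pre><code> import re
from bs4 import BeautifulSoup

html = "<body><!-- <div>old layout</div> --><div>real content</div></body>"

print(re.findall(r"<div>(.*?)</div>", html))
# ['old layout', 'real content']   <- the regex can't tell the comment apart

soup = BeautifulSoup(html, "html.parser")
print([d.get_text() for d in soup.find_all("div")])
# ['real content']
</code></pre>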
I have been developing scrapers and crawlers and writing[1] about them for many years, and I have used many Python-based libs so far, including Selenium. I have written such scrapers for individuals and startups for several purposes. The biggest issues I faced were rendering dynamic sites and getting IPs blocked in the absence of proxies, which are not cheap at all, especially for individuals.<p>Services like Scrapingbee and ScraperAPI serve quite well for such problems. I personally liked ScraperAPI for rendering dynamic websites due to the better response time.<p>Shameless plug: in case anyone is interested, a long time back I wrote about it on my blog, which you can read here[2]. Now you do not need to set up a remote Chrome instance or anything; all that is required is to hit an API endpoint to fetch content from a dynamic, JS-rendered website.<p>[1] <a href="http://blog.adnansiddiqi.me/tag/scraping/" rel="nofollow">http://blog.adnansiddiqi.me/tag/scraping/</a><p>[2] <a href="http://blog.adnansiddiqi.me/scraping-dynamic-websites-using-scraper-api-and-python/" rel="nofollow">http://blog.adnansiddiqi.me/scraping-dynamic-websites-using-...</a>
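The pattern with these services is essentially a GET proxy. Roughly (parameter names are from memory, so check the provider's docs rather than trusting this sketch):<p><pre><code> import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com/some-js-heavy-page",
    "render": "true",   # ask the service to execute JS before returning the HTML
}
resp = requests.get("http://api.scraperapi.com", params=payload, timeout=60)
resp.raise_for_status()
html = resp.text
</code></pre>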
Been scraping for a long time. If handling JS isn't a requirement, XPath is 100% the way to go. It's a standard query language, very powerful, and there are great browser extensions for helping you write queries.
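For anyone who hasn't tried it, the lxml version is only a few lines; the XPath expressions here are obviously site-specific placeholders:<p><pre><code> import requests
from lxml import html

doc = html.fromstring(requests.get("https://example.com/products", timeout=30).content)

# made-up selectors: every product title and price inside a listing card
titles = doc.xpath('//div[@class="product-card"]//h2/text()')
prices = doc.xpath('//div[@class="product-card"]//span[@class="price"]/text()')
for title, price in zip(titles, prices):
    print(title.strip(), price.strip())
</code></pre>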
Fetching HTML and then navigating the parsed result (or using regexps) is what used to work 20 years ago.
These days, with all these reactive javascript frameworks you better skip to item number 5: headless browsing.
Also mind that Facebook, Instagram, ... will have anti-scraping measures in place. It's a race ;)
Is web scraping going to continue to be a viable thing, now that the web is mainly an app delivery platform rather than a content delivery platform?<p>Can you scrape a webasm site?
I have been web scraping for almost 4 years now. That is my entire niche.<p>The problem with web scraping is that you never really know where the ethical line is. These days I will reverse engineer a website to minimize the request load and only target specific API endpoints. But then again, I am bypassing some of their security measures while doing that.
I do web scraping for fun and profit, primarily using Python. Wrote a post some time back about it.<p><a href="https://www.kashifaziz.me/web-scraping-python-beautifulsoup.html/" rel="nofollow">https://www.kashifaziz.me/web-scraping-python-beautifulsoup....</a>
Is there a SOTA library for common web scraping issues at scale (especially distributed over a cluster of nodes): CAPTCHA detection, IP rotation, rate throttling, queue management, etc.?
Personally I have not needed BeautifulSoup a single time when web scraping. People say it is better for unclean HTML, which I cannot confirm, because I never needed it and was always able to get my results using LXML + etree with XPath and CSS selectors. Once I also used Scrapy, but still not BeautifulSoup. I am glad there is a guide that starts with LXML instead of immediately jumping to BeautifulSoup.
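For reference, the CSS selector route in lxml goes through the cssselect package; the selectors below are made-up examples:<p><pre><code> from lxml import html

tree = html.fromstring(open("page.html", "rb").read())

# CSS selectors (requires the cssselect package)
for link in tree.cssselect("div.article > h3 a"):
    print(link.text_content().strip(), link.get("href"))

# roughly the same thing as XPath
for link in tree.xpath('//div[contains(@class, "article")]/h3//a'):
    print(link.text_content().strip(), link.get("href"))
</code></pre>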
We created a fun side project to grab the index page of every domain - we downloaded a list of approx 200m domains. However, we ran into problems when our provider complained. It was something to do with the DNS side of things and we were told to run our own DNS server. If there is anyone on here with experience of crawling across this number of domain names it would be great to talk!
The biggest struggle I had while building web scrapers was scaling Selenium. If you need to launch Selenium hundreds of thousands of times per month, you need a lot of computing power, which is really expensive on EC2.<p>A couple of years ago I discovered browserless.io, which does this job for you, and it's amazing. I really don't know how they made this, but it just scales without any limit.
I recently undertook my first scraping project, and after trying a number of things I landed on Scrapy.<p>It's been a blessing. Not only can it handle difficult sites, but it's also super quick to write another spider for the easy sites that provide the JSON blob in a handy single API call.<p>The only problem I had was getting around Cloudflare; I tried a few things like Puppeteer but had no luck.
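For the "JSON blob in a single API call" case, a whole Scrapy spider really can be just a few lines; the URL and field names below are placeholders:<p><pre><code> import scrapy

class ListingsSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://example.com/api/listings?page=1"]

    def parse(self, response):
        data = response.json()            # Scrapy >= 2.2
        for item in data["results"]:
            yield {"id": item["id"], "title": item.get("title")}
        # follow pagination if the API exposes it
        next_url = data.get("next")
        if next_url:
            yield response.follow(next_url, callback=self.parse)
</code></pre>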
For data extraction I highly recommend weboob. Despite the unfortunate name, it does some really cool stuff. Writing modules is quite straightforward and the structure they've chosen makes a lot of sense.<p>I do wish there was a Go version of it, mostly because I much prefer working with Go, but also because single binary is extremely useful.
I really appreciate the tips in the comments here.<p>As a beginner it makes a lot of sense to iterate on a local copy with jupyter rather than fetching resources over and over until you get it right. I wish more tutorials focused on this workflow.
I've always had pretty bad experiences with web scraping; it's such a pain in the ass and frequently breaks. I'm not sure if I'm doing it wrong or if that's just how it's supposed to be.
Aside from the Beautiful Soup library, is there something about Python that makes it a better choice for web scraping than languages such as Java, JavaScript, Go, Perl or even C#?
Does anyone know how could I script Save Page WE extension in Firefox? It does a really nice job of saving the page as it looks, including dynamic content.
PyPpeteer might be worth a look as well. It's basically a port of the JS Puppeteer project that drives headless Chrome via the DevTools API.<p>As mentioned elsewhere, anything other than a headless browser isn't useful beyond a fairly narrow scope these days.<p><a href="https://github.com/pyppeteer/pyppeteer" rel="nofollow">https://github.com/pyppeteer/pyppeteer</a>
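Basic usage looks roughly like this (written from memory and untested; the project README has the canonical example):<p><pre><code> import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto("https://example.com", {"waitUntil": "networkidle2"})
    html = await page.content()   # the fully rendered DOM
    print(len(html))
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
</code></pre>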