Cute. That's what BeautifulSoup is good for.<p>I'm a longtime user of BeautifulSoup.<p>BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses JavaScript to manipulate the DOM on page load, it does.<p>I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all JavaScript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.<p>Some examples:<p>HN looks good: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.y...</a><p>AFL-CIO, the site used in the article, looks great: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org</a><p>Twitter's images disappear: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.tw...</a><p>Adobe's formatting disappears: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.ad...</a><p>Intel complains about the browser but looks OK: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com</a><p>Grubhub gives us nothing as plain HTML: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com</a><p>Same for Doordash: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com</a><p>(No scraping restaurant menus with BeautifulSoup.)<p>Cool stuff in pure CSS works fine: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawe...</a><p>(You don't really need JavaScript any more just to get the page up.)
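The pipeline that tool runs can be sketched in a few lines of Python. This is a minimal sketch, not the actual viewer.fcgi code: it uses the stdlib html.parser to stay dependency-light (the tool itself uses the HTML5 parser), and it only absolutizes a few common link attributes.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def clean_page(html, base_url):
    # Parse the page into a tree (the real tool uses the html5lib
    # parser; html.parser keeps this sketch dependency-free).
    soup = BeautifulSoup(html, "html.parser")
    # Discard all JavaScript.
    for script in soup.find_all("script"):
        script.decompose()
    # Make links absolute (a few common attributes, not exhaustive).
    for name, attr in (("a", "href"), ("img", "src"), ("link", "href")):
        for tag in soup.find_all(name, attrs={attr: True}):
            tag[attr] = urljoin(base_url, tag[attr])
    # Turn the tree back into properly indented HTML.
    return soup.prettify()
```

If the output of something like this still reads sensibly, plain BeautifulSoup scraping should work on that site.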
My web scraping toolkit (Python):<p>-Beautiful Soup<p>-Requests<p>-json<p>-Selenium<p>-urllib2<p>-cookielib<p>Handles 99% of what I encounter. With dynamic sites, you're often better off simulating the request than controlling the browser.<p>Then you get to parse your way through someone's annoyingly formatted JSON.
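To illustrate that last point: once you've found a site's internal JSON endpoint in the browser's network tab, the scrape usually reduces to a requests.get() plus a walk through the nesting. A minimal sketch with a made-up payload shape (the endpoint and field names here are hypothetical):

```python
import json
# import requests  # in practice: raw = requests.get(api_url).text

def extract_items(raw):
    # Walk a (hypothetical) deeply nested payload of the kind these
    # endpoints tend to return, pulling out just name and price.
    payload = json.loads(raw)
    return [(item["attrs"]["name"], float(item["attrs"]["price"]))
            for item in payload["data"]["results"]["items"]]

# Stand-in for a response body from some menu API:
raw = ('{"data": {"results": {"items": ['
       '{"attrs": {"name": "Pad Thai", "price": "11.50"}},'
       '{"attrs": {"name": "Spring Rolls", "price": "4.25"}}]}}}')
```

No browser automation needed, as long as the endpoint doesn't require tokens you can't reproduce.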
Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't really a Scrapy vs. BS4 divide, since you can use the library within Scrapy in place of its standard parsing functionality. It also works well with lxml, which is considerably faster.<p>It's also possible to build very performant, large-scale crawlers with just BS4 and Requests. Managing the architecture is a bit of a pain, but it definitely can be done.<p>There are also a number of cases where it's better to use BS4 than Scrapy.<p>Also, combining PhantomJS, Selenium and BS4 can give you a very powerful data scraping solution.
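Swapping in lxml is a one-argument change. A small sketch, falling back to the stdlib parser when lxml isn't installed:

```python
from bs4 import BeautifulSoup

def parse(html):
    # Prefer the C-based lxml parser for speed; fall back to the
    # slower pure-Python html.parser when lxml isn't available.
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:  # bs4 raises FeatureNotFound if lxml is missing
        return BeautifulSoup(html, "html.parser")

soup = parse("<ul><li>fast</li><li>lenient</li></ul>")
```

The rest of your code (find_all, CSS selectors, etc.) is unchanged either way.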
I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely a great tool, but it had a few annoying functionality issues.<p>I had some preprocessing that used decompose(). find_all() can take lists of attribute values, but there's no way to remove elements matching different criteria — say, anchors with a certain title plus all table elements — in one call. Calling it multiple times felt cumbersome.<p>It seems like Scrapy does exactly what I needed (click on the first link in the article), which would have saved me a ton of headache. So it goes.
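One workaround is a small wrapper (a hypothetical helper, not part of the BS4 API) that takes a list of (tag name, attrs) rules and decomposes every match in one pass:

```python
from bs4 import BeautifulSoup

def decompose_all(soup, rules):
    # Remove every element matching any (name, attrs) rule, so mixed
    # criteria -- e.g. anchors with a given title plus all tables --
    # go in one helper call instead of several separate loops.
    for name, attrs in rules:
        for tag in soup.find_all(name, attrs=attrs):
            tag.decompose()
    return soup
```

For the Wikipedia case above, something like decompose_all(soup, [("a", {"title": "Edit section"}), ("table", {})]) would do both removals at once (that title value is just an example).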