Cute. That's what BeautifulSoup is good for.<p>I'm a longtime user of BeautifulSoup.<p>BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses JavaScript to manipulate the DOM on page load, it does.<p>I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all JavaScript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.<p>Some examples:<p>HN looks good: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.y...</a><p>AFL-CIO, the site used in the article, looks great: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org</a><p>Twitter's images disappear: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.tw...</a><p>Adobe's formatting disappears: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.ad...</a><p>Intel complains about the browser but looks OK: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com</a><p>Grubhub gives us nothing as plain HTML: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com" 
rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com</a><p>Same for Doordash: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com</a><p>(No scraping restaurant menus with BeautifulSoup.)<p>Cool stuff in pure CSS works fine: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawe...</a><p>(You don't really need JavaScript any more just to get the page up.)
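<p>For anyone who wants to try the same kind of check locally, here is a rough sketch of the pipeline described above (parse, discard scripts, make links absolute, re-emit indented HTML). It is not the viewer's actual code. The names `clean_html`, `view_page`, and `URL_ATTRS` are made up for illustration, and the list of URL-bearing attributes is deliberately incomplete. It defaults to Python's built-in `html.parser` so it runs with only `bs4` installed; passing `parser="html5lib"` (which needs the separate html5lib package) should be closer to the HTML5 parsing mentioned above.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Attributes that commonly hold URLs (illustrative, not exhaustive).
URL_ATTRS = ("href", "src", "action")


def clean_html(html, base_url, parser="html.parser"):
    """Parse html, drop JavaScript, make links absolute, return indented HTML."""
    soup = BeautifulSoup(html, parser)

    # Discard <script> elements together with their contents.
    for tag in soup.find_all("script"):
        tag.decompose()

    for tag in soup.find_all(True):
        # Drop inline event handlers such as onclick or onload.
        for attr in [a for a in tag.attrs if a.lower().startswith("on")]:
            del tag[attr]
        for attr in URL_ATTRS:
            if tag.has_attr(attr):
                value = tag[attr]
                if value.strip().lower().startswith("javascript:"):
                    # A javascript: pseudo-link is useless without a browser.
                    del tag[attr]
                else:
                    tag[attr] = urljoin(base_url, value)

    # prettify() returns an indented str; encode it as UTF-8 when writing out.
    return soup.prettify()


def view_page(url):
    """Fetch url and return the cleaned, script-free HTML as a str."""
    with urlopen(url) as resp:
        raw = resp.read()
        final_url = resp.geturl()  # base for relative links after redirects
    return clean_html(raw, final_url)
```

If the output of `view_page("https://news.ycombinator.com")` still contains the content you are after, plain scraping will probably do. If it comes back as a nearly empty shell, the page is being built by JavaScript and you are in program-controlled-browser territory.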