Web Scraping with Beautiful Soup (2014)

78 points by xcoding over 8 years ago

6 comments

Animats over 8 years ago
Cute. That's what BeautifulSoup is good for.

I'm a longtime user of BeautifulSoup.

BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses JavaScript to manipulate the DOM on page load, it does.

I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all JavaScript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.

Some examples:

HN looks good: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com

AFL-CIO, the site used in the article, looks great: http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org

Twitter's images disappear: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com

Adobe's formatting disappears: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com

Intel complains about the browser but looks OK: http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com

Grubhub gives us nothing as plain HTML: http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com

Same for Doordash: http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com

(No scraping restaurant menus with BeautifulSoup.)

Cool stuff in pure CSS works fine: http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/

(You don't really need JavaScript any more just to get the page up.)
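
The check described in this comment (parse, drop JavaScript, absolutize links, re-indent) can be roughly sketched as follows. This is not the commenter's actual tool; the function name `static_view` is made up for illustration, and the sketch defaults to the standard-library `html.parser` so it runs without extra packages. Passing `"html5lib"` (if installed) would be closer to the HTML5 parser the comment mentions. The JavaScript removal here covers only `<script>` elements and `on*` attributes, which is an assumption about what "discards all JavaScript" means.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def static_view(html: str, base_url: str, parser: str = "html.parser") -> str:
    """Return the page as indented HTML with scripts removed and links absolute."""
    soup = BeautifulSoup(html, parser)

    # Drop <script> elements entirely, including their contents.
    for script in soup.find_all("script"):
        script.decompose()

    # Drop inline event handlers such as onclick/onload (not exhaustive:
    # javascript: URLs, for example, are left alone in this sketch).
    for tag in soup.find_all(True):
        for attr in [name for name in tag.attrs if name.lower().startswith("on")]:
            del tag[attr]

    # Resolve relative href/src values against the page's own URL.
    for tag in soup.find_all(href=True):
        tag["href"] = urljoin(base_url, tag["href"])
    for tag in soup.find_all(src=True):
        tag["src"] = urljoin(base_url, tag["src"])

    # prettify() returns an indented str; call .encode("utf-8") for bytes.
    return soup.prettify()
```

If the output of something like `static_view(page_html, page_url)` still contains the data you are after, plain BeautifulSoup scraping is likely to be enough for that page.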
madenine over 8 years ago
My web scraping toolkit (Python):

- Beautiful Soup
- Requests Lib
- JSON Lib
- Selenium
- Urllib2
- Cookielib

Handles 99% of things I encounter. With dynamic sites, you're often better off simulating the request than controlling the browser.

Then you get to parse your way through someone's annoyingly formatted JSON.
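
A minimal sketch of "simulating the request", assuming the dynamic page loads its data from a JSON endpoint you found in the browser's network tab. The endpoint, the query parameters, the `X-Requested-With` header and the `data` / `items` / `title` layout are all hypothetical; a real site will differ. It uses the standard-library `urllib.request` (the Python 3 successor to Urllib2), though the Requests library would do the same job.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_request(endpoint: str, params: dict) -> Request:
    """Build the same GET request the page's own JavaScript would send."""
    url = f"{endpoint}?{urlencode(params)}"
    headers = {
        "Accept": "application/json",
        # Hypothetical: some backends expect this header on XHR calls.
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, headers=headers)


def extract_titles(raw: bytes) -> list:
    """Dig the item titles out of an assumed {"data": {"items": [...]}} payload."""
    payload = json.loads(raw.decode("utf-8"))
    items = payload.get("data", {}).get("items", [])
    return [item["title"] for item in items]


def fetch_titles(endpoint: str, params: dict, timeout: float = 10.0) -> list:
    """Send the request and parse the response body."""
    with urlopen(build_api_request(endpoint, params), timeout=timeout) as resp:
        return extract_titles(resp.read())
```

Keeping request construction and JSON parsing in separate functions lets you test the parsing against a saved response without hitting the site.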
pryelluw over 8 years ago
I enjoy using Scrapy because it offers a bit more functionality. Check it out if Beautiful Soup is too simple for your needs.
Ed10101 over 8 years ago
Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't really a Scrapy vs. BS4 divide, as you can still use the library with Scrapy in place of its standard parsing functionality. It also works well with lxml, which is considerably faster.

It's also possible to build very performant, large-scale crawlers with just BS4 and requests, though managing the architecture is a bit of a pain.

There are also a number of cases where it's better to use BS4 than Scrapy.

Also, combining PhantomJS, Selenium and BS4 can give you a very powerful data scraping solution.
prions over 8 years ago
I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely a great tool but it had a few annoying functionality issues.

I had some preprocessing using decompose(), which can take lists of attributes. However, it cannot decompose multiple attributes at once, such as anchors with a certain title and table elements, in one call. It felt cumbersome to call it multiple times.

It seems like Scrapy does exactly what I needed though (click on the first link in the article), which would have saved me a ton of headache. So it goes.
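
One possible way around the repeated calls, offered as a sketch rather than as what the commenter did: as far as I know, `decompose()` itself takes no arguments in bs4, and the selection happens in `find_all()`, which accepts a predicate function. A single predicate can match several unrelated kinds of element at once. The names `is_unwanted` and `strip_unwanted` and the title value `"Edit section"` are made up for illustration.

```python
from bs4 import BeautifulSoup


def is_unwanted(tag) -> bool:
    """Match every <table>, plus anchors carrying a particular title."""
    if tag.name == "table":
        return True
    return tag.name == "a" and tag.get("title") == "Edit section"


def strip_unwanted(html: str) -> BeautifulSoup:
    """Remove all matching elements in a single pass over the tree."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(is_unwanted)

    # Skip matches nested inside another match: destroying the outer
    # element already removes them, so they need no separate call.
    matched_ids = {id(tag) for tag in matches}
    outermost = [
        tag for tag in matches
        if not any(id(parent) in matched_ids for parent in tag.parents)
    ]
    for tag in outermost:
        tag.decompose()
    return soup
```

The predicate keeps all the removal rules in one place, so adding another kind of element to strip is one extra condition rather than another loop.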
geoooooooooobox over 8 years ago
NEWBIES BEWARE!!! USE THE LXML PARSER ... screw the inbuilt html parser
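
For anyone following that advice, the parser is chosen by the second argument to `BeautifulSoup`. A small sketch, assuming lxml may or may not be installed: it tries `"lxml"` first and falls back to the built-in `"html.parser"`. The helper name `make_soup` is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str, preferred=("lxml", "html.parser")):
    """Parse markup with the first available parser; return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises this when the named parser library is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from the same broken HTML.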