Web Scraping with Beautiful Soup (2014)

78 points by xcoding, more than 8 years ago

6 comments

Animats, more than 8 years ago
Cute. That's what BeautifulSoup is good for.

I'm a longtime user of BeautifulSoup.

BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses JavaScript to manipulate the DOM on page load, it does.

I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all JavaScript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.

Some examples:

HN looks good: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com

AFL-CIO, the site used in the article, looks great: http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org

Twitter's images disappear: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com

Adobe's formatting disappears: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com

Intel complains about the browser but looks OK: http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com

Grubhub gives us nothing as plain HTML: http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com

Same for Doordash: http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com

(No scraping restaurant menus with BeautifulSoup.)

Cool stuff in pure CSS works fine: http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/

(You don't really need JavaScript any more just to get the page up.)
madenine, more than 8 years ago
My web scraping toolkit (Python):

- Beautiful Soup
- Requests Lib
- JSON Lib
- Selenium
- Urllib2
- Cookielib

Handles 99% of things I encounter. With dynamic sites, you're often better off simulating the request than controlling the browser.

Then you get to parse your way through someone's annoyingly formatted JSON.
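"Simulating the request" usually means finding the JSON endpoint the page's own JavaScript calls and requesting it directly. Below is a minimal sketch using only the Python 3 standard library (`urllib.request` is the Python 3 successor to the `urllib2` listed above). The endpoint URL, header values, and the helper names `build_api_request`, `fetch_json`, and `find_values` are all hypothetical; a real site will need whatever headers and parameters its own requests carry.

```python
import json
from urllib.request import Request, urlopen


def build_api_request(url, referer):
    """Build a request that resembles the page's own background (XHR) call."""
    headers = {
        "Accept": "application/json",
        "Referer": referer,
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    }
    return Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (needs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_values(node, key):
    """Yield every value stored under `key`, however deeply the JSON is nested."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Offline demonstration with a made-up payload instead of a live endpoint.
payload = json.loads(
    '{"data": {"items": [{"name": "Soup", "meta": {"name": "inner"}}, {"name": "Salad"}]}}'
)
print(list(find_values(payload, "name")))
```

A recursive walker like `find_values` is one way to cope with awkwardly nested responses when you only care about a few keys and do not want to hard-code the full path.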
pryelluw, more than 8 years ago
I enjoy using Scrapy because it allows for a bit more functionality. Check it out if Beautiful Soup is too simple for your needs.
Ed10101, more than 8 years ago
Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't really a Scrapy vs. BS4 divide, as you can still use the library with Scrapy in place of its standard parsing functionality. It also works well with lxml, which is considerably faster.

It's also possible to build very performant, large-scale crawlers with just BS4 and requests. Managing the architecture is a bit of a pain, but it definitely can be done.

There are also a number of cases where it's better to use BS4 than Scrapy.

Also, combining PhantomJS, Selenium and BS4 can provide you with a very powerful data scraping solution.
prions, more than 8 years ago
I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely a great tool, but it had a few annoying functionality issues.

I had some preprocessing using decompose(), which can take lists of attributes. However, it cannot decompose multiple attributes at once, such as anchors with a certain title and table elements, in one call. It felt cumbersome to call it multiple times.

It seems like Scrapy does exactly what I needed though (click on the first link in the article), which would have saved me a ton of headache. So it goes.
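One way to remove several kinds of elements in a single pass is to hand `find_all()` a predicate function and decompose whatever it returns. In the bs4 API, `decompose()` itself takes no arguments, so the filtering happens in the search step. The sketch below is an assumption about the kind of cleanup described, not the commenter's code: `strip_elements`, `unwanted`, and the `title="Edit"` value are hypothetical. It skips matches nested inside other matches, since those disappear along with their ancestor.

```python
from bs4 import BeautifulSoup


def strip_elements(soup, predicate):
    """Decompose every tag matching `predicate`; return how many were removed directly."""
    matches = soup.find_all(predicate)
    doomed = {id(tag) for tag in matches}
    # Keep only top-most matches: a match inside another match goes away with its ancestor.
    top_level = [
        tag for tag in matches
        if not any(id(parent) in doomed for parent in tag.parents)
    ]
    for tag in top_level:
        tag.decompose()
    return len(top_level)


def unwanted(tag):
    """Hypothetical rule: drop all tables and any anchor whose title is 'Edit'."""
    return tag.name == "table" or (tag.name == "a" and tag.get("title") == "Edit")


html = (
    '<div><a title="Edit" href="#">edit</a>'
    '<p>Keep <a href="/x">this</a></p>'
    '<table><tr><td><a title="Edit">e</a></td></tr></table></div>'
)
soup = BeautifulSoup(html, "html.parser")
removed = strip_elements(soup, unwanted)
print(removed, soup)
```

The predicate can combine any conditions on tag name and attributes, so adding another kind of element to strip means editing one function rather than adding another `find_all()` loop.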
geooooooooobox, more than 8 years ago
NEWBIES BEWARE!!! USE THE LXML PARSER ... screw the inbuilt html parser
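For anyone following the lxml advice: BeautifulSoup picks its parser from the second constructor argument, and lxml has to be installed separately. A small sketch that prefers lxml and falls back to the built-in `html.parser` when lxml is missing is shown below; the helper name `make_soup` and the default parser order are illustrative choices, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser library is not available; try the next one.
            continue
    raise FeatureNotFound("none of the requested parsers is installed: %r" % (preferred,))


soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.get_text())
```

Different parsers can build slightly different trees from malformed HTML, so it is worth fixing the parser explicitly rather than relying on whichever one happens to be installed.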