科技回声

6 comments

Animats, more than 8 years ago

Cute. That's what BeautifulSoup is good for.

I'm a longtime user of BeautifulSoup. BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses Javascript to manipulate the DOM on page load, it does.

I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all Javascript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.

Some examples:

HN looks good: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com

AFL-CIO, the site used in the article, looks great: http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org

Twitter's images disappear: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com

Adobe's formatting disappears: http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com

Intel complains about the browser but looks OK: http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com

Grubhub gives us nothing as plain HTML: http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com

Same for Doordash: http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com

(No scraping restaurant menus with BeautifulSoup.)

Cool stuff in pure CSS works fine: http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/

(You don't really need Javascript any more just to get the page up.)
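For readers who want to try the same check locally, here is a rough sketch of what such a viewer might do. It is not the code behind the sitetruth.com tool, which is not shown in the comment. The function name `plain_view`, the `URL_ATTRS` list, and the default parser are assumptions made for illustration. The comment mentions the HTML5 parser, which in Beautiful Soup means passing `"html5lib"` (a separate install); the sketch defaults to the standard library's `"html.parser"` so it runs without extra packages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Attributes that commonly hold URLs. This list is an assumption and is
# not exhaustive (srcset, poster, etc. are ignored here).
URL_ATTRS = ("href", "src", "action")


def plain_view(html, base_url, parser="html.parser"):
    """Return script-free, absolutized, indented HTML for a page."""
    soup = BeautifulSoup(html, parser)

    # Discard JavaScript: <script> elements and inline on* event handlers.
    for tag in soup.find_all("script"):
        tag.decompose()
    for tag in soup.find_all(True):
        for attr in [name for name in tag.attrs if name.lower().startswith("on")]:
            del tag[attr]

    # Make every link absolute, relative to the page's own URL.
    for attr in URL_ATTRS:
        for tag in soup.find_all(attrs={attr: True}):
            tag[attr] = urljoin(base_url, tag[attr])

    # prettify() re-serializes the tree with one element per line, indented.
    return soup.prettify()
```

Fetching the page is left out on purpose; any HTTP client can supply the `html` string. If the returned markup still contains the content you are after, plain scraping will probably work for that page.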


madenine, more than 8 years ago

My web scraping toolkit (Python):

- Beautiful Soup
- Requests Lib
- JSON Lib
- Selenium
- Urllib2
- Cookielib

Handles 99% of things I encounter. With dynamic sites, you're often better off simulating the request than controlling the browser.

Then you get to parse your way through someone's annoyingly formatted JSON.
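A minimal sketch of the "simulate the request" approach, assuming you have already found the JSON endpoint a page calls (for example in the browser's network tab). The helper names `fetch_json` and `find_key` are made up for this example. It uses `urllib.request`, the Python 3 successor to the `urllib2` module listed above, so it needs nothing beyond the standard library.

```python
import json
import urllib.request


def fetch_json(url, headers=None):
    """GET a JSON endpoint directly, the way the page's own script would."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the same key may appear at deeper levels.
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)
```

Something like `list(find_key(fetch_json(endpoint_url), "name"))` then pulls one field out of an awkwardly nested response without hard-coding its whole structure; `endpoint_url` stands for whatever URL the site actually requests.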

pryelluw, more than 8 years ago

I enjoy using Scrapy because it allows for a bit more functionality. Check it out if Beautiful Soup is too simple for your needs.


Ed10101, more than 8 years ago

Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't really a Scrapy vs. BS4 divide, as you can still use the library with Scrapy in place of its standard parsing functionality. It also works well with lxml, which is considerably faster.

It's also possible to build very performant, large-scale crawlers with just BS4 and requests. Managing the architecture is a bit of a pain, but it definitely can be done.

There are also a number of cases where it's better to use BS4 than Scrapy.

Also, combining PhantomJS, Selenium, and BS4 can provide you with a very powerful data scraping solution.

prions, more than 8 years ago

I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely a great tool, but it had a few annoying functionality issues.

I had some preprocessing using decompose(), which can take lists of attributes. However, it cannot decompose multiple attributes at once, such as anchors with a certain title and table elements, in one call. It felt cumbersome to call it multiple times.

It seems like Scrapy does exactly what I needed, though (click on the first link in the article), which would have saved me a ton of headache. So it goes.
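One possible way around the repeated calls, sketched under a few assumptions. As far as I know, `decompose()` itself takes no arguments and the filtering happens in the lookup (`find_all()` or `select()`), so a CSS selector group can match several unrelated kinds of element in one query. The helper name `strip_elements` and the example selector are illustrative only. Selector groups need Beautiful Soup 4.7 or later, and the `decomposed` property needs 4.9.0 or later.

```python
from bs4 import BeautifulSoup


def strip_elements(soup, selector):
    """Remove every element matching a CSS selector group; return the count."""
    removed = 0
    for tag in soup.select(selector):
        # A match nested inside an earlier match (an anchor inside a removed
        # table, say) has already been destroyed along with its ancestor.
        if tag.decomposed:
            continue
        tag.decompose()
        removed += 1
    return removed
```

For the case described above, a call such as `strip_elements(soup, 'a[title="Edit"], table')` drops the titled anchors and every table in one pass; the title value here is a placeholder.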

geooooooooobox, more than 8 years ago

NEWBIES BEWARE!!! USE THE LXML PARSER ... screw the inbuilt html parser
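In practice that means naming the parser explicitly when building the soup. A small sketch, assuming lxml may or may not be installed; the helper name `make_soup` and the fallback order are my own choices, not something from the thread. Beautiful Soup raises `FeatureNotFound` when the requested parser is missing, which is what the loop relies on.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser library is not available; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")
```

Different parsers can repair broken markup differently, so pinning one parser per project keeps results reproducible across machines.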

Web Scraping with Beautiful Soup (2014)
