Cute. That's what BeautifulSoup is good for.<p>I'm a longtime user of BeautifulSoup.<p>BeautifulSoup does not use or create "the DOM". It does convert HTML into a tree, but that tree is somewhat different from a browser's Document Object Model. For most screen-scraping purposes, this doesn't matter. But if the page uses JavaScript to manipulate the DOM on page load, it does.<p>I have a tool for looking at a web page through BeautifulSoup. This reads the page from a server, parses it into a tree with BeautifulSoup using the HTML5 parser, discards all JavaScript, makes all links absolute, and turns the tree back into HTML in UTF-8, properly indented. If you run a page through this and it still makes sense, scraping will probably work. If not, simple scraping won't work and you'll probably have to use a program-controlled browser that will execute JavaScript.<p>Some examples:<p>HN looks good: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.ycombinator.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://news.y...</a><p>AFL-CIO, the site used in the article, looks great: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=aflcio.org</a><p>Twitter's images disappear: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.twitter.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.tw...</a><p>Adobe's formatting disappears: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.adobe.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=https://www.ad...</a><p>Intel complains about the browser but looks OK: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=intel.com</a><p>Grubhub gives us nothing as plain HTML: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=grubhub.com</a><p>Same for Doordash: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=doordash.com</a><p>(No scraping restaurant menus with BeautifulSoup.)<p>Cool stuff in pure CSS works fine: <a href="http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawenterprises.com/cfimg/" rel="nofollow">http://www.sitetruth.com/fcgi/viewer.fcgi?url=css3.bradshawe...</a><p>(You don't really need JavaScript any more just to get the page up.)
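The pipeline that tool runs can be sketched in a few lines of Python. This is a minimal sketch, not the actual viewer.fcgi code: it uses the stdlib html.parser to stay dependency-light (the tool itself uses the HTML5 parser), and it only absolutizes a few common link attributes.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def clean_page(html, base_url):
    # Parse the page into a tree (the real tool uses the html5lib
    # parser; html.parser keeps this sketch dependency-free).
    soup = BeautifulSoup(html, "html.parser")
    # Discard all JavaScript.
    for script in soup.find_all("script"):
        script.decompose()
    # Make links absolute (a few common attributes, not exhaustive).
    for name, attr in (("a", "href"), ("img", "src"), ("link", "href")):
        for tag in soup.find_all(name, attrs={attr: True}):
            tag[attr] = urljoin(base_url, tag[attr])
    # Turn the tree back into properly indented HTML.
    return soup.prettify()
```

If the output of something like this still reads sensibly, plain BeautifulSoup scraping should work on that site.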
My web scraping toolkit (Python):<p>-Beautiful Soup<p>-Requests<p>-json<p>-Selenium<p>-urllib2<p>-cookielib<p>Handles 99% of what I encounter. With dynamic sites, you're often better off simulating the request than controlling the browser.<p>Then you get to parse your way through someone's annoyingly formatted JSON.
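To illustrate that last point: once you've found a site's internal JSON endpoint in the browser's network tab, the scrape usually reduces to a requests.get() plus a walk through the nesting. A minimal sketch with a made-up payload shape (the endpoint and field names here are hypothetical):

```python
import json
# import requests  # in practice: raw = requests.get(api_url).text

def extract_items(raw):
    # Walk a (hypothetical) deeply nested payload of the kind these
    # endpoints tend to return, pulling out just name and price.
    payload = json.loads(raw)
    return [(item["attrs"]["name"], float(item["attrs"]["price"]))
            for item in payload["data"]["results"]["items"]]

# Stand-in for a response body from some menu API:
raw = ('{"data": {"results": {"items": ['
       '{"attrs": {"name": "Pad Thai", "price": "11.50"}},'
       '{"attrs": {"name": "Spring Rolls", "price": "4.25"}}]}}}')
```

No browser automation needed, as long as the endpoint doesn't require tokens you can't reproduce.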
Absolutely love BeautifulSoup; I use it almost every day in my job. There isn't really a Scrapy vs. BS4 divide, since you can use the library within Scrapy in place of its standard parsing functionality. It also works well with lxml, which is considerably faster.<p>It's also possible to build very performant, large-scale crawlers with just BS4 and Requests. Managing the architecture is a bit of a pain, but it definitely can be done.<p>There are also a number of cases where it's better to use BS4 than Scrapy.<p>Also, combining PhantomJS, Selenium and BS4 can give you a very powerful data scraping solution.
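Swapping in lxml is a one-argument change. A small sketch, falling back to the stdlib parser when lxml isn't installed:

```python
from bs4 import BeautifulSoup

def parse(html):
    # Prefer the C-based lxml parser for speed; fall back to the
    # slower pure-Python html.parser when lxml isn't available.
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:  # bs4 raises FeatureNotFound if lxml is missing
        return BeautifulSoup(html, "html.parser")

soup = parse("<ul><li>fast</li><li>lenient</li></ul>")
```

The rest of your code (find_all, CSS selectors, etc.) is unchanged either way.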
I recently used BeautifulSoup in a Wikipedia scraping project. It's definitely a great tool, but it had a few annoying functionality issues.<p>I had some preprocessing that used decompose(). find_all() can take lists of attribute values, but there's no way to remove elements matching different criteria — say, anchors with a certain title plus all table elements — in one call. Calling it multiple times felt cumbersome.<p>It seems like Scrapy does exactly what I needed (click on the first link in the article), which would have saved me a ton of headache. So it goes.
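One workaround is a small wrapper (a hypothetical helper, not part of the BS4 API) that takes a list of (tag name, attrs) rules and decomposes every match in one pass:

```python
from bs4 import BeautifulSoup

def decompose_all(soup, rules):
    # Remove every element matching any (name, attrs) rule, so mixed
    # criteria -- e.g. anchors with a given title plus all tables --
    # go in one helper call instead of several separate loops.
    for name, attrs in rules:
        for tag in soup.find_all(name, attrs=attrs):
            tag.decompose()
    return soup
```

For the Wikipedia case above, something like decompose_all(soup, [("a", {"title": "Edit section"}), ("table", {})]) would do both removals at once (that title value is just an example).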