科技回声

I'm trying to develop a crawler that can tell when the page it is given is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.

I want pages like this:
http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/

Not this: http://1vibe.net/category/behind-the-scenes/
Not this: http://1vibe.net/2008/11/
Not this: http://1vibe.net/tag/50-cent/

From the blog post page I want to grab the title and date of that post.

The way I was trying to do it was to look through the DOM of the site for consistency. I found consistency in Blogger and Typepad, but WordPress formatting was all over the place from site to site.

So I figure I must have been doing it wrong, and that there is a more intelligent way of doing it using XML, RDF, or feeds.

I'd appreciate it if anyone could help (I'm also doing this in PHP).

2 comments

raquo, more than 16 years ago

If you are interested only in new posts, you can look in blogs' RSS feeds. They are nearly always in default locations.

Or you could parse the URL - I had a similar task some time ago, and I went with URLs - Blogger and Typepad are consistent; WordPress depends on the blog, of course, but you could figure out several most popular patterns (e.g. /yyyy/mm/dd/posttitle, /id-posttitle) and get like 90% of all blogs right.

Or maybe, just maybe, you could use some third parties that have already figured it out via RSS - maybe Technorati?
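
For illustration, the URL-pattern idea could be sketched roughly as follows. This is a minimal sketch in Python, not the commenter's actual code; the pattern lists and the `classify_url` name are assumptions made up for this example, and the `/section/posttitle` pattern is added only because the asker's example post URL has that shape. The same logic should port to PHP with `parse_url` and `preg_match`.

```python
import re
from urllib.parse import urlparse, parse_qs

# Illustrative patterns only; they are assumptions, not a complete list.
LISTING_PATTERNS = [
    re.compile(r"^/$"),                                   # site index
    re.compile(r"^/(category|tag|author|page|search)/"),  # archive and listing pages
    re.compile(r"^/\d{4}/(\d{1,2}/){0,2}$"),              # date archives such as /2008/11/
]
POST_PATTERNS = [
    re.compile(r"^/\d{4}/\d{1,2}/\d{1,2}/[^/]+/$"),       # /yyyy/mm/dd/posttitle
    re.compile(r"^/\d{4}/\d{1,2}/[^/]+/$"),               # /yyyy/mm/posttitle
    re.compile(r"^/\d+-[^/]+/$"),                         # /id-posttitle
    re.compile(r"^/[a-z0-9-]+/[a-z0-9-]+/$"),             # /section/posttitle
]


def classify_url(url):
    """Return 'post', 'listing', or 'unknown' based on the URL alone."""
    parsed = urlparse(url)
    # Normalise so every pattern can expect a lowercase path with a trailing slash.
    path = parsed.path.lower()
    if not path.endswith("/"):
        path += "/"
    # WordPress serves search results through the ?s= query parameter.
    if "s" in parse_qs(parsed.query):
        return "listing"
    # Listing patterns are checked first so /tag/... never counts as a post.
    if any(p.search(path) for p in LISTING_PATTERNS):
        return "listing"
    if any(p.search(path) for p in POST_PATTERNS):
        return "post"
    return "unknown"
```

Checking the listing patterns before the post patterns matters here: otherwise `/tag/50-cent/` would also fit the generic `/section/posttitle` shape. Anything that matches neither list comes back as `unknown`, so those URLs can be handled some other way.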


Raphael, more than 16 years ago

Just parse the URL. Or you can pull in the RSS feed, although that usually only goes back 20 posts.
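
The feed route gives the title and date directly, without touching the page's DOM. A rough sketch, assuming an RSS 2.0 feed with `<item>` elements carrying `<title>`, `<link>`, and `<pubDate>`; the `SAMPLE_RSS` content and the `posts_from_rss` name are made up for this example, and a real crawler would fetch the feed text over HTTP first:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up feed content for illustration; not taken from any real blog.
SAMPLE_RSS = """
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Example post title</title>
      <link>http://example.com/music/example-post-title/</link>
      <pubDate>Sat, 22 Nov 2008 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
""".strip()


def posts_from_rss(xml_text):
    """Extract title, link, and publication date from each RSS 2.0 item."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # pubDate uses the RFC 822 date format; it may be missing.
            "date": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return posts


posts = posts_from_rss(SAMPLE_RSS)
```

As noted above, a feed usually lists only the most recent posts, so this covers new content rather than a blog's full archive. Atom feeds use different element names and namespaces and would need separate handling.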

Ask YC: Blog parsing (WordPress, Typepad, Blogger)

