Ask YC: Blog parsing (WordPress, Typepad, Blogger)

2 points by samson, over 16 years ago

I'm trying to develop a crawler that knows when the page being sent to it is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.

I want these pages -> http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/

Not this -> http://1vibe.net/category/behind-the-scenes/
Not this -> http://1vibe.net/2008/11/
Not this -> http://1vibe.net/tag/50-cent/

From the blog post page I want to grab the title and date of that post.

The way I was trying to do it was to look through the DOM of the site and look for consistency. I found consistency in Blogger and Typepad, but WordPress was all over the place in its formatting from site to site.

So I figure I must have been doing it wrong, and that using XML, RDF, or feeds is the intelligent way of doing it.

I'd appreciate it if anyone could help (also, I'm doing it in PHP).

2 comments

raquo, over 16 years ago

If you are interested only in new posts, you can look in the blogs' RSS feeds. They are nearly always in default locations.

Or you could parse the URL. I had a similar task some time ago, and I went with URLs. Blogger and Typepad are consistent; WordPress depends on the blog, of course, but you could figure out the several most popular patterns (e.g. /yyyy/mm/dd/posttitle, /id-posttitle) and get something like 90% of all blogs right.

Or maybe, just maybe, you could use some third party that has already figured it out via RSS. Maybe Technorati?
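The URL approach above could be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, and everything in it is an assumption: the function name `looks_like_post`, the list of non-post prefixes, and the regular expressions are hypothetical and cover only the two patterns mentioned in the comment plus the /category/posttitle shape of the 1vibe.net example. A real crawler would need a longer pattern list tuned against actual blogs.

```python
import re
from urllib.parse import urlparse

# Assumed first path segments that usually mark listing pages, not posts.
NON_POST_PREFIXES = {"category", "tag", "search", "author", "page", "feed"}

# Date archives such as /2008/, /2008/11/ or /2008/11/25/.
DATE_ARCHIVE = re.compile(r"/\d{4}(/\d{1,2}){0,2}/?")

# A few assumed permalink shapes; extend this list for real use.
POST_PATTERNS = [
    re.compile(r"/\d{4}/\d{2}/\d{2}/[^/]+/?"),  # /yyyy/mm/dd/posttitle
    re.compile(r"/\d+-[^/]+/?"),                # /id-posttitle
    re.compile(r"/[^/]+/[^/]+/?"),              # /category/posttitle
]


def looks_like_post(url: str) -> bool:
    """Guess from the URL path alone whether a page is a single post."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return False  # site index
    if segments[0].lower() in NON_POST_PREFIXES:
        return False  # category, tag, search, ... listings
    if DATE_ARCHIVE.fullmatch(path):
        return False  # calendar archive page
    return any(p.fullmatch(path) for p in POST_PATTERNS)


if __name__ == "__main__":
    for url in [
        "http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/",
        "http://1vibe.net/category/behind-the-scenes/",
        "http://1vibe.net/2008/11/",
        "http://1vibe.net/tag/50-cent/",
    ]:
        print(looks_like_post(url), url)
```

The rejection rules run before the acceptance patterns so that a loose pattern such as /category/posttitle cannot accidentally accept /2008/11/ or /tag/50-cent/.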
Raphael, over 16 years ago
Just parse the URL. Or you can pull in the RSS feed, although that usually only goes back 20 posts.
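For the feed route, a rough sketch of pulling each post's title and date out of an RSS 2.0 document might look like this. It is written in Python rather than PHP, the function name `parse_rss_items` and the sample feed are made up for illustration, and it assumes plain RSS 2.0 with `title`, `link`, and `pubDate` elements. Atom feeds use namespaced elements and would need different handling.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a crawler would download this from the blog.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/2008/11/25/first-post/</link>
      <pubDate>Tue, 25 Nov 2008 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2008/11/26/second-post/</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text: str) -> list:
    """Return one dict per <item> with its title, link and publish date."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # pubDate uses the RFC 822 date format; it may be missing.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
        })
    return posts


if __name__ == "__main__":
    for post in parse_rss_items(SAMPLE_FEED):
        print(post["published"], post["title"], post["link"])
```

Every `link` in the feed is a post URL by construction, so this sidesteps the post-versus-listing question for recent posts, at the cost of only seeing whatever the feed still includes.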