I'm trying to develop a crawler that can tell when the page it is given is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.<p>I want pages like this -> http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/<p>Not this -> http://1vibe.net/category/behind-the-scenes/
Not this -> http://1vibe.net/2008/11/
Not this -> http://1vibe.net/tag/50-cent/<p>From the blog post page I want to grab the title and date of that post.<p>The way I was trying to do it was to look through the DOM of the site and look for consistency.
I found consistency in Blogger and Typepad, but WordPress formatting was all over the place from site to site.<p>So I figure I must have been doing it wrong, and that using the XML, RDF, or feeds is the intelligent way of doing it.<p>I'd appreciate it if anyone could help (also, I'm doing it in PHP).
If you are interested only in new posts, you can look in the blogs' RSS feeds. They are nearly always in default locations, and each feed item typically carries the post's title, link, and publication date already.<p>Or you could parse the URL. I had a similar task some time ago, and I went with URLs: Blogger and Typepad are consistent; WordPress depends on the blog, of course, but you could figure out the several most popular patterns (e.g. /yyyy/mm/dd/posttitle, /id-posttitle) and get roughly 90% of all blogs right.<p>Or maybe, just maybe, you could use some third party that has already figured it out via RSS, maybe Technorati?
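To make the two suggestions above concrete, here is a rough sketch, written in Python rather than PHP (the same regular expressions should port to PHP's `preg_match`). Everything named here is an assumption for illustration: the pattern lists, the `is_post_url` and `feed_items` helpers, and the `/section/posttitle` rule (guessed from the 1vibe.net example) are hypothetical, and the feed reader only handles plain RSS 2.0 without namespaces, not Atom or RDF feeds.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Hypothetical patterns for pages that are NOT single posts.
NON_POST_PATTERNS = [
    re.compile(r"^/$"),                                   # blog index
    re.compile(r"^/(category|tag|author|page|search)/"),  # listing pages
    re.compile(r"^/\d{4}/(\d{2}/)?(\d{2}/)?$"),           # date archives
]

# Hypothetical patterns for single posts, checked only after the list above.
POST_PATTERNS = [
    re.compile(r"^/\d{4}/\d{2}/\d{2}/[^/]+/?$"),  # /yyyy/mm/dd/posttitle
    re.compile(r"^/\d+-[^/]+/?$"),                # /id-posttitle
    re.compile(r"^/[^/]+/[^/]+/?$"),              # /section/posttitle (assumed)
]


def is_post_url(url):
    """Guess from the URL alone whether it points at a single post."""
    parts = urlparse(url)
    path = parts.path or "/"
    # A query such as ?s=term is treated as a search page (an assumption).
    if "s" in parse_qs(parts.query):
        return False
    if any(p.search(path) for p in NON_POST_PATTERNS):
        return False
    return any(p.search(path) for p in POST_PATTERNS)


def feed_items(rss_xml):
    """Yield title, link, and date for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        }
```

With these rules, the post URL from the question is accepted and the category, tag, and `/2008/11/` URLs are rejected. Checking the non-post patterns first matters, because an archive path like `/2008/11/` would otherwise also fit the loose `/section/posttitle` rule. Expect to tune both lists per blog platform.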