The front end that polls RSS feeds and sitemaps and otherwise imports content into a database isn't too hard.<p>What's devilishly difficult is the nature of "news" itself.<p>That is, when a "news" story happens (say, Will Smith slaps Chris Rock at the Oscars) there will be hundreds of articles about it from mainstream publications right away.<p>For the news feed to be manageable you have to cluster these; otherwise you are going to be furious that you can't find any other news in the middle of all that "spam".<p>Defining the cluster boundaries is tricky. For instance, 'Will Smith v. Chris Rock' is an ongoing story. There is news about the initial event, but there could also be news about possible lawsuits, apologies, hard feelings, revenge. People are also going to write opinion pieces blowing it out their ass forever. So it's not as simple as "say something once, why say it again..."; rather, you have to be able to identify the initial event and then identify the string of later events connected to it.
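<p>To make the clustering idea concrete, here is a rough sketch, not a real solution: it greedily groups headlines by word overlap (Jaccard similarity) and attaches a new headline to a story if it resembles any earlier headline in that story, which is how a follow-up like an apology can chain onto the original event. Everything here is an assumption for illustration: the names `cluster_headlines`, `STOPWORDS`, and the 0.4 threshold are made up, and a real system would likely need entity extraction, timestamps, and something better than bag-of-words.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "the", "at", "on", "to", "by", "of", "in", "for", "and", "after"}


def tokens(title: str) -> set[str]:
    """Lowercase a headline and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_headlines(titles: list[str], threshold: float = 0.4) -> list[list[str]]:
    """Greedy single-link clustering.

    Each headline joins the existing story containing its most similar
    headline, provided that similarity reaches the (arbitrary) threshold.
    Otherwise it starts a new story.
    """
    stories: list[list[tuple[str, set[str]]]] = []
    for title in titles:
        toks = tokens(title)
        best_story, best_score = None, 0.0
        for story in stories:
            # Single link: compare against every member, keep the best match.
            score = max(jaccard(toks, member_toks) for _, member_toks in story)
            if score > best_score:
                best_story, best_score = story, score
        if best_story is not None and best_score >= threshold:
            best_story.append((title, toks))
        else:
            stories.append([(title, toks)])
    return [[t for t, _ in story] for story in stories]


if __name__ == "__main__":
    sample = [
        "Will Smith slaps Chris Rock at the Oscars",
        "Will Smith slaps Chris Rock on stage at Oscars",
        "Will Smith apologizes to Chris Rock after Oscars slap",
        "Fed raises interest rates by half a point",
    ]
    for story in cluster_headlines(sample):
        print(story)
```

<p>Single-link matching is what lets a story grow over time, but it is also where the boundary problem bites: set the threshold too low and unrelated stories that share a name get merged, too high and every follow-up becomes its own "event".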