Search is a gold mine, and I don't understand why more people aren't diving into building niche search engines. Sure, you can't really compete with Google on size, but there are a lot of nooks and crannies online where you can pick up valuable search traffic around the edges.<p>At least, that's why I'm working on a search engine for financial news at <a href="http://Newsley.com/search" rel="nofollow">http://Newsley.com/search</a>. (We're focused on building the crawlers and the index right now. Search is _very_ alpha.)<p>After reading this article, I feel validated in a bunch of the decisions I've been making. I've been running on EC2, but their disk IO is slow as molasses. So I'm starting to build servers and put them in my garage, and I'll be migrating to them over the next few months. Pretty much everyone I talk to thinks running servers in your garage is a terrible idea, but I can't think of any other way to do this more cheaply and still have control over my hardware. It's nice to read that I'm not crazy for thinking this.<p>It was also great to read that on early search engines, the bulk of the work was done by small teams. Being the only dev, at times I think I'm a bit crazy for trying to bootstrap a search startup. Again, it was nice to read that it's not all that crazy to try to do it on my own.
This is from 2004. A lot of the paper still applies in principle, but I'd argue that far fewer people are champing at the bit to get into the search business these days. Now it's all "social" or "game" related.
In other words, avoid spending money and refine your algorithms first. Faster machines may be tempting, but relying on them makes scaling horribly expensive down the road.
I think that for most people, writing a search engine is overkill when there are existing options out there.<p>If you want to search a subset of sites, then Google CSE is really all you need, plus whatever bells & whistles you'd like to add around it. I've done that here: <a href="http://searchESLCafe.com" rel="nofollow">http://searchESLCafe.com</a>, adding "recent searches", search via wildcard subdomain (e.g. foo.searchESLCafe.com, bar.searchESLCafe.com, or foo_bar.searchESLCafe.com), and customizing the heck out of Google CSE's options.<p>Is there demand out there for a search engine that parses the results into something informative at a glance? I'm not so sure it's the user's first priority. Or, to put it another way, there's plenty of hard-to-reach info out there that you can hand users via a customized Google CSE, and they don't mind doing the legwork of clicking through the query results and finding their own answers.<p>It's a lot more important to have an accurate search algorithm than drill-down-related bells & whistles.<p>Google does a great job of returning solid results for any subset of sites, so why not let Google handle it and concentrate on the other stuff?