Optimizing Solr (Or How To 7x Your Search Speed)

116 points by raccoonone, about 13 years ago

13 comments

stevenp, about 13 years ago

My experience working with Solr is that a lot of the time people don't have a good working knowledge of how to optimize an index because it's so easy not to. At my last job, the initial implementation involved storing the full text of millions of documents, even though they never needed to be retrieved (just searched). If you're running Solr as a front-end search for another database, the best way I've seen to optimize performance is just to make sure you're not storing data unnecessarily.

Maybe everyone should already know this, but I was working on a very smart team, and we totally missed this initially. Setting "stored" to false for most fields resulted in a 90% reduction of the index size, which means less to fit into RAM.
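For anyone wanting to try the "stored" change, here is a minimal sketch of what it could look like through the Schema API in newer Solr versions (at the time of this thread it would have been a `stored="false"` attribute in schema.xml). The field name `body` and the type `text_general` are assumptions for illustration, not details from the comment above. The sketch only builds the JSON command; it does not send it.

```python
import json


def unstored_field_command(name, field_type="text_general"):
    """Build a Schema API 'replace-field' command that keeps a field
    searchable (indexed) but does not keep its original text (stored)."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,  # assumed type; use whatever your schema defines
            "indexed": True,     # the field can still be searched
            "stored": False,     # the original value is not kept in the index
        }
    }


# Hypothetical field name; this would be POSTed to the collection's /schema endpoint.
payload = json.dumps(unstored_field_command("body"))
print(payload)
```

Existing documents need to be reindexed before a change like this has any effect on index size.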

fizx, about 13 years ago

Hey, Websolr founder here.

Websolr's indexes return in under 50ms for queries of average complexity.

The more expensive queries usually involve "faceting" or sorting a large number of results. For example, say you search GitHub for "while." GitHub used to do language facets, where it would tell you that out of a million results, 200103 files were in JavaScript, 500358 files were in C, etc.

The problem with this is that you have to count over a million records, on every search! Unlike most search operations, which are IO-bound, the counting can be CPU-bound, so sharding on one box will let you take advantage of multiple cores.

raccoonone is "sorting on two dimensions, a geo bounding box, four numeric range filters, one datetime range filter, and a categorical range filter." This should put him in a CPU-bound range (in particular because of the sort).

Websolr has customers on sharded plans, but they are usually used in custom sales cases where we're serving many, many millions of documents. We'll look at adding sharding as an option to our default plans, so that they'll be more accessible for people like raccoonone. In the meantime, if you send an email to info@onemorecloud.com, we'll try to accommodate use cases like this.

Edit: Also, other possible optimizations include (1) indexing in the same order you will sort on, if you know ahead of time, and (2) using the TimeLimitedCollector.
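To make the facet-counting point concrete, here is a rough sketch of why the work splits cleanly across shards: each shard counts its own matches, and the partial counts are summed. This is an illustration only, not how Solr implements faceting. The `language` field and the toy documents are made up, and Python threads will not give real CPU parallelism the way separate shards on separate cores would.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def facet_counts(shard_docs):
    """Count matching documents per language within a single shard."""
    return Counter(doc["language"] for doc in shard_docs)


def sharded_facet_counts(shards):
    """Count every shard independently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(facet_counts, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Toy data: two shards holding the documents that matched a query.
shards = [
    [{"language": "c"}, {"language": "javascript"}],
    [{"language": "c"}, {"language": "c"}],
]
merged = sharded_facet_counts(shards)
print(merged)
```

The merge step is cheap compared with the counting, which is why the counting is the part worth spreading over cores.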

matan_a, about 13 years ago

There are quite a few other performance-related points to think about for Solr query and indexing speed.

Here are some that come to mind right now that are very useful:

- Be smart about your commit strategy if you're indexing a lot of documents (commitWithin is great). Use batches too.
- Many times, I've seen Solr index documents faster than the database could create them (considering joins, denormalizing, etc). Cache these somewhere so you don't have to recreate the ones that haven't changed.
- Set up and use the Solr caches properly. Think about what you want to warm and when. Take advantage of the Filter Queries and their cache! It will improve performance quite a bit.
- Don't store what you don't need for search. I personally only use Solr to return IDs of the data. I can usually pull that up easily in batch from the DB / KV store. Beats having to reindex data that was just for show anyway...
- Solr (Lucene really) is memory greedy and picky about the GC type. Make sure that you're sorted out in that respect and you'll enjoy good stability and consistent speed.
- Shards are useful for large datasets, but test first. Some query features aren't available in a sharded environment (YMMV).
- Solr is improving quickly and v4 should include some nice cloud functionality (zookeeper ftw).
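A few of the points above (filter queries, returning only IDs, commitWithin, batching) can be sketched as plain request building. The host, core name, field names, and batch size below are all assumptions for illustration. `q`, `fq`, `fl`, `rows`, and `commitWithin` are standard Solr request parameters, but nothing here is sent over the network.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/mycore"  # hypothetical host and core name


def select_url(text, filters, rows=10):
    """Search URL that returns only IDs and puts each restriction in its
    own fq parameter, so Solr can cache every filter separately."""
    params = [("q", text), ("fl", "id"), ("rows", rows)]
    params += [("fq", f) for f in filters]
    return SOLR + "/select?" + urlencode(params)


def update_url(commit_within_ms=10000):
    """Update URL that lets Solr decide when to commit, within a deadline."""
    return SOLR + "/update?" + urlencode({"commitWithin": commit_within_ms})


def batches(docs, size):
    """Yield documents in fixed-size batches instead of one request per doc."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


url = select_url("solr tuning", ["category:books", "in_stock:true"])
print(url)
print(update_url(5000))
print(list(batches(list(range(5)), 2)))
```

Keeping each filter in its own `fq` matters because the filter cache is keyed per filter query, so a filter reused across searches can be served from cache.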

gpapilion, about 13 years ago

I'm curious what your queries look like, because these performance numbers are awful.

I'm currently running an index that is 96 million documents (393GB) using a single shard with a response time of 18ms.

If you're comfortable with it, I'd suggest profiling Solr. We found that we were spending more time garbage collecting than expected, and spent some time to speed it up and minimize its impact. Most of this was related to our IO though.

Second, don't use the default settings. Adjust the cache sizes, rambuffer, and other settings so they are appropriate for your application.

I'd also start instrumenting your web application so that you can start testing the removal of query options that may be creating your CPU usage issue. You get a lot of bang for your buck this way, and you may find the options you were using provide no meaningful improvement in search. A metric like mean reciprocal rank can go a long way toward improving your performance.
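For reference, mean reciprocal rank is the average over queries of 1/rank of the first relevant result, with 0 for a query that returns no relevant result. A minimal sketch, with made-up document IDs and relevance judgments:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """ranked_results: one ordered list of result IDs per query.
    relevant: one set of relevant IDs per query, in the same order."""
    total = 0.0
    for results, wanted in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in wanted:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


# Query 1 finds its relevant doc at rank 2, query 2 at rank 1.
results = [["a", "b", "c"], ["x", "y", "z"]]
judgments = [{"b"}, {"x"}]
print(mean_reciprocal_rank(results, judgments))  # (1/2 + 1/1) / 2 = 0.75
```

Computing this before and after removing a query option gives a number to compare against the CPU time saved.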

falcolas, about 13 years ago

Our company had to set up a Solr implementation with some pretty crazy requirements (hundreds of shards, tens of thousands of requests per second, etc), and we ended up with 4 machines: one for indexing, one as a backup indexer/searcher, and 2 just doing load-balanced searches. Replication was interesting but easy to set up (since it's basically an rsync of the indexes between servers).

The end result works very well, though it's a real memory hog when you get into the "hundreds" of shards on an individual server.

markelliot, about 13 years ago

One thing I think would be valuable to know here is how many threads each shard is using, and what effect changing that number would have. (Or rather: why is it useful to explicitly shard vs. running one big instance with all of the memory and the same total number of threads? Queuing theory would lead me to believe the latter would be better.)
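The queuing-theory intuition can be checked with the textbook formulas: one shared queue feeding c servers (M/M/c, via the Erlang C formula) against c separate single-server queues that each get 1/c of the traffic (M/M/1). The rates below are made-up numbers. The model also assumes each query is handled by exactly one thread, so it says nothing about the case where a single CPU-bound query is split across shards and run in parallel.

```python
from math import factorial


def mm1_response_time(arrival_rate, service_rate):
    """Mean time in system for a single-server M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable")
    return 1.0 / (service_rate - arrival_rate)


def mmc_response_time(arrival_rate, service_rate, servers):
    """Mean time in system for an M/M/c queue, using Erlang C."""
    offered = arrival_rate / service_rate  # offered load in Erlangs
    utilization = offered / servers
    if utilization >= 1:
        raise ValueError("queue is unstable")
    tail = offered ** servers / factorial(servers) / (1 - utilization)
    head = sum(offered ** k / factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)  # probability an arrival has to queue
    mean_wait = p_wait / (servers * service_rate - arrival_rate)
    return mean_wait + 1.0 / service_rate


# Assumed load: 3 queries per time unit, each thread serves 1 per time unit.
split = mm1_response_time(3.0 / 4, 1.0)   # four separate queues
pooled = mmc_response_time(3.0, 1.0, 4)   # one queue, four threads
print(split, pooled)
```

Under these assumptions the pooled setup has the lower mean response time, which matches the intuition in the comment. Whether it holds for a real Solr workload depends on how much of a single query's cost can be parallelized.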

ABS, about 13 years ago

Take a look at this presentation if you are interested in NRT Solr (although it was done before Solr added the latest NRT features): Tuning Solr in Near Real Time Search environments: https://vimeo.com/17402451

snikolic, about 13 years ago

Thoughtful sharding is not an optimization, it's a requirement at scale.

sudoman69, about 13 years ago

Does anyone have experience with adding shards on the fly? We have a requirement where we get millions of docs every day, and we need an environment that can handle real-time data as well as previous days' data. Any thoughts on this will be appreciated.

zargath, about 13 years ago

Can anybody recommend a good way to get started with Solr?

mthreat, about 13 years ago

I'd love to have them try out Searchify's hosted search and see how fast it is. The key to fast search is RAM, which is why we run our search indexes from RAM (not cheap), and most queries are served within 100ms. If you're the author of the blog post, please contact me, chris at searchify, if you'd like to do this comparison, and I'll set you up with a test acct.

phene, about 13 years ago

I improved Solr performance by switching to Elasticsearch. =)

chenli, about 13 years ago

This is the founder of Bimaple. We provide hosted search and license our engine software, with significantly better performance and capabilities than Lucene: (1) supporting "Google Instant" search experiences on your data; (2) powerful error correction through fuzzy search; (3) optimized for mobile users, with instant fuzzy search at a speed 10x-100x higher than Lucene; (4) optimized for geo-location apps; (5) designed and developed from the ground up in C++. We have demonstrations on our homepage. If you are interested in using our service or software, please email contact AT bimaple.com.