One of the long-term things we want to address in Redis is the lack of tools for analyzing the dataset that ship directly with the Redis distribution, as first-class citizens.<p>There are a few, like redis-cli --latency or redis-cli --bigkeys (I encourage you to try both if you never have), but more are needed to profile memory, find big keys, pretty-print the Slow Log output, and so forth.<p>External tools are surely of great help, but I hope that in the long run we'll have the most important things integrated into redis-cli directly.
Great solution, but...<p>"Maybe I should revert to storing ratings in PostgreSQL and accept what would certainly be a large performance hit during recommendation generation."<p>I wonder if you prematurely optimized here. Did you try Postgres in the first place? What was the performance like? I can't help but wonder if you dismissed Postgres simply because it wasn't as sexy as Redis.
If the values you are storing are integers, you should also look at your zset-max-ziplist-entries config setting. I've been able to shrink a 30GB Redis memory footprint to just under 3GB. The caveat is that all search operations in ziplists are sequential (since they use variable-length encodings for strings and integers), but oftentimes scanning a short array is as fast as or faster than hashing or locating an entry in a skiplist. There are more details about this at <a href="http://redis.io/topics/memory-optimization" rel="nofollow">http://redis.io/topics/memory-optimization</a>, and I've submitted a patch to Redis that makes it even more efficient (<a href="https://github.com/antirez/redis/issues/469" rel="nofollow">https://github.com/antirez/redis/issues/469</a>).
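A minimal sketch of what tuning that looks like, assuming redis-py; the key name, member, and thresholds below are made up, and the right limits depend on your data:

```python
import redis

r = redis.Redis()

# Assumed illustrative thresholds: sorted sets smaller than both limits are
# stored in the compact ziplist encoding instead of a hash table + skiplist.
r.config_set("zset-max-ziplist-entries", 128)
r.config_set("zset-max-ziplist-value", 64)

r.zadd("u:1234:ratings", {"beer:42": 5})        # hypothetical key and member
print(r.object("encoding", "u:1234:ratings"))   # b'ziplist' while it stays small
                                                # (newer Redis reports b'listpack')
```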
For relational databases, we already have some pretty good rules to avoid problems like this - the normal forms (specifically BCNF, which is "the" normal form to reduce redundancy inside a single relation). I wonder what rules apply to non-relational models? Has anyone done research into this?
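For a concrete (and hypothetical) flavour of the redundancy BCNF targets: a single relation ratings(user, beer, brewery) with the dependency beer → brewery repeats each beer's brewery on every rating row; since beer is not a key of that relation, BCNF says to decompose it into ratings(user, beer) and beers(beer, brewery), and the repetition disappears.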
>What could possibly be causing it to use that much memory? I considered the length of my keys...Maybe Redis was eating memory by storing tens of thousands of really long key names in RAM? I decided to try shortening them to a more reasonable format: u:1234:lb for example.<p>Wow ... why would you reengineer your naming conventions without at least doing some back-of-a-napkin arithmetic beforehand and comparing the potential savings? That strikes me as ... insane.
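For what it's worth, the napkin math is a one-liner (the numbers below are assumptions for illustration, not the OP's actual figures):

```python
# Hypothetical back-of-the-napkin estimate of what shortening key names can save.
keys = 100_000               # assumed number of keys
bytes_saved_per_key = 30     # assumed characters trimmed from each key name
print(keys * bytes_saved_per_key / 1024 / 1024)  # ~2.9 MB: a sliver of the footprint in question
```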
tl;dr version: his recommendation algorithm stored an amount of data that was Θ(n^2) + Θ(n*m), where n is the number of users and m the number of beers. He optimized it by putting a constant limit on the storage space used.<p>The clearest takeaway: if you want to reduce the disk or memory footprint of your DB, figure out which tables/rows/columns are consuming the lion's share of the space (there's almost always one column that's way worse than the rest) and then figure out how to change your app logic to store much less data in that column.<p>The OP's lessons learned apply no matter which DB you're using!
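One hedged way to read that constant limit, with made-up numbers (not the OP's exact scheme): cap what you keep per user, say the top-k neighbours, and the quadratic term goes away.

```python
# Hypothetical illustration: capping per-user storage turns quadratic growth into linear.
n_users, n_beers, k = 50_000, 10_000, 100    # assumed sizes

pairwise = n_users * (n_users - 1) // 2      # Θ(n^2): one similarity score per user pair
ratings = n_users * n_beers                  # Θ(n*m): worst case, every user rates every beer
capped = n_users * k                         # keep only the k nearest neighbours per user

print(pairwise, ratings, capped)
```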
This reminds me of a situation I've seen, and found myself in, far too many times. When I encounter a performance problem, I immediately jump to the areas where, during initial development, I stopped and said "this might cause a problem with a large amount of data," instead of doing the proper thing: hooking up a profiler or doing real testing to determine what the actual problem is. Rarely is it in an area I previously identified as a potential future problem.<p>Ego ("I know what the problem is!") and the drive to fix problems quickly always get in the way of finding the solution. Tales like this are a great reminder not to assume you know what the problem is.
I was thinking you had switched from string keys to hashes, since I had read a post from antirez about hashes being more efficient for this type of thing. But as they say, measure twice, cut once. After all, profiling is the first step of optimisation.
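If that's the trick I'm thinking of, it packs many small values into a few hashes so each hash stays under hash-max-ziplist-entries and gets the compact encoding. A rough sketch with redis-py; the bucket size and key names here are made up, not the OP's scheme:

```python
import redis

r = redis.Redis()
BUCKET = 1000  # assumed bucket size; keep each hash under hash-max-ziplist-entries

def set_user_value(user_id, value):
    # e.g. user 1234 lands in hash "u:1" under field "234" instead of its own top-level key
    r.hset(f"u:{user_id // BUCKET}", str(user_id % BUCKET), value)

def get_user_value(user_id):
    return r.hget(f"u:{user_id // BUCKET}", str(user_id % BUCKET))
```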
I see a few good lessons here:<p>1. You can learn a lot about your own code and your tools by digging in and debugging and experimenting.<p>2. As macspoofing argues, having a correct mental model of your application, and your tools, is what separates professional programmers from amateurs and beginners. You don't need to do much math or understand Fermi estimates to figure out that removing 30 bytes of excess key name from, say, 100,000 rows is only going to get back a few megs. However, doing the experiment validated your hypothesis and taught you something about how Redis (probably) stores keys internally. For many real problems with real data, estimating is orders of magnitude faster and simpler (and probably safer) than experimenting, so it's a skill to cultivate.<p>3. This is exactly the kind of data management problem that relational databases are <i>really good</i> at. Relational theory is based on sets and RDBMSs are good at working with sets. But you would need to normalize your schema and know how to write efficient SQL -- perhaps an exercise for another day. RDBMSs and SQL are not easy to learn and take real study and practice, but the rewards are significant. I agree with AznHisoka that dropping PostgreSQL (or MySQL) in favor of Redis was a premature optimization, but you would need to spend a lot of time mastering relational database design and SQL to get the benefits of an RDBMS, whereas Redis doesn't make the same demand. If you posted more details of your data and how you need to query it on stackoverflow you'd get some free lessons in schema design and normalization.<p>4. A database of a few tens or even hundreds of thousands of rows, taking up only 50 megs of RAM, is trivial for any RDBMS. Not trivial as in not important, but in the same sense that delivering a carton of milk with a semi truck is a trivial use of the truck. Your data set would not begin to run into any performance limits of a stock RDBMS. I'm not criticizing or making fun, just stating a fact.<p>5. Don't assume you know what the problem is when debugging or optimizing -- it's too easy to go into the weeds and get frustrated. Over time you'll probably get better at narrowing the range of possibilities, but even the most experienced programmers are frequently wrong about the source of a bug or performance problem. Your methodology is correct; do the same thing next time.<p>You say that you are relatively new to programming and databases, and there's nothing wrong with that. We've all made the same kinds of decisions you have, and some of us are still making them without the kind of introspection you've documented.<p>If your code seems too complex (something you develop a feel for over time) you need to choose better data structures. With the right data structures, the code writes itself. When you speculate about using an array column type in Postgres I know with 99.99% certainty that you are not normalizing -- nothing you've described can't be accomplished with normalized relations (sketched below). And you'd be able to do all of your queries with simple SQL statements.<p>I recommend Chris Date's book "Database in Depth: Relational Theory for Practitioners," which is aimed at programmers rather than DBAs.<p>I've written more on this (and linked to some more good books) on my site at <a href="http://typicalprogrammer.com/?p=14" rel="nofollow">http://typicalprogrammer.com/?p=14</a>
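For what it's worth, here is a minimal sketch of the kind of normalized schema point 3 is getting at. The table and column names are made up, and SQLite is used only so the snippet runs without a server; the same DDL works in PostgreSQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE beers   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ratings (
    user_id INTEGER NOT NULL REFERENCES users(id),
    beer_id INTEGER NOT NULL REFERENCES beers(id),
    score   INTEGER NOT NULL,
    PRIMARY KEY (user_id, beer_id)  -- one rating per user per beer, no array columns needed
);
""")

# One user's ratings are a simple join away.
rows = db.execute("""
    SELECT b.name, r.score
    FROM ratings r JOIN beers b ON b.id = r.beer_id
    WHERE r.user_id = ?
""", (1,)).fetchall()
```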