My last job was at a company that develops one of the most popular mobile MMO action games in the world (with hundreds of millions of installs). It stores data in large Cassandra clusters (depending on the platform, a DC contains up to a hundred nodes).<p>I designed and developed a command-line utility/daemon for performing one-off and regular backups of production data. The solution is able to:<p>- work with a 24/7 live Cassandra cluster containing tens of nodes<p>- exert a tolerable, tunable performance/latency footprint on the nodes<p>- back up and restore anywhere from hundreds of GBs to multiple TBs of data as fast as possible, given the constraints of the legacy data model and the concurrent load from online players; observed throughput is 5-25 MB/s, depending on the environment<p>- provide highly flexible declarative configuration of the subset of data to back up and restore (full table exports; raw CQL queries; programmatic extractors), with first-class support for foreign-key dependencies between extractors, compiled into a highly parallelizable execution graph<p>There was an "a-ha!" moment when I realized that this utility could be used not only for backups of production data, but for a whole range of day-to-day maintenance tasks, e.g.:<p>1) Restore a subset of production data onto development and test machines. This solves the issue of developers and QA engineers having to fiddle with the database when they need to test something, whether it's a new feature or a bugfix for production. They can just restore a small subset of real, meaningful, consistent data onto their environment with a bit of configuration and a simple command. Developers may do this manually when needed, and the QA environment can be restored to a clean state automatically by the CI server at night.<p>2) Perform arbitrary updates on graphs of database entities.
It's a common approach to traverse Cassandra tables, possibly with a column filter, in order to process or update some of the attributes (e.g. iterate through all users and send a push notification to each of them). The more users there are, the longer this takes, and the more it degrades the cluster's performance and latency for other concurrent operations. With a tool like the one described above, you can clone the user data onto a separate machine beforehand (e.g. at night) and then run the maintenance operation during the day, on data that is still reasonably up to date.<p>All in all, it was a fun devops experience, and one I'm quite fond of. With just a little creativity and out-of-the-box thinking, there are lots of ways to improve the typical workflow of working with data.
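To illustrate the foreign-key-dependency idea mentioned earlier: a set of extractor definitions can be compiled into parallel stages with a plain topological sort. This is only a sketch of the concept, not the real tool's code; the extractor names, the queries, and the `compile_stages` helper are all hypothetical.

```python
# Sketch: hypothetical extractor specs compiled into parallelizable stages.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each extractor names the query it runs and the extractors it depends on
# (foreign-key style: parent entities must be exported before children).
extractors = {
    "users":       {"query": "SELECT * FROM users", "deps": []},
    "inventories": {"query": "SELECT * FROM inventories WHERE user_id = ?", "deps": ["users"]},
    "guilds":      {"query": "SELECT * FROM guilds WHERE id = ?", "deps": ["users"]},
    "guild_logs":  {"query": "SELECT * FROM guild_logs WHERE guild_id = ?", "deps": ["guilds"]},
}

def compile_stages(extractors):
    """Group extractors into stages; each stage only depends on earlier ones."""
    ts = TopologicalSorter({name: spec["deps"] for name, spec in extractors.items()})
    ts.prepare()
    stages = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # all dependencies already satisfied
        stages.append(ready)
        ts.done(*ready)
    return stages

print(compile_stages(extractors))
# -> [['users'], ['guilds', 'inventories'], ['guild_logs']]
```

Everything within one stage is independent, so those extractors can run concurrently against the cluster, which is where the "highly parallelizable execution graph" comes from.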