The talk is about the system's design decisions, not about ML. Bits I found interesting:<p>* Tushar Chandra believes ML primitives will soon be available to application developers in a standard way, just as distributed-systems and database primitives are becoming available today<p>* There was an early design decision for Sibyl not to be built on top of a custom distributed-systems solution but instead to rely on existing primitives such as MapReduce and GFS<p>* 100B+ training examples with hundreds of features per example; use cases with 50TB of data<p>* Because logging all the features of each example would make the logs grow extremely fast, and because some features are experimental and come and go, the logs contain only the example ID; before training, the data is inner-joined with the example database/GFS (see the first sketch below)<p>* Examples were stored as columns (partitioned by feature, each file containing one feature for many examples) instead of the more common row partitioning, where all features of a batch of examples are stored in the same file. This had great benefits: faster feature transformations, less data to read (since some features are less useful than others), and better compression. Further feature compression was achieved by finding all unique feature values and mapping them to numbers, Huffman-style; total compression achieved was 3-5x (see the second sketch below)<p>Towards the end, the talk contains some use cases with big numbers (throughput per core, e.g.) worth checking out
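<p>Here is a minimal sketch of that join step in Python. The IDs, feature names, and in-memory dicts are made-up stand-ins for the logs and the example store on GFS, not Sibyl's actual formats:<p>
    # Logs record only example IDs (plus label): features come and go,
    # so logging them inline would bloat the logs.
    logs = [
        {"example_id": "e1", "label": 1},
        {"example_id": "e2", "label": 0},
        {"example_id": "e3", "label": 1},
    ]

    # Feature store keyed by example ID (standing in for the example DB/GFS).
    feature_store = {
        "e1": {"f_clicks": 3, "f_age": 0.7},
        "e2": {"f_clicks": 0, "f_age": 0.1},
        # "e3" is absent: e.g. its features expired or were never materialized.
    }

    # Inner join: keep only examples present on both sides.
    training_data = [
        {**row, **feature_store[row["example_id"]]}
        for row in logs
        if row["example_id"] in feature_store
    ]

    print(training_data)
    # [{'example_id': 'e1', 'label': 1, 'f_clicks': 3, 'f_age': 0.7},
    #  {'example_id': 'e2', 'label': 0, 'f_clicks': 0, 'f_age': 0.1}]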
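<p>And a minimal sketch of the columnar layout plus the frequency-based value encoding. Again the data and the helper name are hypothetical; the idea is just that frequent values get the smallest codes (Huffman-style), while true Huffman coding would go further and assign variable-length bit codes:<p>
    from collections import Counter

    # Row-major examples (hypothetical data): one dict per example.
    examples = [
        {"country": "US", "device": "mobile"},
        {"country": "US", "device": "desktop"},
        {"country": "DE", "device": "mobile"},
        {"country": "US", "device": "mobile"},
    ]

    # Columnar layout: one "file" (here, a list) per feature, holding that
    # feature's value for every example. A trainer that ignores a feature
    # simply never reads its column.
    columns = {name: [ex[name] for ex in examples] for name in examples[0]}

    def dictionary_encode(column):
        """Map unique values to small integers, most frequent first,
        so common values get the shortest codes."""
        by_freq = [v for v, _ in Counter(column).most_common()]
        code = {v: i for i, v in enumerate(by_freq)}
        return code, [code[v] for v in column]

    for name, column in columns.items():
        code, encoded = dictionary_encode(column)
        print(name, code, encoded)
    # country {'US': 0, 'DE': 1} [0, 0, 1, 0]
    # device {'mobile': 0, 'desktop': 1} [0, 1, 0, 0]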