The taxonomy comes from a paper we published at SIGMOD'24 (<a href="https://dl.acm.org/doi/10.1145/3626246.3653389" rel="nofollow">https://dl.acm.org/doi/10.1145/3626246.3653389</a>).<p>The key insight of the taxonomy is that not all data transformations in AI systems are equivalent. Some (aggregations, binning, data compression) produce features that can be reused across many models. Others (feature encoding/scaling, LLM text encoding) are specific to one model. A third group, found in real-time AI systems, requires data that is only available at request time.
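<p>To make the three cases concrete, here is a minimal Python sketch. It is my own illustration, not code from the paper: the event data, function names, and the choice of features (average spend, standard scaling, a spend ratio) are all assumptions picked to show where each kind of transformation would sit.

```python
from statistics import mean, pstdev

# Hypothetical raw events: (user_id, purchase_amount). Illustrative data only.
events = [("u1", 10.0), ("u1", 30.0), ("u2", 5.0), ("u2", 15.0)]


def reusable_features(rows):
    """Aggregation: average spend per user.

    Nothing here depends on a particular model, so the result
    can be computed once and shared by many models.
    """
    by_user = {}
    for user, amount in rows:
        by_user.setdefault(user, []).append(amount)
    return {user: mean(amounts) for user, amounts in by_user.items()}


def fit_scaler(values):
    """Scaling: parameters are fitted on one model's training data,
    so the resulting transformation is specific to that model."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero on constant data
    return lambda x: (x - mu) / sigma


def request_time_feature(avg_spend, request_amount):
    """Needs the amount in the incoming request, which does not
    exist until a prediction is actually requested."""
    return request_amount / avg_spend


store = reusable_features(events)           # computed ahead of time, reusable
scale = fit_scaler(list(store.values()))    # fitted for one model
ratio = request_time_feature(store["u1"], 40.0)  # computed per request
model_input = [scale(store["u1"]), ratio]
```

The point of the split is where each step can run: the first can be precomputed and stored, the second has to travel with the model it was fitted for, and the third can only run inside the serving path.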