The last time I did anything like this, the easiest workflow I found was to use your favorite high-level runtime for training and just implement a serializer that converts the model into source code for your target embedded system. Hand-code the inference loop. This is exactly the strategy TFA landed on. (Rough code sketches of a few of these points are at the end of this comment.)

One advantage of having it implemented in code is that you can observe and think about the instructions being generated. TFA didn't talk at all about something pretty important for small/fast neural networks -- the normal "cleanup" code (padding, alignment, tail handling for awkward lengths, data-dependent horizontal sums, etc.) can dwarf the actual mul->add execution time. You might want to, e.g., ensure your dimensions are all multiples of 8. You definitely want to store weights column-major instead of row-major if the network is written as vec @ mat, and row-major if it's written as mat @ vec, so that each output's weights are contiguous in memory.

When you're baking weights and biases into code like that, use an affine representation -- explicitly pad the input with the number one, along with however many extra zeroes you need to satisfy whatever other length-padding requirements make sense for your problem (usually none for embedded, but this is a similar workflow to low-resource networks on traditional computers, where you probably want vectorization).

Floats are a tiny bit hard to avoid for dot products. For similar precision, you need nearly twice the bit count in a fixed-point representation just to make the multiplies work, plus some extra bits proportional to the log2 of the dimension. E.g., if you trained on f16 inputs then you'll have roughly comparable precision with i32 fixed-point weights, and that's assuming you go through the effort to scale and shift everything into an appropriate numerical regime. Twice the instruction count (or thereabouts) on twice the register width makes fixed-point 2-4x slower than a hardware float at similar precision, supposing those wide instructions even exist for your microcontroller, and soft floats are closer to 10x slower for multiply-accumulate. If you're emulating wide integer instructions, just use soft floats. If you don't care about a 4x slowdown, just use soft floats.

Training can be a little finicky for small networks. At a minimum, you probably want train/test/validation splits and many training runs. There are other techniques if you want to go down a rabbit hole.

Other ML architectures can be much more performant here. Gradient-boosted trees are already SOTA on many of these problems, and oblivious trees map extremely well to normal microcontroller instruction sets. By skipping the multiplies, your fixed-point precision is on par with floats of similar bit-width, making quantization a breeze.
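
To make the serializer-plus-hand-written-loop idea from the top concrete, here's a minimal sketch in C of what the generated artifact and the hand-coded inference pass might look like for a tiny two-layer network. Everything here is invented for illustration (the dimensions, the tanh hidden layer, the names); the actual weight values would be emitted by your training-side exporter, and the arrays are laid out so each output's weights are contiguous, i.e. the transposed/column-major layout mentioned above.

    /* model_weights.h -- hypothetical output of the training-side serializer. */
    #include <math.h>
    #include <stdio.h>

    #define IN_DIM  8   /* padded up to a multiple of 8 */
    #define HID_DIM 8
    #define OUT_DIM 1

    /* w1[j][i] = weight from input i to hidden unit j, so each hidden
     * unit's weights are contiguous and the inner loop walks memory
     * sequentially (the "column-major for vec @ mat" point above). */
    static const float w1[HID_DIM][IN_DIM]  = {{0}};  /* filled in by the exporter */
    static const float b1[HID_DIM]          = {0};
    static const float w2[OUT_DIM][HID_DIM] = {{0}};
    static const float b2[OUT_DIM]          = {0};

    /* Hand-written inference loop: dense -> tanh -> dense. */
    static void infer(const float x[IN_DIM], float y[OUT_DIM])
    {
        float h[HID_DIM];
        for (int j = 0; j < HID_DIM; ++j) {
            float acc = b1[j];
            for (int i = 0; i < IN_DIM; ++i)
                acc += w1[j][i] * x[i];
            h[j] = tanhf(acc);
        }
        for (int k = 0; k < OUT_DIM; ++k) {
            float acc = b2[k];
            for (int j = 0; j < HID_DIM; ++j)
                acc += w2[k][j] * h[j];
            y[k] = acc;
        }
    }

    int main(void)
    {
        float x[IN_DIM] = {0}, y[OUT_DIM];
        infer(x, y);
        printf("%f\n", (double)y[0]);
        return 0;
    }

The nice part of generating something this plain is that it's all visible in the compiler output: you can read the disassembly and see exactly how much of it is mul->add versus cleanup.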
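
The affine trick looks like this in the same style: fold each output's bias into one extra weight slot, have the caller put a constant 1.0 in the matching input slot, and pad with zeroes up to whatever length is convenient. The inner loop then has no bias special case at all. Dimensions are again placeholders.

    #include <stdio.h>

    #define IN_DIM  7
    #define AFF_DIM 8   /* IN_DIM + 1 for the bias slot; already a multiple of 8 */
    #define OUT_DIM 4

    /* Last column of each row is the bias; the exporter writes it there. */
    static const float w[OUT_DIM][AFF_DIM] = {{0}};

    static void affine_layer(const float x[AFF_DIM], float y[OUT_DIM])
    {
        for (int j = 0; j < OUT_DIM; ++j) {
            float acc = 0.0f;              /* no separate bias add */
            for (int i = 0; i < AFF_DIM; ++i)
                acc += w[j][i] * x[i];
            y[j] = acc;
        }
    }

    int main(void)
    {
        float x[AFF_DIM] = {0};
        float y[OUT_DIM];
        x[IN_DIM] = 1.0f;                  /* the affine "one" */
        affine_layer(x, y);
        printf("%f\n", (double)y[0]);
        return 0;
    }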
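
For the fixed-point claim, here's a sketch of where "nearly twice the bit count" comes from: multiplying two 16-bit fixed-point values already needs a 32-bit product, and summing N of them needs roughly log2(N) extra headroom bits on top of that, so the accumulator ends up in 32-64 bit territory before you shift back down. The Q1.15 format and the dimension are arbitrary choices for the example.

    #include <stdint.h>
    #include <stdio.h>

    #define Q15_ONE 32768   /* Q1.15: value = raw / 32768.0 */
    #define DIM     16      /* log2(16) = 4 extra headroom bits in the sum */

    /* Fixed-point dot product: each 16x16 multiply needs 32 bits for the
     * product alone, plus ~log2(DIM) bits for the running sum, so
     * accumulate wide and shift back to Q1.15 at the end. */
    static int16_t dot_q15(const int16_t *a, const int16_t *b, int n)
    {
        int64_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int32_t)a[i] * (int32_t)b[i];   /* Q1.15 * Q1.15 -> Q2.30 */
        acc >>= 15;                                  /* back to Q1.15 (truncating) */
        if (acc >  32767) acc =  32767;              /* saturate */
        if (acc < -32768) acc = -32768;
        return (int16_t)acc;
    }

    int main(void)
    {
        int16_t a[DIM], b[DIM];
        for (int i = 0; i < DIM; ++i) { a[i] = Q15_ONE / 4; b[i] = Q15_ONE / 4; }
        /* 16 * (0.25 * 0.25) = 1.0, which saturates to just under 1.0 in Q1.15 */
        printf("%f\n", dot_q15(a, b, DIM) / 32768.0);
        return 0;
    }

And that's before any of the per-layer rescaling you need to keep intermediate values in range, which is the "scale and shift everything into an appropriate numerical regime" part.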
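
On the oblivious-tree point: every node at a given depth shares one (feature, threshold) pair, so evaluating a depth-D tree is just D integer compares packed into a leaf index -- no multiplies anywhere, which is why narrow fixed-point thresholds lose nothing relative to floats. The toy tree below is made up; in practice you'd export an ensemble of these (leaf values summed across trees) from something like CatBoost.

    #include <stdint.h>
    #include <stdio.h>

    #define DEPTH    4
    #define N_LEAVES (1 << DEPTH)

    /* One oblivious tree: DEPTH splits plus 2^DEPTH leaf values describe
     * the whole tree.  Features, thresholds, and leaves are plain
     * integers (fixed-point), so evaluation is compares and shifts only. */
    typedef struct {
        uint8_t feature[DEPTH];     /* which input feature each level tests */
        int16_t threshold[DEPTH];   /* fixed-point threshold for that level */
        int16_t leaf[N_LEAVES];     /* fixed-point leaf scores */
    } oblivious_tree;

    static int16_t eval_tree(const oblivious_tree *t, const int16_t *x)
    {
        unsigned idx = 0;
        for (int d = 0; d < DEPTH; ++d)
            idx = (idx << 1) | (x[t->feature[d]] > t->threshold[d]);
        return t->leaf[idx];        /* one table lookup, no multiplies */
    }

    int main(void)
    {
        /* toy tree and input, just to show the call shape */
        oblivious_tree t = {
            .feature   = {0, 1, 0, 2},
            .threshold = {100, -50, 300, 0},
            .leaf      = {0},
        };
        t.leaf[9] = 1234;
        int16_t x[3] = {200, -100, 40};
        printf("%d\n", eval_tree(&t, x));   /* this input lands on leaf 9 */
        return 0;
    }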