I used to build this stuff in silicon, so just a quick note: it's quite common in the real world to get incoming values that are out of gamut (as a result of compressing and decompressing data, for example). You need to pin your final values to 0-255 in the cases where they are < 0 or > 255, otherwise they tend to wrap. You also need to carry that extra bit of precision through your math so that you can actually detect the overflow or underflow (not such an issue with floating point).
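
To make the wrap-versus-pin difference concrete, here is a minimal sketch in Python. The names `wrap_u8`, `clamp_u8` and `brighten`, and the 8.8 fixed-point gain, are made up for illustration and not taken from any particular pipeline. Python integers never wrap on their own, so the `& 0xFF` mask stands in for storing the result in an 8-bit register.

```python
def wrap_u8(value: int) -> int:
    """Mimic writing a wider result into an 8-bit register: keep the low 8 bits."""
    return value & 0xFF


def clamp_u8(value: int) -> int:
    """Pin a wider signed intermediate to the 0-255 range."""
    return max(0, min(255, value))


def brighten(pixel: int, gain_q8: int, offset: int) -> int:
    """Hypothetical operation: pixel * gain + offset, with gain in 8.8 fixed point.

    Returns the wide, unclamped intermediate so the caller can still see
    whether the result went below 0 or above 255.
    """
    return ((pixel * gain_q8) >> 8) + offset


# (pixel, gain in 8.8 fixed point, offset): overflow, underflow, in range
for pixel, gain_q8, offset in [(200, 384, 0), (10, 256, -20), (100, 256, 10)]:
    wide = brighten(pixel, gain_q8, offset)
    print(f"wide={wide:4d}  wrapped={wrap_u8(wide):3d}  clamped={clamp_u8(wide):3d}")
```

In the first case the wide result is 300, which wraps to 44 (a bright pixel turns dark) but pins to 255. In the second it is -10, which wraps to 246 but pins to 0. The point of keeping the intermediate wider than 8 bits, with a sign, is that the out-of-range condition is still visible at the moment you clamp.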