Among other things, I have worked with and developed technology in the uncompressed professional imaging domain for decades. One of the things I always watch out for is precisely the terminology and language used in this release:

"for equal perceptual quality"

Put a different way: we can fool your eyes/brain into thinking you are looking at the same images.

For most consumer use cases where the objective is to view images --rather than process them-- this is fine. The human vision system (HVS, eyes + brain processing) is tolerant of and can handle lots of missing or distorted data. However, the minute you get into having to process the images in hardware or software, things can change radically.

Take, as an example, color sub-sampling. You start with a camera with three distinct sensors. Each sensor has a full-frame color filter. They are optically coupled to see the same image through a prism. This means you sample the red, green and blue portions of the visible spectrum at full spatial resolution. If we are talking about a 1K x 1K image, you are capturing one million pixels each of red, green and blue.

BTW, I am using "1K" to mean one thousand, not 1024.

Such a camera is very expensive and impractical for consumer applications. Enter the Bayer filter [0].

You can now use a single sensor to capture all three color components. However, instead of having one million samples for each component you have 250K red, 500K green and 250K blue. Still a million samples total (that's the resolution of the sensor), yet you've sliced it up into three components.

This can be reconstructed into a full one million samples per color component through various techniques, one of them being the use of polyphase FIR (Finite Impulse Response) filters looking across a range of samples. Generally speaking, the wider the filter the better the results; however, you'll always have issues around the edges of the image. There are also more sophisticated solutions that apply FIR filters diagonally as well as temporally (using multiple frames).

You are essentially trying to reconstruct the original image by guessing or calculating the missing samples. By doing so you introduce spatial (and even temporal) frequency domain issues that would not have been present in the case of a fully sampled (3 sensor) capture system. (A sketch of the simplest version of this reconstruction appears a bit further down.)

In a typical transmission chain the reconstructed RGB data is eventually encoded into the YCbCr color space [1]. I think of this as the first step in the perceptual "let's see what we can get away with" encoding process. YCbCr is about what the HVS sees. "Y" is the "luma", or intensity component. "Cb" and "Cr" are color difference samples for blue and red.

However, it doesn't stop there. The next step is to, again, subsample some of it in order to reduce data for encoding, compression, storage and transmission. This is where you get into the concept of chroma subsampling [2] and terminology such as 4:4:4, 4:2:2, etc.

Here, again, we reduce data by throwing away (or at least heavily reducing) color information. It turns out your brain can deal with irregularities in color far more easily than in the luma, or intensity, portion of an image. And so, "4:4:4" means we take every sample of the YCbCr encoded image, while "4:2:2" means we cut Cb and Cr down to half their horizontal resolution.
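To make the demosaicing step above concrete, here is a minimal sketch of the simplest (bilinear) reconstruction of an RGGB Bayer mosaic using tiny FIR kernels. The layout, kernels and edge handling here are my own illustrative assumptions; real pipelines use much wider, edge-aware polyphase filters as described above.

```python
import numpy as np
from scipy.signal import convolve2d

def demosaic_bilinear(mosaic: np.ndarray) -> np.ndarray:
    """Reconstruct full RGB from an RGGB Bayer mosaic with small FIR kernels.

    mosaic: 2-D float array of raw sensor samples. Returns an (H, W, 3) array.
    """
    h, w = mosaic.shape

    # Binary masks describing which photosite carries which color (RGGB layout assumed).
    r_mask = np.zeros((h, w))
    r_mask[0::2, 0::2] = 1.0
    b_mask = np.zeros((h, w))
    b_mask[1::2, 1::2] = 1.0
    g_mask = 1.0 - r_mask - b_mask

    # Classic bilinear interpolation kernels for the green and red/blue lattices.
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 4.0

    def interp(sparse, kernel):
        # Convolving the sparse samples with the kernel fills in the missing sites
        # from their neighbors; the mirrored boundary is already a compromise.
        return convolve2d(sparse, kernel, mode="same", boundary="symm")

    r = interp(mosaic * r_mask, k_rb)
    g = interp(mosaic * g_mask, k_g)
    b = interp(mosaic * b_mask, k_rb)
    return np.dstack([r, g, b])
```

Even in this trivial version you can see the trade-off: every missing sample is a guess computed from its neighbors, and the edges of the image are already being handled by a workaround.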
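Continuing down the chain, here is a small sketch of the YCbCr conversion and 4:2:2 chroma subsampling just described. I am assuming BT.601 luma weights and full-range values in [0, 1]; BT.709/BT.2020 use different constants, and a real chain feeds gamma-encoded R'G'B' into this matrix rather than linear RGB.

```python
import numpy as np

def rgb_to_ycbcr_422(rgb: np.ndarray):
    """Convert an (H, W, 3) RGB image in [0, 1] to Y, Cb, Cr and
    discard half of the chroma horizontally (4:2:2). Assumes even width."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    y  = 0.299 * r + 0.587 * g + 0.114 * b   # luma (BT.601 weights)
    cb = (b - y) / 1.772 + 0.5               # scaled blue difference
    cr = (r - y) / 1.402 + 0.5               # scaled red difference

    # 4:2:2 -- keep every luma sample, average chroma in horizontal pairs.
    # A real encoder would use a proper low-pass filter, not a 2-tap box.
    cb_422 = 0.5 * (cb[:, 0::2] + cb[:, 1::2])
    cr_422 = 0.5 * (cr[:, 0::2] + cr[:, 1::2])
    return y, cb_422, cr_422
```

Note that every luma sample survives while half of the chroma samples are already gone, and nothing we would normally call "compression" has even started yet.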
There's an additional step which encodes the image in a nonlinear fashion, which, again, is a perceptual trick. This introduces the distinction between gamma-encoded "luma" (Y', Y prime) and linear "luminance" (Y). It turns out that your HVS is far more sensitive to minute detail in the low-lights (the darker portions of the image, say, from 50% down to black) than in the highlights. You can have massive errors in the highlights and your HVS just won't see them, particularly if things are blended through wide FIR filters during display. [3]

Throughout this chain of optical and mathematical wrangling you are highly dependent on the accuracy of each step in the process. How much distortion is introduced depends on a range of factors, not the least of which is the way math is done in the software or chips that touch every single sample's data. With so much math in the processing chain you have to be extremely careful not to introduce errors through truncation or rounding. (A small numerical example of this appears further down.)

We then introduce compression algorithms. In the case of motion video they will typically compress a reference frame as a still and then encode the difference with respect to that frame for subsequent frames. They divide an image into blocks of pixels and then spatially process these blocks to develop a dictionary of blocks to store, transmit, etc.

The key technology in compression is the Discrete Cosine Transform (DCT) [4]. This bit of math transforms the image from the spatial domain to the frequency domain. Once again, we are trying to trick the eye: reduce information the HVS might not perceive. We are not as sensitive to fine, high-frequency detail, which means it's relatively safe to remove some of it. That's what the DCT, followed by quantization of its coefficients, is about.
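To make that step concrete, here is a minimal sketch of a JPEG-style transform-and-quantize of a single 8x8 luma block. The quantization table is made up purely for illustration (real codecs use carefully tuned tables, and H.266 is far more sophisticated than this); the point is simply that the coarse quantization steps land on the high-frequency coefficients.

```python
import numpy as np
from scipy.fft import dctn, idctn

# A made-up quantization table: the step size grows with spatial frequency,
# i.e. the detail the HVS is least sensitive to gets quantized hardest.
QUANT = 16.0 + 8.0 * np.add.outer(np.arange(8), np.arange(8))

def compress_block(block: np.ndarray) -> np.ndarray:
    """8x8 spatial-domain luma samples (0..255) -> quantized DCT coefficients."""
    coeffs = dctn(block - 128.0, norm="ortho")   # spatial -> frequency domain
    return np.round(coeffs / QUANT)              # the rounding is where data is lost

def decompress_block(q: np.ndarray) -> np.ndarray:
    """Quantized coefficients -> approximate spatial-domain block."""
    return idctn(q * QUANT, norm="ortho") + 128.0
```

Round-tripping a block through compress_block and decompress_block gives you back something that looks the same to the eye but is not numerically the block you started with; the error sits precisely in the detail the table decided you would not miss.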
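The np.round() above is also a good place to come back to the earlier point about truncation and rounding: even without any DCT or subsampling, simply bouncing samples between representations at 8-bit precision leaves a residue. A small sketch, again assuming the same BT.601-style constants as before:

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))   # a synthetic test image, values in [0, 1]

def to_ycbcr(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return np.dstack([y, (b - y) / 1.772 + 0.5, (r - y) / 1.402 + 0.5])

def to_rgb(ycc):
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 0.5, ycc[..., 2] - 0.5
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return np.dstack([r, g, b])

# Round-trip in floating point vs. rounding to 8-bit integers in between.
float_rt = to_rgb(to_ycbcr(rgb))
eight_bit_rt = to_rgb(np.round(to_ycbcr(rgb) * 255.0) / 255.0)

print("float round-trip max error:", np.abs(float_rt - rgb).max())
print("8-bit round-trip max error:", np.abs(eight_bit_rt - rgb).max())
```

One round trip like this is harmless for viewing; a long chain of such steps, each implemented slightly differently in some chip or library, is where the accumulated damage described above comes from.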
So, we started with a 3 sensor full-sampling camera, reduced it to a single sensor and threw away 75% of the red samples, 50% of the green samples and 75% of the blue samples. We then reconstruct the full RGB data mathematically, perceptually encode it to YCbCr, apply gamma encoding if necessary, apply the DCT to reduce high frequency information based on agreed-upon perceptual thresholds, and then store and transmit the final result. For display on an RGB display we reverse the process. Errors are introduced every step of the way, the hope and objective being to trick the HVS into seeing an acceptable image.

All of this is great for watching a movie or a TikTok video. However, when you work in machine vision or any domain that requires high quality image data, the issues with the processing chain presented above can introduce problems with consequences ranging from the introduction of errors (Was that a truck in front of our self-driving car or something else?) to making it impossible to make valid use of the images (Is that a tumor or healthy tissue?).

While H.266 sounds fantastic for TikTok or Netflix, I fear that the constant effort to find creative ways to trick the HVS might introduce issues in machine vision, machine learning and AI that most in the field will not realize. Unless someone has a reasonable depth of expertise in imaging they might very well assume the technology they are using is perfectly adequate for the task. Imagine developing a training data set consisting of millions of images without understanding that the images have "processing damage" because of the way they were acquired and processed before they even saw their first learning algorithm.

Having worked in this field for quite some time --not many people take a 20x magnifying lens to pixels on a display to see what the processing is doing to the image-- I am concerned about the divergence between HVS trickery, which, again, is fine for TikTok and Netflix, and the needs of MV/ML/AI. A while ago there was a discussion on HN about ML misclassification of people of color. While I haven't looked into this in detail, I am convinced, based on experience, that the numerical HVS trickery I describe above has something to do with this problem. If you train models with distorted data you have to expect errors in classification. As they say, garbage in, garbage out.

Nothing wrong with H.266, it sounds fantastic. However, I think MV/ML/AI practitioners need to be deeply aware of what data they are working with and how it got to their neural network. It is for this reason that we've avoided using off-the-shelf image processing chips to the extent possible. When you use an FPGA to process images with your own processing chain you are in control of what happens to every single pixel's data and, more importantly, you can qualify and quantify any errors that might be introduced in the chain.

[0] https://en.wikipedia.org/wiki/Bayer_filter

[1] https://en.wikipedia.org/wiki/YCbCr

[2] https://en.wikipedia.org/wiki/Chroma_subsampling

[3] https://en.wikipedia.org/wiki/Gamma_correction

[4] https://www.youtube.com/watch?v=P7abyWT4dss