The Iranian classified-ads platform “Divar” recently introduced a feature that automatically removes all furniture from a photo to make posting real estate ads easier.<p>I am interested in the approach behind this feature. My suspicion is a diffusion-based inpainting pipeline: a U-Net segments the furniture, and the resulting mask restricts the edit to only the furniture regions.<p>Here’s the first official example: https://twitter.com/Divar_Official/status/1815343371424026864<p>and here’s the second example: https://imgur.com/a/QzL7gMm<p>As you can see in the first example, the bedroom is dense (check typical Iranian interior design on Google), which led me to believe that they have been collecting a novel dataset of Iranian houses with and without furniture.<p>Quote from the news website Zoomit:
> The "Furniture Remover" tool of the "Retouch" service uses AI to remove all the furniture seen in the photo in a few minutes and deliver a photo of an empty house or office without the hassle of moving things!<p>Here they talk about it taking a “few minutes”. Now I don't know whether the few minutes are spent on processing or if it’s an estimated queue time.
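Setting the timing question aside for a moment, the segmentation-plus-masking step I'm guessing at could look roughly like the sketch below. Everything here is an assumption: the class IDs are illustrative (loosely in the style of ADE20K labels), not Divar's actual label set, and `furniture_mask` is a name I made up.

```python
import numpy as np

# Hypothetical furniture class IDs -- an assumption for illustration,
# not Divar's real label set.
FURNITURE_CLASSES = {7, 10, 15, 19}  # e.g. bed, sofa, table, chair

def furniture_mask(seg_map: np.ndarray, dilate_px: int = 4) -> np.ndarray:
    """Turn a per-pixel class map from a segmentation model into a
    binary inpainting mask covering the furniture pixels.

    The mask is dilated by a few pixels so the inpainting model also
    repaints shadows and contact edges around each object.
    """
    mask = np.isin(seg_map, list(FURNITURE_CLASSES))
    # Naive square dilation via shifted ORs; note np.roll wraps at the
    # image border, which is fine for a sketch (cv2.dilate in practice).
    out = mask.copy()
    for dy in range(-dilate_px, dilate_px + 1):
        for dx in range(-dilate_px, dilate_px + 1):
            out |= np.roll(mask, (dy, dx), axis=(0, 1))
    return out
```

The dilation matters for the dense rooms in the examples: without it, the model would leave behind the shadow halos the furniture casts.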
I assumed they were talking about the processing time, which originally led me to suspect diffusion rather than some other method like a GAN. But there's a good chance I'm wrong here.<p>As seen in the first example, the output resolution is quite low and needs a lot of post-processing to become usable. The second example improves on resolution, but note the scene complexity too. This further strengthened the case for diffusion. Then again, can't GANs also become unstable at high resolutions?<p>Zoomit continues with:
> This tool does not add anything to [real estate] photos to preserve the photo's originality. It only uses the components in the image to expand and reconstruct the image.<p>This weakens my guess: while diffusion models excel at text-to-image synthesis, they aren't guaranteed to avoid introducing extraneous artifacts in an img2img setting. Then again, if generation is confined to a mask, that shouldn't be a problem.<p>What are your thoughts on this?
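To make that masking assumption concrete: if the final image is a masked composite of model output over the original photo, every pixel outside the mask is byte-identical to the upload, so the generator can't introduce artifacts there no matter what it hallucinates inside the masked region. A minimal sketch (function and argument names are mine, not Divar's):

```python
import numpy as np

def masked_composite(original: np.ndarray,
                     generated: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Keep model output only inside the binary mask; pixels outside
    it are copied untouched from the original photo -- which is what
    would preserve the 'originality' the Zoomit quote describes."""
    m = mask[..., None].astype(original.dtype)  # broadcast over RGB channels
    return generated * m + original * (1 - m)
```

This is the standard final step of mask-based inpainting pipelines generally, not something confirmed about Divar's implementation.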