The key here is really the complementary use of ‘what humans are good at’ and ‘what machines are good at’.<p>In this case, it’s fair to say the machine, analyzing pixels alone, can’t figure out perspective very well. The human can do that just fine, given a suitable interface mechanism.<p>The machine, meanwhile, is good at detecting edges and measuring similarity between pixels. Given hints from the human (‘this point is within the object’, ‘here is the perspective’), the machine can infer the extent of the object from edges and colors and project it into three dimensions. Amazing.
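To make the ‘grow outward from a human hint’ idea concrete, here is a minimal sketch of seed-based region growing: a human supplies one point inside the object, and the machine expands to neighboring pixels whose values are similar to the seed. The image, seed coordinates, and tolerance are invented for illustration; a real system would use richer features than raw intensity, and this says nothing about the 3D-projection step.

```python
from collections import deque

def grow_region(image, seed, tol=30):
    """Expand from a human-supplied seed pixel to 4-connected
    neighbors whose intensity is within `tol` of the seed's."""
    h, w = len(image), len(image[0])
    sr, sc = seed
    seed_val = image[sr][sc]
    visited = {(sr, sc)}
    frontier = deque([(sr, sc)])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in visited:
                if abs(image[nr][nc] - seed_val) <= tol:
                    visited.add((nr, nc))
                    frontier.append((nr, nc))
    return visited

# Tiny synthetic "image": a bright 2x2 object on a dark background.
img = [
    [10, 10, 10, 10],
    [10, 200, 210, 10],
    [10, 205, 198, 10],
    [10, 10, 10, 10],
]
# One click inside the object is enough; edges (sharp intensity
# jumps) stop the growth automatically.
region = grow_region(img, seed=(1, 1))
# region covers exactly the four bright pixels
```

The human's single click replaces what would otherwise be an ambiguous global segmentation problem: the machine only has to answer ‘how far does *this* object extend?’, which edge and color cues handle well.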