Several frontier LLMs were evaluated on their ability to interpret handwritten proofreading marks in printed literary text, using a small benchmark based on Charles Dickens's "Little Dorrit". Results are modest at best and surprisingly variable across repeated runs, even on the same pages, which underscores the challenge of building reliable structured-document systems with current multimodal LLMs.<p>Curious to hear thoughts from others working on similar problems.