Found some interesting tidbits in their FAQ [0]:<p>"Q: What type of text can Amazon Textract detect and extract?<p>A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."<p>So, English only. <i>Very</i> worryingly, though, they're going to keep your company's documents:<p>"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?<p>A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."<p>"Q. Can I delete images and documents stored by Amazon Textract?<p>A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."<p>That said, I'm still baffled as to what value they're adding. From the name alone, I would have expected it to generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow it across page boundaries and columns. However, the features page states:<p>"All extracted data is returned with bounding box coordinates" [1]<p>...which is how PDF documents lay things out in the first place. Have I missed something?<p>[0] <a href="https://aws.amazon.com/textract/faqs/" rel="nofollow">https://aws.amazon.com/textract/faqs/</a><p>[1] <a href="https://aws.amazon.com/textract/features/" rel="nofollow">https://aws.amazon.com/textract/features/</a>
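<p>To make the reflow point concrete, here is a rough sketch of the work that is still left once you only have words plus bounding boxes. Everything in it is an assumption for illustration: the `Word` record, the normalized 0..1 coordinates, and the `gutter` and `tol` thresholds are hypothetical and are not Textract's response format. It also only handles clean, non-overlapping columns.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record: one detected word with a normalized bounding box.
    text: str
    left: float
    top: float
    width: float
    height: float


def split_columns(words, gutter=0.05):
    """Group words into columns, left to right, by looking for horizontal gaps."""
    columns = []
    right_edge = None
    for w in sorted(words, key=lambda w: w.left):
        if right_edge is None or w.left > right_edge + gutter:
            # Gap wider than the assumed gutter: start a new column.
            columns.append([])
            right_edge = w.left + w.width
        else:
            right_edge = max(right_edge, w.left + w.width)
        columns[-1].append(w)
    return columns


def column_lines(column, tol=0.01):
    """Group one column's words into text lines, top to bottom."""
    lines = []
    for w in sorted(column, key=lambda w: w.top):
        if lines and abs(w.top - lines[-1][0].top) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.left))
            for line in lines]


def reflow(pages):
    """Join all pages into one text stream: page by page, column by column."""
    out = []
    for words in pages:
        for col in split_columns(words):
            out.extend(column_lines(col))
    return " ".join(out)
```

Calling `reflow` on a list of pages, each a list of `Word` records in any order, returns a single string in reading order. Real documents (sidebars, tables, footnotes, hyphenation at line ends) need far more than this, which is why I'd expect a text-extraction product to do it for you.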