Amazon Textract – Extract text and data from virtually any document

229 points by mcrute, over 6 years ago

20 comments

cmroanirgo, over 6 years ago

Found some interesting tidbits in their FAQ [0]:

"Q: What type of text can Amazon Textract detect and extract?
A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."

So, English only. But very worryingly, they're going to keep your company's documents:

"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?
A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

"Q. Can I delete images and documents stored by Amazon Textract?
A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."

That said, I'm still baffled as to what value-add they're providing. For me, from the name alone, it would generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow the text across page boundaries & columns. However, this product states that:

"All extracted data is returned with bounding box coordinates" [1]

...which is how PDF documents lay things out in the first place...

Have I missed something?

[0] https://aws.amazon.com/textract/faqs/
[1] https://aws.amazon.com/textract/features/
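To make the reflow point concrete: bounding boxes alone do not give reading order, so whoever consumes them still has to group words into lines. Below is a minimal Python sketch of that step. The `Word` record and its `left`/`top`/`height` fields are hypothetical stand-ins, not Textract's actual response schema, and the sketch only handles a single column.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float    # hypothetical: x of the box's left edge, as a fraction of page width
    top: float     # hypothetical: y of the box's top edge, as a fraction of page height
    height: float  # hypothetical: box height, as a fraction of page height


def reflow(words, line_tolerance=0.5):
    """Group words into lines by vertical position, then read each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        # A word joins the current line if its top edge is within a fraction
        # of its own height from the top edge of the line's first word.
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    )
```

Multi-column pages would need an extra pass that splits words into columns before grouping them into lines, which is the harder part of the problem described above.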

danso, over 6 years ago

Given how popular the "simple" conversion of regular PDF forms/tables continues to be, even among the technically sophisticated HN audience [0], it would feel like a huge achievement if Amazon can deliver on OCR-to-data. Not as sexy (or creepy) as Rekognition, perhaps, but almost certainly more useful day to day to the many, many professionals who work with documents and legacy data entry systems.

[0] https://hn.algolia.com/?query=pdf%20convert&sort=byPopularity&prefix&page=0&dateRange=all&type=story
- https://news.ycombinator.com/item?id=18199708
- https://news.ycombinator.com/item?id=5487530

raghavtoshniwal, over 6 years ago

This plays so well with the theory of AWS taking a slice of all web activity. They are commoditising more and more complex tasks and enabling a huge number of engineers to bootstrap their ideas with amazing tech from day 1. A huge jump from S3/EC2 to this. Commendable.

Edmond, over 6 years ago

Not sure if this is bad news for the Robotic Process Automation (RPA) sector or an opportunity to offload the "Robotic" part while focusing on business process...

efields, over 6 years ago

Is off-the-shelf open source OCR not reliable for an image of reasonable fidelity, like a smartphone camera picture of a B&W text document?

I ask because it feels like I should have an app that lets me scan with my phone, process the text with OCR, then let me plain-text search every scanned document I have.

The first part only natively made it into iOS Notes a year or two ago, but that whole experience above should be out of the box, IMHO…

hhanshin, over 6 years ago

Found some interesting tidbits in their FAQ [0]:

"Q: What type of text can Amazon Textract detect and extract?
A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."

So, English only. But very worryingly, they're going to keep your company's documents:

"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?
A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

"Q. Can I delete images and documents stored by Amazon Textract?
A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."

BasHamer, over 6 years ago

If this can get me tables out of PDFs generated by Crystal Reports, it would be a godsend for testing. This has been a nightmare to try and solve. The best option so far has been Adobe cloud, but they don't offer an API for that. I'm excited to try it out.

ocrcustomserver, over 6 years ago

Some videos that were just released:

Announcing Amazon Textract, https://www.youtube.com/watch?v=PHX7q4pMGbo

Introducing Amazon Textract: Now in Preview, https://www.youtube.com/watch?v=hagvdqofRU4

Introducing Amazon Textract: Now in Preview (AIM363), https://www.youtube.com/watch?v=FnZFK_2oqKk

gingerlime, over 6 years ago

I have a personal flow using Tesseract to scan docs into searchable PDFs, but it’s not that accurate. One of the main problems is that some (now most?) of the documents are in German since I live in Germany, but some are in English. There’s a way to choose the language, but nothing to auto-detect it as far as I’m aware. I was hoping for some cloud AI service with superior OCR and simple integration or a CLI (push a PDF and download one with OCR embedded). Google seems to be too complicated unfortunately... Any tips??
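One possible workaround for the auto-detect gap, sketched in Python under stated assumptions: run a quick first OCR pass, count common German and English stopwords in the output, then pick the language for the real pass. The `guess_language` helper and its tiny word lists are hypothetical illustrations, not part of Tesseract. The returned strings follow Tesseract's language codes, and the fallback relies on Tesseract accepting several languages joined with `+` (for example `-l deu+eng`).

```python
import re

# Tiny illustrative stopword lists; a real setup would want larger ones.
STOPWORDS = {
    "deu": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein", "eine"},
    "eng": {"the", "and", "is", "not", "with", "for", "a", "an", "of", "to"},
}


def guess_language(text, default="deu+eng"):
    """Return the language code whose stopwords appear most often in text."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # Fall back to both languages when nothing matched or the result is a tie.
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return default
    return max(scores, key=scores.get)
```

The result could then be passed as the language argument of the second OCR pass. Short or noisy first-pass output will often land on the `deu+eng` fallback, which is slower but safe.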

ocrcustomserver, over 6 years ago

This is very interesting. I'm curious to see how they will execute on several points:

1. How it will deal with multiple templates that the system hasn't seen before, especially when there are significant differences between the templates.
2. UI/UX, e.g. how it will trace the extracted data back to the original source and how it will show the confidence score of each entity.
3. Verification process: what the workflow will look like when the confidence score is low and the document has to be checked by human operators.
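Point 3 can be sketched as a simple threshold router, shown here in Python. Everything in it is an assumption for illustration: the `Field` record, the 0-100 confidence scale, and the cutoff of 90 are hypothetical, and the thread does not say how Textract itself reports or acts on confidence.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed to be a score between 0 and 100


def route(fields, threshold=90.0):
    """Split extracted fields into auto-accepted ones and ones needing human review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review
```

In such a flow, only the fields in `needs_review` would be shown to an operator, ideally next to the region of the source image they were read from.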

citilife, over 6 years ago

This looks a lot like what I've seen from companies such as InstaBase [1]. Given how hard it is to do well (largely due to poor initial images), I'm curious how Amazon's product offering will work.

A team I'm working with had a lot of success doing this, so I'm curious what method(s) they are using.

[1] https://en.wikipedia.org/wiki/Instabase

sbarre, over 6 years ago

So this is Apache Tika as a Service?

https://tika.apache.org/

amelius, over 6 years ago

Can't use this because my clients/contract don't allow sending of documents to third parties.

ironfootnz, over 6 years ago

"Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

I still prefer the Dropbox solution for that, but I'm waiting for them to turn it into an API.

jgalt212, over 6 years ago

I have been following this service from afar, as the founder is quite skilled. It seems a bit pricey, but it does something similar.

https://www.pdfdata.io/

blacksmith_tb, over 6 years ago

I wonder if they have any detection of captchas, or if they'd let people just submit screengrabs containing them as 'documents' to be processed...

foxhound6, over 6 years ago

Any idea if this can support handwriting even with a reduced confidence? Support for non-English languages?

hbcondo714, over 6 years ago

Arg, you have to type in all your information even if you are logged into the AWS console

dvtrn, over 6 years ago

The FOIA geek in me is....well...geeking out over this. Slightly.

jijji, over 6 years ago

This is genius...1. make "strings" api 2. hook it to a web server 3. profit!
