Ask HN: How to OCR a PDF and preserve whitespace?

26 points by GirkovArpa 12 months ago

I have some rather large PDFs that need to be transcribed, but every service I try has some minor but deal-breaking flaw. Either they don't support PDFs this large (hundreds of pages), are just really bad at English OCR, or, most commonly, don't preserve whitespace correctly. The number one problem is whitespace when it comes to multiple columns (similar to newspapers): either not putting any spaces between words or, when there are multiple columns of text, putting rows in the wrong order. If it were just a single page, this would still be useful, since I could fix it myself. But I have over 1000 pages. I tried so many free services and trials that I just got charged for forgetting to cancel one (thanks to smallpdf.com for refunding my $12). Is OCR technology just not there yet when it comes to multiple-column pages? Yet this does not seem to be an issue with newspapers.com, based on my experience using their text search feature. I would like to know what OCR software they are using.

13 comments

eigenvalue 12 months ago

I’ve found that the built-in OCR in the iPhone is just way better and more accurate than everything else out there. I’m talking about how, if you have an image in your camera roll on the iPhone, you can select the text and copy it out. I had the idea to simply expose that service better so it could be applied to PDFs, and made my first ever iPhone app that does this. It can easily handle hundreds of pages too. The only problem is that it’s an iPhone app, so it can’t easily be used programmatically, but if you’re only trying to convert a reasonable number of documents it’s worth a try; the accuracy is really good: https://apps.apple.com/us/app/super-pdf-ocr/id6479674248

sargstuff 12 months ago

Not sure if a multi-step approach is OK, but: convert the PDF to an image format such as PNG, use AI to recognize 'tabular blocks', then convert the PDF to a 'text format' with the tabular blocks as embedded images to preserve spacing.
https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox
https://excalibur-py.readthedocs.io/en/master/
https://ledgerbox.io/blog/extract-tables-with-tesseract-ocr
https://www.johnsnowlabs.com/extract-tabular-data-from-pdf-in-spark-ocr/
A bit more in-depth review: https://dev.to/upsilon_it/how-to-extract-tabular-data-from-pdf-part-1-i3

tacostakohashi 12 months ago

Did you try Textract? https://aws.amazon.com/textract/ In my experience it works amazingly well with columns / tabulated content.

bdowling 12 months ago

Many of the free or cheap OCR services are based on the free, open-source Tesseract OCR: https://github.com/tesseract-ocr/tesseract/ Those services usually do not expose all of the options. If you’re handy with shell scripts or Python, you can probably get better performance by hand-tuning options for your particular images. For example, if I recall correctly, there are page segmentation options to tell Tesseract to expect multi-column text. That alone might get you better performance than the automatic mode.
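A minimal sketch of that kind of tuning, assuming the `tesseract` command-line binary is installed and each page has already been rendered to an image. `--psm` selects the page segmentation mode and `preserve_interword_spaces` is a Tesseract config variable; which mode works best on a given scan is something to test, not a given. The helper names and file names here are hypothetical.

```python
import subprocess

# Page segmentation modes relevant to column layouts:
#   1 = automatic page segmentation with orientation/script detection
#   3 = fully automatic page segmentation (the default)
#   4 = assume a single column of text of variable sizes
#   6 = assume a single uniform block of text

def build_tesseract_cmd(image_path, out_base, psm=3, lang="eng", keep_spaces=True):
    """Build a Tesseract CLI command with an explicit page segmentation mode."""
    cmd = ["tesseract", image_path, out_base, "-l", lang, "--psm", str(psm)]
    if keep_spaces:
        # Ask Tesseract to keep runs of spaces between words in the output.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd

def run_ocr(image_path, out_base, **options):
    """Run Tesseract; the recognized text is written to <out_base>.txt."""
    subprocess.run(build_tesseract_cmd(image_path, out_base, **options), check=True)
```

Calling `run_ocr("page-001.png", "page-001", psm=4)` on a pre-cropped column, and comparing against `psm=1` or `psm=3` on the full page, is a cheap way to see which setting orders the text correctly.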

constantinum 12 months ago

Do give LLMWhisperer[1] a try. It does a good job preserving the layout for the most part, but one cannot escape PDF hell. Try the LLMWhisperer Playground[2] with your documents; there is no need for any setup. Extracting multi-column layout example: https://imgur.com/roYmv0I
[1] https://llmwhisperer.unstract.com/
[2] https://pg.llmwhisperer.unstract.com/

cpach 12 months ago

Can you use ImageMagick to split the columns into single files…? E.g. if the columns are the same on every page, you can feed the coordinates to ImageMagick. Then do OCR on each of those files.
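A rough sketch of the coordinate step, assuming equal-width columns at the same position on every page and ImageMagick 7's `magick` binary (version 6 uses `convert` instead). The `-crop WxH+X+Y` geometry and `+repage` are standard ImageMagick options; the function names, gutter value, and output file names are made up for illustration.

```python
def column_crop_geometries(page_width, page_height, columns, gutter=0):
    """Return one ImageMagick crop geometry (WxH+X+Y) per equal-width column."""
    col_width = (page_width - gutter * (columns - 1)) // columns
    return [
        f"{col_width}x{page_height}+{i * (col_width + gutter)}+0"
        for i in range(columns)
    ]

def crop_commands(image_path, geometries):
    """Build one `magick` command per column, left to right."""
    stem = image_path.rsplit(".", 1)[0]
    return [
        # +repage resets the canvas offset so each crop is a standalone image.
        ["magick", image_path, "-crop", geom, "+repage", f"{stem}-col{n}.png"]
        for n, geom in enumerate(geometries, start=1)
    ]
```

Each command list can be passed to `subprocess.run`, and the resulting column images OCRed in order so the text comes out in reading order.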

cyanydeez 12 months ago

Tabula works for tables. But if you think about spacing and font widths, you'll realize it's nontrivial. Fonts often come in variable sizes, and table columns are often left-, right-, or center-aligned. A more general tool I've used is PaddleOCR.
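To make the difficulty concrete, here is a toy sketch of rebuilding whitespace from word positions. It assumes the OCR engine reports a top-left pixel coordinate per word and, crucially, that a single average character width and line height fit the whole page. That assumption is exactly what breaks down with mixed font sizes and right- or center-aligned cells. The function name and input format are hypothetical.

```python
def layout_text(words, char_width, line_height):
    """Place (text, x, y) words on a fixed-width grid to mimic the page layout.

    x and y are top-left pixel coordinates; char_width and line_height are
    assumed page-wide averages, which real documents often violate.
    """
    rows = {}
    for text, x, y in words:
        row = round(y / line_height)
        col = round(x / char_width)
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows, default=-1) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # Pad out to the word's column, keeping at least one space between words.
            gap = max(col - len(line), 1 if line else 0)
            line += " " * gap + text
        lines.append(line)
    return "\n".join(lines)
```

With a monospaced source this lines columns up reasonably well; with proportional fonts the computed column for a word drifts, which is why dedicated layout-aware tools exist.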

sandreas 12 months ago

Your use case seems very specific. I personally am very happy with ocrmypdf[1], which is free and puts an invisible text layer into the PDF. However, since it is free, I'm pretty sure it cannot compete with the commercial solutions you tried. There is also an older version of PDF XChange Viewer that can do the same thing, although it is presented as a "viewer".
[1] https://github.com/ocrmypdf/OCRmyPDF/
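For reference, a small sketch of scripting it, assuming the `ocrmypdf` command is installed. The `--sidecar` option additionally writes the recognized text to a plain-text file next to the searchable PDF. The helper name and file names are hypothetical.

```python
import subprocess

def build_ocrmypdf_cmd(input_pdf, output_pdf, sidecar_txt=None, lang="eng"):
    """Build an ocrmypdf command; optionally also dump the text to a sidecar file."""
    cmd = ["ocrmypdf", "-l", lang]
    if sidecar_txt:
        cmd += ["--sidecar", sidecar_txt]
    # Positional arguments: input PDF first, then the searchable output PDF.
    cmd += [input_pdf, output_pdf]
    return cmd

def run_ocrmypdf(input_pdf, output_pdf, **options):
    """Run ocrmypdf and raise if it exits with an error."""
    subprocess.run(build_ocrmypdf_cmd(input_pdf, output_pdf, **options), check=True)
```

For example, `run_ocrmypdf("scan.pdf", "scan-ocr.pdf", sidecar_txt="scan.txt")` would produce both the layered PDF and a text file to inspect for column-order problems.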

brianjking 12 months ago

LLMWhisperer from Zipstack at https://llmwhisperer.unstract.com/ or https://github.com/VikParuchuri/surya will do a good job for you. LLMWhisperer has some nice tooling: it can fall back to OCR or force text extraction, so it handles both scanned documents and documents that have the text preserved as text.

shevis 12 months ago

Google’s OCR service is reliably accurate and can handle large documents (up to 20MB): https://cloud.google.com/vision/docs/ocr

4oo4 12 months ago

I've had really good results with Python's pdfminer.six; it's also very easy to use. https://pypi.org/project/pdfminer.six/

halotrope 12 months ago

we have an internal API at markets.sh for ingesting large financial reports and newspaper data. it solves this exact problem. i'll be happy to give you test access. drop me an email at max at markets.sh

rthnbgrredf 12 months ago

You could try this one here: https://github.com/madnight/pdf-layout-text-stripper