Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project that involved OCRing tens of thousands of spreadsheet screenshots and ingesting them into a Postgres database. I tried other CPU-based OCR methods (since macOS and Nvidia still don't play nice together) such as Tesseract, but found the output to be incorrect too often. The Vision framework not only produced the highest-quality output I had seen, it also used the least compute. It was fairly unstable, but I can chalk that up to user error w/ my implementation.<p>I used a combination of RhetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700K hackintosh.<p>I wouldn't call myself a programmer, but I can generally troubleshoot anything given enough time, though it did cost time.<p>[1]: <a href="https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac587fc4c" rel="nofollow">https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac5...</a><p>[2]: <a href="https://github.com/straussmaximilian/ocrmac">https://github.com/straussmaximilian/ocrmac</a>
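For anyone curious what the ocrmac side looks like, here's a minimal sketch (macOS only; I'm going from memory on ocrmac's API, so treat the call signature as an assumption):

```python
def results_to_text(results):
    """Join ocrmac results, which are (text, confidence, bounding_box) tuples,
    into one plain-text blob, one recognized line per row."""
    return "\n".join(text for text, _confidence, _bbox in results)

def ocr_image(path):
    """OCR a single image with Apple's Vision framework via ocrmac (macOS only)."""
    from ocrmac import ocrmac  # pip install ocrmac
    results = ocrmac.OCR(path, recognition_level="accurate").recognize()
    return results_to_text(results)
```

From there it's a short hop to one INSERT per screenshot into Postgres.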
I tried doing something similar on Windows, and realized that PowerToys[1], a Microsoft project I already had installed, actually contains a very good OCR tool[2]. Just press Win+Shift+T and select the area to scan, and the text will be copied to the clipboard.<p>[1] <a href="https://learn.microsoft.com/en-us/windows/powertoys/" rel="nofollow">https://learn.microsoft.com/en-us/windows/powertoys/</a><p>[2] <a href="https://learn.microsoft.com/en-us/windows/powertoys/text-extractor" rel="nofollow">https://learn.microsoft.com/en-us/windows/powertoys/text-ext...</a>
I've built an open-source tool that gives you both a CLI and a nice UI. It's free.<p><a href="https://trex.ameba.co" rel="nofollow">https://trex.ameba.co</a>
I did notice that many Mac apps, including Safari, Preview, and Notes, do OCR on images automatically. It's pretty neat that I can easily select text in an image and copy and paste it somewhere else.
I'm a huge fan of this little OCR tool, installed through brew onto my MacBook: <a href="https://github.com/schappim/macOCR">https://github.com/schappim/macOCR</a>
On Windows I recommend Text Extractor from PowerToys:<p><a href="https://learn.microsoft.com/en-us/windows/powertoys/text-extractor" rel="nofollow">https://learn.microsoft.com/en-us/windows/powertoys/text-ext...</a>
I'll throw my solution into the mix: <a href="https://skaplanofficial.github.io/PyXA/tutorial/images.html#text-extraction" rel="nofollow">https://skaplanofficial.github.io/PyXA/tutorial/images.html#...</a><p>PyXA uses the Vision framework to extract text from one or more images at a time. It's only a small part of the package, so it might be overkill for a one-off operation, but it's an option.
macOS Ventura and newer actually have basic OCR functionality integrated into the Image Capture UI. When using an AirPrint-compatible scanner and scanning to PDF, the checkbox "OCR" is shown in the right pane.
To place the contents in a file (not claiming this is the most efficient way, but it works):<p>OCRTHISFILE="ocr-test.jpg"<p>shortcuts run ocr-text -i "${OCRTHISFILE}"<p>pbpaste > "${OCRTHISFILE}.txt"<p>Or, to view the output and place it in a file:<p>OCRTHISFILE="ocr-test.jpg"<p>shortcuts run ocr-text -i "${OCRTHISFILE}"<p>pbpaste | tee "${OCRTHISFILE}.txt"
Awesome! Is there a similar technique for the Apple vision ‘<i>Copy Subject</i>’ feature? I’ve become extremely reliant on it, but it feels very limited in access.
Very cool, and seems handy!<p>I’ve always had good results from Preview.app. I wonder how this engine compares, in number of errors on a difficult source, versus free alternatives.
Speaking of the need for OCR, I found a comment that is relevant and quite funny:<p>> we already have a common, portable data format for social media. It's screenshots of tweets<p><a href="https://news.ycombinator.com/item?id=38841569">https://news.ycombinator.com/item?id=38841569</a>
I would really love an `ocrmypdf`-like tool that uses Apple Vision to create searchable PDFs from scanned images. I've been searching every week or so for some kind of project, but so far haven't found anything. Perhaps it's time to make it myself...
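If someone does build it, the fiddly part is mapping Vision's normalized bounding boxes (origin at the bottom-left) onto PDF page coordinates before stamping invisible text onto each page (e.g. with PyMuPDF's invisible-text render mode). A sketch of just the coordinate math (my own helper, not from any existing project):

```python
def vision_box_to_pdf(bbox, page_width, page_height):
    """Convert a Vision-style normalized bounding box (x, y, w, h) with a
    bottom-left origin into (left, top, width, height) in PDF points with
    a top-left origin."""
    x, y, w, h = bbox
    left = x * page_width
    top = (1.0 - y - h) * page_height  # flip the vertical axis
    return (left, top, w * page_width, h * page_height)
```

With boxes in page coordinates, each recognized string can be written invisibly at its box position so the text layer lines up with the scan underneath.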
I have played around with the OCR on my Mac and have been very impressed. It has been consistently better than Tesseract for my purposes.<p>However, when creating a PDF from images in Preview and exporting with the ‘Embed Text’ option to OCR it, I have noticed the text is worse than if you OCR the exact same images using the shortcut above or a script. Presumably Preview is using the Vision framework’s less accurate fast path when preparing the PDF. <a href="https://developer.apple.com/documentation/vision/recognizing_text_in_images" rel="nofollow">https://developer.apple.com/documentation/vision/recognizing...</a>
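If anyone wants to test the fast-vs-accurate theory, ocrmac exposes the same recognition-level knob, so you can diff the two passes on one image (macOS only; the `recognition_level` parameter is my recollection of ocrmac's API, so treat it as an assumption):

```python
def ocr_lines(path, level):
    """OCR an image at the given Vision recognition level ("fast" or "accurate")."""
    from ocrmac import ocrmac  # pip install ocrmac; macOS only
    return [text for text, _confidence, _bbox in
            ocrmac.OCR(path, recognition_level=level).recognize()]

def level_diff(fast_lines, accurate_lines):
    """Return (lines only the accurate pass found, lines only the fast pass found)."""
    fast, accurate = set(fast_lines), set(accurate_lines)
    return sorted(accurate - fast), sorted(fast - accurate)
```

Running both levels over a tricky scan and comparing the diffs should show whether Preview's embedded text matches the fast pass.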
Weird, I couldn't get it to work on a bunch of different files, even using very simple file names. Kept getting this error:<p>Error: The operation couldn’t be completed. (WFBackgroundShortcutRunnerErrorDomain error 1.)
The article was posted yesterday, and the <i>entire</i> reason given for not using the built-in Shortcuts sharing feature is an article from two years ago about a bug in the shortcut-hosting service, which has obviously been fixed since.<p>I get that some people will want to create it from scratch themselves or incorporate the actual meat of it into a larger shortcut... but not sharing one that does what the article says, because of a bug from two years ago, is a bit of a weird take.
If you want to do this much more easily, use: <a href="https://github.com/schappim/macOCR">https://github.com/schappim/macOCR</a>
I don't know why, but instead of pasting the copied text to check that it worked, I made it read the text aloud:<p>shortcuts run ocr-text -i <A PATH TO SOME IMAGE> | say -v Fred
Raycast (macOS only) is also nice as it's able to search images by text. It also allows you to copy text from those images. Quick official demo here: <a href="https://www.youtube.com/watch?v=c96IXGOo6E4" rel="nofollow">https://www.youtube.com/watch?v=c96IXGOo6E4</a>
How do you interact with the built-in OCR via the CLI? To me, "doing" something means choosing the OCR tooling, knowing what fonts it recognises, and handling all the associated package management and tuning, not "how I configure the GUI and UI to let me use the tool they shipped with the OS".
Use LLMs (GPT-4 Vision or LLaVA) with aichat:<p>`aichat -f tmp/test.png -- output only text in the image`<p><a href="https://github.com/sigoden/aichat">https://github.com/sigoden/aichat</a>
The way I do this: it's built right into the macOS screenshot tool:<p>- Press Cmd+Shift+4<p>- Draw a rectangle around the area you want to extract text from<p>- (Quickly) click the preview image in the lower-right corner<p>- Copy the text from the image
I made a Shortcut + PHP script that gets the text from a screenshot, asks ChatGPT to generate a task name from the text, and creates a new task in ClickUp with the screenshot attached. I use it often.
Are iOS and macOS Shortcuts cross-compatible? I didn't know Shortcuts existed for the Mac; it seems pretty powerful to be able to run them from the terminal too. Thanks, OP!
On Windows, A9T9 does a great job of OCR'ing scanned JPEG files (and any JPEG file). It's also free.<p>I scanned about 100 A4 documents in just a couple of minutes.
Surprisingly, the Extract Text from Image action is available on Intel Macs: normally, features like automatic image OCR are limited to Apple Silicon Macs.