
Mistral OCR

1756 points by littlemerman 2 months ago

120 comments

vikp 2 months ago

I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

Across 375 samples with an LLM as judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.

You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison_mistral_llm .

The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmarks . Will run a full benchmark soon.

Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.
bambax 2 months ago

It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:

https://i.imgur.com/jcwW5AG.jpeg

For the blocks in the center, it outputs:

> *Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607, 3 mai 1693 ; ép. 1°, le 26 septembre 1644, Diane-Henriette de Budos de Portes, morte le 2 décembre 1670; 2°, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.*

This is perfect! But then the next one:

> *Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane-Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.*

This is really bad because:

1/ a portion of the text of the previous block is repeated;

2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), as is part of the rightmost block ("Chastelet");

3/ but worst of all, a whole word is invented, "mortilhomme", that appears nowhere in the original. (That word doesn't exist in French, so in this case it would be easy to spot; the real risk is when invented words do exist and "feel right" in context.)

(The correct text for the second block should be:

> *Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.*)
owenpalmer 2 months ago

This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.

Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM-generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.

Lots of potential here.

[0] https://docs.withorbit.com/
raunakchowdhuri 2 months ago

We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini

A high-level summary: while this is an impressive model, it underperforms even current SOTA VLMs on document parsing, with a tendency to hallucinate OCR text and table structure and to drop content.
Asraelite 2 months ago

I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.
kbyatnal 2 months ago

We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

However IMO, there's still a large gap for businesses in going from raw OCR outputs -> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.app/)
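For illustration, here is a toy sketch of that classify -> split -> extract orchestration with a confidence-gated human-review step. Every function, field, and the 0.9 threshold are invented placeholders for the shape of such a pipeline, not anything from extend.app:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    fields: dict
    confidence: float

# Stubs standing in for real classifiers, splitters, and OCR/VLM extractors.
def classify(pages): return "invoice"
def split(pages, kind): return [pages]  # one logical sub-document per chunk
def extract(chunk, kind): return Extraction({"total": "$1,750.00"}, 0.82)
def human_review(ext): return Extraction(ext.fields, 1.0)  # human-in-the-loop

def process(pages):
    kind = classify(pages)
    results = []
    for chunk in split(pages, kind):
        ext = extract(chunk, kind)
        if ext.confidence < 0.9:  # uncertainty detection gates human review
            ext = human_review(ext)
        results.append(ext)
    return results

print(process(["page-1", "page-2"]))
```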
mvac 2 months ago

Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those from MinerU/PDF-Extract-Kit [1].

Also, the Colab link in the article is broken; I found a functional one [2] in the docs.

[1] https://github.com/opendatalab/MinerU
[2] https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/ocr/structured_ocr.ipynb#scrollTo=svaJGBFlqm7_
shekhargulati 2 months ago

Mistral OCR made multiple mistakes in extracting this [1] document. It is a two-page-long PDF in Arabic from the Saudi Central Bank. The following errors were observed:

- Referenced Vision 2030 as Vision 2.0.
- Failed to extract the table; instead, it hallucinated and extracted the text in a different format.
- Failed to extract the number and date of the circular.

I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].

[1] https://rulebook.sama.gov.sa/sites/default/files/en_net_file_store/SAMA_EN_10395_VER1.pdf
[2] https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-good-at-pdf-processing/
vessenes 2 months ago

Dang. Super fast and significantly more accurate than Google, Claude, and others.

Pricing: $1/1000 pages, or per 2k pages if "batched". I'm not sure what batching means in this case: multiple PDFs? Why not split them to halve the cost?

Anyway, this looks great at PDF-to-markdown.
serjester 2 months ago

This is cool! That said, for anyone looking to use this in RAG, the downside of specialized models instead of general VLMs is that you can't easily tune them to your specific use case. For example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2-3x the cost of Gemini Flash - hopefully the increased performance is significant.

Regardless, excited to see more and more competition in the space.

Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips
sbarre 2 months ago

6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.

I wasn't there to solve that specific problem, but it was connected to what we were doing, so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale, as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data).

I have to imagine this is a problem shared by so many companies.
opwieurposiu 2 months ago

Related: does anyone know of an app that can read gauges from an image and log the number to Influx? I have a solar power meter in my crawlspace; it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:

https://www.pvh2o.com/solarShed/firstPower.jpg
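Not exactly an app, but the pieces exist to wire one up. A rough sketch with OpenCV, Tesseract, and the InfluxDB 2.x client; the camera index, server address, token, and bucket are placeholders, and seven-segment displays often need a dedicated Tesseract traineddata rather than the default "eng":

```python
# Sketch: read a digital gauge with OpenCV + Tesseract, log watts to InfluxDB.
import time
import cv2
import pytesseract
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="home")
write = client.write_api(write_options=SYNCHRONOUS)
cam = cv2.VideoCapture(0)

while True:
    ok, frame = cam.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # High-contrast binarization helps OCR on backlit LCD digits.
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(
        bw, config="--psm 7 -c tessedit_char_whitelist=0123456789."
    )
    try:
        watts = float(text.strip())
        write.write(bucket="solar", record=Point("gauge").field("watts", watts))
    except ValueError:
        pass  # unreadable frame; skip this sample
    time.sleep(60)
```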
evmar 2 months ago

I noticed on the Arabic example they lost a space after the first letter on the third-to-last line; can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)

Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.

Edit 2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv
lysace 2 months ago

Nit: please change the URL from

https://mistral.ai/fr/news/mistral-ocr

to

https://mistral.ai/news/mistral-ocr

The article is the same, but the site navigation is in English instead of French.

Unless it's a silent statement, of course. =)
porphyra 2 months ago

I uploaded a picture of my Chinese mouthwash [0] and it made a ton of mistakes and hallucinated a lot. Very disappointing. For example, it says the usage instruction is to use 80 mL each time, even though the actual instruction on the bottle says to use 5-20 mL each time, three times a day, and gargle for 1 minute.

[0] https://i.imgur.com/JiX9joY.jpeg
[1] https://chat.mistral.ai/chat/8df2c9b9-ee72-414b-81c3-843ce74e1965
ChemSpider 2 months ago

"World's best OCR model" - that is quite a statement. Are there any well-known benchmarks for OCR software?
neom 2 months ago

I gave it a bunch of my wife's 18th-century English scans to transcribe; it mostly couldn't do them, and it's been doing this for 15 minutes now. Not sure why, but I find it quite amusing: https://share.zight.com/L1u2jZYl
blackeyeblitzar 2 months ago

A similar but different product that was discussed on HN is OlmOCR from AI2, which is open source:

https://news.ycombinator.com/item?id=43174298
SilentM68 2 months ago

I would like to see how it performs with massively warped and skewed scanned text images: basically a scanned image where the text lines are wavy as opposed to straight and horizontal, where the letters are elongated, and where the line widths differ depending on the position in the scanned image. I once had to deal with such a task that somebody gave me; OCR software, Acrobat, and other tools could not decode the mess, so I had to recreate the 30 pages myself, manually. Not a fun thing to do, but that is a real use case.
janalsncm 2 months ago

The hard ones are things like contracts, leases, and financial documents, which 1) don't have a common format, 2) are filled with numbers, proper nouns, and addresses which it's *really* important not to mess up, and 3) cannot be inferred from context.

A typical OCR pipeline would pass the doc through a character-level OCR system, then correct errors with a statistical model like an LLM. An LLM can help correct "crodit card" to "credit card", but it cannot correct names or numbers. It's really bad if it replaces a 7 with a 2.
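One cheap guardrail against that failure mode is to reject any LLM "correction" that changes the digit sequence. A minimal sketch; the helper name is made up for illustration, not part of any pipeline mentioned here:

```python
import re

def digits_preserved(raw_ocr: str, corrected: str) -> bool:
    """True if the LLM cleanup left every digit, in order, untouched."""
    return re.findall(r"\d", raw_ocr) == re.findall(r"\d", corrected)

assert digits_preserved("crodit card ending 4417", "credit card ending 4417")
assert not digits_preserved("total $1,750.00", "total $1,250.00")
```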
raffraffraff 2 months ago

Forgive my absolute ignorance, I should probably run this through a chat bot before posting... so I'm updating my post with answers now!

Q: Do LLMs specialise in "document level" recognition based on headings, paragraphs, columns, tables etc.? I.e., ignore words and characters for now and attempt to recognise a known document format.

A: Not most LLMs, but those with multimodal/vision capability could (e.g. DeepSeek Vision, ChatGPT-4). There are specialized models for this work, like Tesseract and LayoutLM.

Q: How did OCR work "back in the day" before we had these LLMs? Are any of those methods useful now?

A: They used pattern recognition and feature extraction, rules and templates. Newer ML-based OCR used SVMs to isolate individual characters and HMMs to predict the next character or word. Today's multimodal models process images and words, can handle context better than the older methods, and can recognise whole words or phrases instead of having to read each character perfectly. This is why they can produce better results, but with hallucinations.

Q: Can LLMs rate their own confidence in each section, maybe outputting text with annotations that say "only 10% certain of this word", and pass the surrounding block through more filters, different LLMs, different methods to try to improve that confidence?

A: Short answer, "no". But you can try to estimate it with post-processing.

Or am I super naive, and all of those methods are already used by the big commercial OCR services like Textract etc.?
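On the confidence question: classical engines do expose per-word scores even though LLM APIs generally don't. A rough sketch with pytesseract; the 60% threshold and file name are arbitrary:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# image_to_data returns per-word boxes with a 'conf' score (-1 = not a word).
data = pytesseract.image_to_data(Image.open("scan.png"), output_type=Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)
    if word.strip() and 0 <= conf < 60:
        print(f"low confidence ({conf:.0f}%): {word!r}")  # candidate for a second pass
```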
sireat 2 months ago

Intriguing announcement; however, the examples on the mistral.ai page seem rather "easy".

What about rare glyphs in different languages, using handwriting from previous centuries?

I've been dealing with OCR issues and evaluating different approaches for the past 5+ years at the national library where I work.

The usual consensus is that the widely used open-source Tesseract is subpar compared to commercial models. That might be so without fine-tuning. However, one can perform supplemental training and build your own Tesseract models that can outperform the base ones.

Case study of Kant's letters from the 18th century: about 6 months ago, I tested OpenAI's approach to OCR on some old 18th-century letters that needed digitizing. The results were rather good (90+% accuracy), with the usual hallucination here and there.

What was funny was that OpenAI was using base Tesseract to generate the segmenting and initial OCR. The actual OCRed content before the last inference step was rather horrid, because the Tesseract model OpenAI was using was not appropriate for the particular image. When I took OpenAI off the first step and moved to my own Tesseract models, I gained significantly in "raw" OCR accuracy at the character level. Then I performed normal LLM inference at the last step.

What was a bit shocking: my actual gains for the task (humanly readable text for general use) were not particularly significant. That is, LLMs are fantastic at "untangling" a complete mess of tokens into something humanly readable.

For example: P!3goattie -> prerogative (given that the surrounding text is similarly garbled).
cxie 2 months ago

The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.

Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.

Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.

I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.
notepad0x90 2 months ago

I was just watching a science-related video containing math equations. I wondered how soon I will be able to ask the video player "What am I looking at here? Describe the equations" and have it OCR the frames, analyze them, and explain them to me.

It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts, although I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert who longs to hear the sound of their own voice. A lot of human communication is non-verbal.

Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms, and gestures will be considered prompts.

At least that is what I imagine the tech evolving into in 5+ years.
groby_b 2 months ago

Perusing the website, it's depressing how far behind Mistral is on basic "how can I make this a compelling hook for customers" work on this page.

The notebook link? An ACL'd doc.

The examples don't even include a small text-to-markdown sample.

The before/after slider is cute but useless - side-by-side is a much better way to compare.

Trying it in "Le Chat" requires a login.

It's like an example of "how can we implement maximum loss across our entire funnel". (I have no doubt the underlying tech does well, but... damn, why do you make it so hard to actually see it, Mistral?)

If anybody tried it and has shareable examples - can you post a link? Also, has anybody tried it with handwriting yet?
michaelbuckbee 2 months ago

I'd mentioned this on HN last month, but I took a picture of a grocery list and then pasted it into ChatGPT to have it written out, and it worked flawlessly... until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
TriangleEdge 2 months ago

One of my hobby projects while in university was to do OCR on book scans. Character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to me to see such an order-of-magnitude improvement here.

Does it do handwritten notes and annotations? What about meta information like highlighting? I am also curious whether LLMs will get better because of access to more information if it can be effectively extracted from PDFs.
s4i 2 months ago
I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.
z2 2 months ago

Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that Mistral OCR on my messy cursive handwriting is much less accurate than GPT-4o: in the ballpark of 30% wrong vs. closer to 5% wrong for GPT-4o.

Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard
oysterville 2 months ago

Dupe of a post from an hour earlier: https://news.ycombinator.com/item?id=43282489
jojogh 2 months ago

High accuracy is the goal! But the multimodal approach introduces some complexities that can impact real-world performance. We break it down in our review: https://undatas.io/blog/posts/in-depth-review-of-mistral-ocr-a-pdf-parsing-powerhouse-tailored-for-the-ai-era/ As for use cases, it really depends on how well it handles edge cases...
qwertox 2 months ago

We developers seem to really dislike PDFs, to the degree that we'll build LLMs and have them translate PDFs into Markdown.

Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.
climb_stealth 2 months ago

Does this support Japanese? They list a table of language comparisons against other approaches, but I can't tell if it is exhaustive.

I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals, because traditional OCR really struggles with them. They have tables, graphics, text in graphics, the whole shebang.
protonbob 2 months ago

Wow, this basically "solves" DRM for books, as well as opening up the door to digitizing old texts more accurately.
bsnnkv 2 months ago
Someone working there has good taste to include a Nizar Qabbani poem.
andoando 2 months ago

A bit unrelated, but is there anything that can help with really low-resolution text? My neighbor was hit in a hit-and-run the other day, for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate:

https://ibb.co/mr8QSYnj
yoeven 2 months ago

I ran Mistral AI OCR against JigsawStack OCR and beat their model in every category. Full breakdown here: https://jigsawstack.com/blog/mistral-ocr-vs-jigsawstack-vocr
jacooper 2 months ago

Pretty cool. I would love to use this with Paperless, but I just can't bring myself to send a photo of all my documents to a third party, especially legal and sensitive documents, which is what I use Paperless for.

Because of that I'm stuck with crappy vision on Ollama (thanks to AMD's crappy ROCm support for vLLM).
InvidFlower 2 months ago

While it is nice to have more options, it still definitely isn't at a human level yet for hard-to-read text. I still haven't seen anything that can deal with something like this very well: https://i.imgur.com/n2sBFdJ.jpeg

If I remember right, Gemini actually was the closest as far as accuracy of the parts where it "behaved", but it'd start to go off the rails and reword things at the end of larger paragraphs. Maybe if the image was broken up into smaller chunks. In comparison, Mistral for the most part (besides one particular line, for some reason) sticks to the same number of words, but gets a lot wrong on the specifics.
hdjrudni 2 months ago

Still terrible at handwriting.

I signed up for the API and cobbled something together from their tutorial (https://docs.mistral.ai/capabilities/document/) - why can't they give the full script instead of little bits?

Tried uploading a TIFF; they rejected it. Tried uploading a JPG; they rejected it (even though they supposedly support images?). Tried resaving as PDF. It took that, but the output was just bad. Then I tried ChatGPT on the original .tiff (not using the API), and it got it perfectly. Honestly, I could barely make out the handwriting with my own eyes, but now that I see ChatGPT's version I think it's right.
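For anyone else stitching the tutorial together, here is roughly the full flow in one script. This is a sketch written against the Python SDK as the linked docs describe it, not an official example; the file name and environment variable are placeholders:

```python
# Sketch of the end-to-end flow from the Mistral document docs:
# upload a PDF, get a signed URL, run OCR, print the per-page markdown.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# 1. Upload the file for OCR processing.
with open("scan.pdf", "rb") as f:
    uploaded = client.files.upload(
        file={"file_name": "scan.pdf", "content": f},
        purpose="ocr",
    )

# 2. Exchange the file id for a signed URL the OCR endpoint can read.
signed = client.files.get_signed_url(file_id=uploaded.id)

# 3. Run OCR; the response contains one markdown blob per page.
resp = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed.url},
)
for page in resp.pages:
    print(page.markdown)
```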
hubraumhugo 2 months ago

It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.

The best products will be defined by everything "non-AI": UX, performance and reliability at scale, and human-in-the-loop feedback for domain experts.
sixhobbits 2 months ago

Nice demos, but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.

The challenge isn't so much the OCR part as the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.

And page-by-page isn't trivial, as header rows are repeated or missing, etc.

So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large, messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?
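One workaround for that laziness is to split the PDF and process pages individually, then stitch the results (and the repeated or missing header rows) back together yourself. A sketch with pypdf; file names are placeholders:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("export.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"export_page_{i:03d}.pdf", "wb") as out:
        writer.write(out)
# Each single-page file can now be sent separately, trading the
# "lazy after page one" failure mode for a manual stitching step.
```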
egorfine 2 months ago

I had a need to scan serial numbers from Apple's product boxes out of pictures taken by a random person on their phone.

All OCR tools that I tried failed. Granted, I would have gotten much better results if I had used OpenCV to detect the label, rotate/correct it, normalize contrast, etc.

But... I tried the then-new vision model from OpenAI and it did the trick so well it wasn't feasible to consider anything else at that point.

I checked all serial numbers afterwards for correctness via a third-party API - and all of them were correct. Sure, sometimes I had to check versions with 0/o and i/l/1 substitutions, but I believe those kinds of mistakes are non-issues.
dotnetkow 2 months ago

Congrats to the Mistral team on launching! A general-purpose OCR model is useful, of course. However, more purpose-built solutions are a must to convert business documents reliably. AI models pre-trained on specific document types perform better and are more accurate. Coming soon from the ABBYY team: we're shipping a new OCR API designed to be consistent, reliable, and hallucination-free. Check it out if you're looking for best-in-class DX: https://digital.abbyy.com/code-extract-automate-your-new-must-have-ocr-api-coming-soon
pqdbr 2 months ago

I tried with both PDFs and PNGs in Le Chat and the results were the worst I've ever seen compared to any other model (Claude, ChatGPT, Gemini).

So bad that I think I need to enable the OCR function somehow, but I couldn't find it.
bob1029 2 months ago

> It takes images and PDFs as input

If you are working with PDF, I would suggest a hybrid process.

It is feasible to extract information with 100% accuracy from PDFs that were generated using the mappable AcroFields approach. In many domains, you have a fixed set of forms you need to process, and this can be leveraged to build a custom tool for extracting the data.

Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc. should you need to reach for OCR.

The moment you need to use this kind of technology, you are in a completely different regime of what the business will (should) tolerate.
kapitalx 2 months ago

Co-founder of doctly.ai here (OCR tool).

I love Mistral and what they do. I got really excited about this, but was a little disappointed after my first few tests.

I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:

```
![img-0.jpeg](img-0.jpeg)
```

I'll keep testing, but so far, very disappointing :(

The document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use, and nothing could really give us the right data.

Doctly uses a judge: it OCRs a document against multiple LLMs and decides which result to pick. It will keep rerunning the page until the judge scores above a certain threshold.

I would have loved to add this to the judge list, but might have to skip it.
jervant 2 months ago
I wonder how it compares to USPS workers at deciphering illegible handwriting.
Oras 2 months ago

I feel this was created for RAG. I tried a document [0] that I had tested with OCR; it got all the table values correctly, but the page's footer was missing.

Headers and footers are a real pain with RAG applications, as they are not required, yet most OCR or PDF parsers will return them, and there is extra work to do to remove them.

[0] https://github.com/orasik/parsevision/blob/main/example/MultiPageInvoice.pdf
mjnews 2 months ago

> Mistral OCR has shown impressive performance, but OCR remains a challenging problem, especially with the risk of hallucinations and missing text in LLM-based approaches. For those interested in exploring its capabilities further, the official site provides more details: [Mistral OCR](https://www.mistralocr.org). It would be great to see more benchmarks comparing different OCR solutions in real-world scenarios.
yoelhacks 2 months ago

I was curious about Mistral, so I made a few visualizations.

A high-level diagram with links to files: https://eraser.io/git-diagrammer?diagramId=uttKbhgCgmbmLp8OFf9R

Specific flow of an OCR request: https://eraser.io/git-diagrammer?diagramId=CX46d1Jy5Gsg3QDzPOah

(Disclaimer: uses a tool I've been working on.)
lingjiekong 2 months ago

Curious whether people have found more details on the architecture of this "mistral-ocr-latest". Two questions:

1. I initially thought this was a VLM parsing model until I saw it can extract images. So I assume it is a pipeline of an image-extraction step and a VLM, whose results are combined to give the final output.

2. In that case, benchmarking the pipeline against an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.
pawelduda 2 months ago

It outperforms the competition significantly AND can extract embedded images from the text. I like LLMs for OCR more and more. Gemini was already pretty good at it.
strangescript 2 months ago

I think it's interesting that they left Gemini 2.0 Pro out of the benchmarks; I find it markedly better than Flash if you don't mind the spend.
coolspot 2 months ago

This is $1 per 1000 pages. For comparison, Azure Document Intelligence is $1.5/1000 pages for general OCR and $30/1000 pages for "custom extraction".
srinathkrishna 2 months ago

Given the fact that multi-modal LLMs are getting so good at OCR these days, is it a shame that we can't do local OCR with high accuracy in the near term?
sureglymop 2 months ago

Looks good, but in the first hover/slider demo one can see how it could lead to confusion when handling side-by-side content.

Table 1 is referred to in section `2 Architectural details` but before `2.1 Multimodal Decoder`. In the generated markdown, though, it is below the latter section, as if it were part of that section.

Of course I am nitpicking here, but it's just the first thing I noticed.
soyyo 2 months ago

I understand that it is juicier to get information from graphs, figures, and so on, as every domain uses those, but I really hope to eventually see these models able to work out music notation. I have tried the best-known apps, and all of them fail to capture important details such as guitar performance symbols for bends or legato.
peterburkimsher 2 months ago
Does it work for video subtitles? And in Chinese? I’m looking to transcribe subtitles of live music recordings from ANHOP and KHOP.
th0ma5 2 months ago
A great question for people wanting to use OCR in business is... Which digits in monetary amounts can you tolerate being incorrect?
rvz 2 months ago

> "Fastest in its category"

Not one mention of the company they have partnered with, Cerebras AI, which is the reason they have fast inference [0].

Literally no one here is talking about them, and they are about to IPO.

[0] https://cerebras.ai/blog/mistral-le-chat
roboben 2 months ago

Le Chat doesn't seem to know about this change, despite the blog post stating it. Can anyone explain how to use it in Le Chat?
aperrien 2 months ago
Is this model open source?
low_tech_punk 2 months ago

This might be a contrarian take: the improvement over gpt-4o and gemini-1.5 flash, both of which are general-purpose multimodal models, seems underwhelming.

I'm sensing another bitter lesson coming, where domain-optimized AI will hold a short-term advantage but will be outdated quickly as the frontier models advance.
simonw 2 months ago

I built a CLI script for feeding PDFs into this API - notes on that and my explorations of Mistral OCR here: https://simonwillison.net/2025/Mar/7/mistral-ocr/
submeta 2 months ago

Is this able to convert PDF flowcharts into YAML or JSON representations of them? I have been experimenting with Claude 3.5, which has been very good at reading/understanding/converting flowcharts into such representations.

So I am wondering if this is more capable. Will definitely try, but maybe someone can chime in.
constantinum 2 months ago

I see a lot of comments on hallucination risk and the accumulation of non-traceable rotten data. If you are curious to try a better non-LLM-based OCR, try LLMWhisperer: https://pg.llmwhisperer.unstract.com/
gatienboquet 2 months ago

It seems I can't create an agent with their OCR model yet? Is that planned, or is it API-only?
jcuenod 2 months ago

Just tested with a multilingual (bidi) English/Hebrew document.

The Hebrew output had no correspondence to the text whatsoever (in context, there was an English translation, and the Hebrew produced was a back-translation of that).

Their benchmark results are impressive, don't get me wrong. But I'm a little disappointed. I often read multilingual document scans in the humanities. Multilingual (and esp. bidi) OCR is challenging, and I'm always looking for a better solution for a side project I'm working on (fixpdfs.com).

Also, I thought OCR implied that you could get bounding boxes for text (and reconstruct a text layer on a scan, for example). Am I wrong, or is this term just overloaded now?
pilooch 2 months ago

But what exactly is the need for OCR when you have multimodal LLMs that can read the same info and directly answer any questions about it?

For a VLM, my understanding is that OCR corresponds to a sub-field of questions of the type "read exactly what's written in this document".
thomasahle 2 months ago

I'm surprised they didn't benchmark it against Pixtral.

They test it against a bunch of different multimodal LLMs, so why not their own?

I don't really see the purpose of the OCR form factor when you have multimodal LLMs, unless it's significantly cheaper.
ein0p 2 months ago

Could anyone suggest a tool which would take a bunch of PDFs (already OCR'd with FineReader) and replace the OCR overlay on all of them, maintaining the positions? I would like to have more accurate search over my document archive.
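OCRmyPDF's redo mode is aimed at roughly this: as I understand it, it replaces an existing text layer while leaving the page images untouched. It is Tesseract-based rather than one of the newer models; a sketch via its Python API, with placeholder paths:

```python
# Sketch: re-OCR an archive of already-OCR'd PDFs with OCRmyPDF's redo mode.
from pathlib import Path
import ocrmypdf

out_dir = Path("redone")
out_dir.mkdir(exist_ok=True)
for pdf in Path("archive").glob("*.pdf"):
    # redo_ocr strips and rebuilds the hidden text layer in place of the old one.
    ocrmypdf.ocr(pdf, out_dir / pdf.name, redo_ocr=True)
```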
alberth 2 months ago

Curious to see how this performs against more real-world usage, like someone taking a photo of text (where the text becomes slightly blurred) and running OCR on it.

I can't tell if the "Mistral 7B" image is an example of this exact scenario.
101008 2 months ago

Is this free in Le Chat? I uploaded a handwritten text and it stopped after the 4th word.
riffic 2 months ago

It'd be great if this could be tested against genealogical documents written in cursive, like most of the documents on microfilm stored by the LDS on FamilySearch, or Eastern European archival projects, etc.
lokl 2 months ago
Tried with a few historical handwritten German documents, accuracy was abysmal.
dwedge 2 months ago

Benchmarks look good. I tried this with a PDF that already has an accurate text layer embedded, just with newlines making pdftotext fail, and it was accurate for the text it found but missed entire pages.
monkeydust 2 months ago

Spent time working on an OCR problem many years ago for a mobile app. We found at the time that preprocessing was critical to the outcome (quality of image, angle, colour/greyscale).
Gnan 2 months ago

Is there an OCR with this kind of accuracy that can run on a mobile device? Looking for an OCR that can detect text with high accuracy in real time, so cloud OCR is not a viable option.
joeevans1000 2 months ago

I've found that the stunning OCR results so far were because the models were trained on the example file category. Is that the case here? Or can this recognize various documents?
atemerev 2 months ago

So, the only thing that stopped AI from learning from all our science and taking over the world was the difficulty of converting PDFs of academic papers to more computer-readable formats.

Not anymore.
newfocogi 2 months ago

They say: "releasing the API mistral-ocr-latest at 1000 pages / $"

I had to reread that a few times. I assume this means 1000 pages per $1, but I'm still not sure about it.
polytely 2 months ago

I don't need AGI, just give me superhuman OCR so we can turn all existing PDFs into text* and cheaply host it.

Feels like we are almost there.

*: https://annas-archive.org/blog/critical-window.html
shmoogy 2 months ago

What's the general time for something like this to hit OpenRouter? I really hate having accounts everywhere when I'm trying to test new things.
deadbabe 2 months ago

LLM-based OCR is a disaster: great potential for hallucinations and no estimate of confidence. Results might seem promising, but you'll always be wondering.
kccqzy 2 months ago

I have an actually hard OCR exercise for an AI model: I take this image of Chinese text on one of the memorial stones in the Washington Monument, https://www.nps.gov/articles/american-mission-ningpo-china-220-level.htm, and ask the model to do OCR. Not a single model I've seen can OCR this correctly. Mistral is especially bad here: it gets stuck in an endless loop of nonsensical hallucinated text. Insofar as Mistral is designed for "preserving historical and cultural heritage", it couldn't do that very well yet.

A good model would recognize that the text is written top to bottom and then right to left, and perform OCR in that direction. Apple's Live Text can do that, though it makes plenty of mistakes otherwise. Mistral is far from that.
thiago_fm 2 months ago

For general use this will be good.

But I bet that simple ML will lead to better OCR when you are doing anything specialized, such as medical documents, invoices, etc.
kinnth 2 months ago

This looks like a massive win if you were the NHS and had to scan and process old case notes. The same is true for solicitors/lawyers.
jslezak 2 months ago

Has anyone tried it for handwriting?

So far Gemini is the only model I can get decent output from for a particularly hard handwriting task.
applgo443 2 months ago

What's the simple explanation for why these VLM OCRs hallucinate but previous versions of OCR don't?
thegabriele 2 months ago

I'm using Gemini to solve textual CAPTCHAs with some good results (better than untrained OCR).

I will give this a shot.
dehrmann 2 months ago

Is this burying the lede? OCR is a solved problem, but structuring document data from scans isn't.
Zufriedenheit 2 months ago
How can I use these new OCR tools to make PDF files searchable by embedding the text layer?
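As other commenters note, these API-based models return markdown without word coordinates, so they can't place a text layer by themselves. The established route is OCRmyPDF, which burns an invisible, searchable text layer under the scanned image using Tesseract; a minimal sketch with placeholder file names:

```python
import ocrmypdf

# Adds an invisible text layer beneath the original scan, keeping it searchable.
# deskew straightens crooked pages before OCR; "eng" is the OCR language.
ocrmypdf.ocr("scanned.pdf", "searchable.pdf", deskew=True, language="eng")
```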
nyeah 2 months ago

It's not fair to call it a "Mistrial" just because it hallucinates a little bit.
anovick 2 months ago

How does one use it to identify bounding rectangles of images/diagrams in the PDF?
OrvalWintermute 2 months ago

I'm happy to see this development after being underwhelmed by ChatGPT OCR!
beebaween 2 months ago

Wonder how it does with table data in PDFs / page-long tabular data?
jhatemyjob 2 months ago
As far as open source OCRs go, Tesseract is still the best, right?
cavisne 2 months ago

It's funny how Gemini consistently beats Google's dedicated document API.
d_llon 2 months ago

It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, and hopefully from Mistral themselves.

1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.

2. We don't know how large each dataset is (or even how the metrics are calculated/aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW% -- is just noise.[1]

3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out particular data points that their model struggled with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding it doesn't work, and deducing that it's probably bad data and removing it.)

[1] https://medium.com/towards-data-science/digit-significance-in-machine-learning-dea05dd6b85b
jbverschoor 2 months ago
Ohhh. Gonna test it out with some 100+ year old scribbles :)
WhitneyLand 2 months ago

1. There's no simple page/sandbox to upload images and try it. Fine, I'll code it up.

2. "Explore the Mistral AI APIs" (https://docs.mistral.ai) links to all APIs except OCR.

3. The docs on the API params refer to document chunking and image chunking, but give no details on how their chunking works.

So much unnecessary friction, smh.
t_sea 2 months ago
They really went for it with the hieroglyphs opening.
noloz 2 months ago
Are there any open source projects with the same goal?
ritvikpandey21 2 months ago

As builders in this space, we decided to put it to the test on complex nested tables, pie charts, etc., to see if the same VLM hallucination issues persist, and to what degree. While results were promising, we found several critical failure modes across two document domains.

Check out our blog post here: https://www.runpulse.com/blog/beyond-the-hype-real-world-tests-of-mistrals-ocr
linklater12 2 months ago

Document processing is where B2B SaaS is at.
revskill 2 months ago

The Next.js error is still not caught correctly.
jwr 2 months ago

Alas, I can't run it locally. So it still doesn't solve the problem of OCR for my PDF archive containing my private data...
maCDzP 2 months ago

Oh - an on-premise solution - awesome!
zelcon 2 months ago
Release the weights or buy an ad
sashank_1509 2 months ago
Really cool, thanks Mistral!
rjurney 2 months ago
What about tables in PDFs?
Zopieux 2 months ago

Saving you a click: no, it cannot be self-hosted (unless you have a few million dollars lying around).
bugglebeetle 2 months ago

Congrats to Mistral for yet again releasing another closed-source thing that costs more than running an open-source equivalent:

https://github.com/DS4SD/docling
joeevans1000 2 months ago

Can someone give me a TL;DR on how to start using this? Is it available if one signs up for a regular Mistral account?
kiratp 2 months ago

It's shocking how much our industry fails to see past its own nose.

Not a single example on that page is a purchase order, invoice, etc. Not a single example shown is relevant to industry at scale.
bondolo 2 months ago

Such a shame that PDF doesn't just, like, include the semantic structure of the document by default. It is brilliant that we standardized on an archival document format that doesn't include direct access to the document text or structure as a core intrinsic default feature.

I say this with great anger as someone who works in accessibility and has had PDF as a thorn in my side for 30 years.
sunami-ai 2 months ago

Making transformers cost the same as CNNs (which are used in character-level OCR, as opposed to image-patch-level) is a good thing. The problem with CNN-based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and therefore overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful, and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least for incumbents offering traditional character-level OCR.
hyuuu 2 months ago

It's weird timing, because I just launched https://dochq.io - AI document extraction where you can define what you need to get out of your documents in plain English. I legitimately thought this was going to be such a niche product, but there has been a very rapid rise in AI-based OCR lately; an article/tweet about using Gemini to do OCR even went viral two weeks ago, I think? Fun times.