How we made our OCR code more accurate

58 points, posted by thunderbong 4 days ago

8 comments

bluelightning2k 3 days ago
I can't say I've ever wanted to transcribe code from an image. That seems super niche.

Perhaps the specific idea is to harvest coding textbooks as training data for LLMs?
abc-1 3 days ago
Anything that mentions tesseract is about 10 years out of date at this point.
camtarn 3 days ago
Neat article, but I feel like I have no idea why they're doing this! Is transcribing code from images really such a big use case?
bobosha 2 days ago
Has anyone tried feeding the admittedly noisy OCR-ed text, at the document level, to an LLM to make sense of it? Presumably some of the less capable models would be quite affordable and accurate at scale as well.
lesuorac 2 days ago
OCR is the biggest XY problem.

Stop accepting PDFs and force things to use APIs ...
MoonGhost 2 days ago
Even a small upscaling model trained on text should do better than a big generic one.
sushid 2 days ago
Making OCR more accurate for regular text (e.g. data extraction from documents) would be useful; I'm not sure how useful code transcription is.
vaxman 2 days ago
Tesseract OCR was created by digital (DEC) in 19_8_5 (yes, 40 not four YEARs ago). Now go back and read the article and ROFL with me.