Show HN: Parsr – A toolchain to transform documents into usable structured text

182 points by pierre, over 5 years ago

9 comments

hbcondo714, over 5 years ago
It appears images and PDFs are the currently supported document types. If so, is there an opportunity to support web pages? We have quite a few verbose legal documents that consist of auto-generated HTML and average 100 pages. A tool like this would be helpful in automatically dividing these into their respective section headings. It would also be beneficial to detect and remove extra info such as page numbers. Thanks for posting this!
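For illustration only, here is a minimal sketch of the kind of HTML handling described above: splitting a page into sections at heading tags and dropping text blocks that look like page numbers. It uses only Python's standard `html.parser` and is not Parsr's API. The names `SectionSplitter` and `split_sections`, the set of block tags, and the page-number pattern are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "div", "li"}  # assumed block-level containers

# Assumed page-number shapes: "12", "Page 12", "Page 12 of 100".
# Caveat: a genuine block consisting only of a number is dropped too.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)


class SectionSplitter(HTMLParser):
    """Hypothetical helper: groups text blocks under the preceding heading."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._buffer = []
        self._current = {"heading": None, "blocks": []}

    def _take_text(self):
        # Join buffered chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        return text

    def _flush_block(self):
        text = self._take_text()
        if text and not PAGE_NUMBER.match(text):
            self._current["blocks"].append(text)

    def _close_section(self):
        if self._current["heading"] is not None or self._current["blocks"]:
            self.sections.append(self._current)

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A heading ends the previous section and starts a new one.
            self._flush_block()
            self._close_section()
            self._current = {"heading": "", "blocks": []}
        elif tag in BLOCK_TAGS:
            self._flush_block()

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._current["heading"] = self._take_text()
        elif tag in BLOCK_TAGS:
            self._flush_block()

    def handle_data(self, data):
        self._buffer.append(data)


def split_sections(html):
    """Return a list of {"heading": ..., "blocks": [...]} dicts."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    parser._flush_block()
    parser._close_section()
    return parser.sections


if __name__ == "__main__":
    sample = (
        "<h1>Terms</h1><p>First clause.</p><p>Page 1 of 100</p>"
        "<h2>Liability</h2><p>Second clause.</p><div>2</div>"
    )
    for section in split_sections(sample):
        print(section["heading"], section["blocks"])
```

Real auto-generated legal HTML often marks headings with styled `div` or `span` elements rather than `h1` to `h6`, so the heading test would likely need to be adapted to the documents at hand.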
pininja, over 5 years ago
This seems like a super useful way to package and deliver this kind of toolchain! I've been looking for PDF parsers on and off for the last couple of years, and have found it challenging to get most tools set up for data extraction and analysis.

This one packages an off-the-shelf version into a Docker image and starts a GUI website locally. Looking forward to using this more!
ZeroCool2u, over 5 years ago
Hmmm, looks useful. The list of dependencies is basically a who's who of doing various types of document parsing. Is this basically just a unified interface that wraps them all up into an API?
staticautomatic, over 5 years ago
I was really excited to try this until I saw that the only extraction methods are pdfminer, finereader, and tesseract. I was hoping there was something you rolled on your own. I've been trying for a long time to parse tables (and nested tables) but the available extractors seem to only work on really simple, idealized tables with virtually no skew or warping. The best I've found so far is Amazon's Textract, but it's not that great either. Alas, every attempt I've ever made at generalized table extraction has quickly regressed to templates.
anilgulecha, over 5 years ago
Are there any example inputs and outputs to quickly see what's possible?
kresten, over 5 years ago
Apache Tika is a powerful text extraction engine.

Why this over Tika?
udayrddy, over 5 years ago
That looks like a great, comprehensive toolkit for data extraction. I understand the bundle is licensed under Apache; I'm curious about the requirements/rules to follow to include a service like Abbyy.

We at extracttable.com (we extract tabular data from images and PDFs over an API) are interested in contributing and integrating the service into the bundle.
Tade0, over 5 years ago
My old beater is insured with AXA. I didn't know they had any open source projects going on.
all-out-of-hope, over 5 years ago
Very cool, amazing this is OSS.