
Ask HN: Information extraction software?

5 points by mjfern, about 9 years ago
I'm looking for information extraction software that I can feed in historical legal agreements and it will report:

1. Changes in the text between the documents
2. Changes in other attributes of the documents (e.g., word count)
3. % change over time in the text and attributes (e.g., text in the 1986 version of the doc is 56% different than the text in the 1985 version of the doc)

Can anyone please point me to software that might fit this particular need?

Thanks in advance! Michael

4 comments

tgflynn, about 9 years ago
I don't know of software designed specifically for doing this with legal documents, but you could probably do most of it quite easily with some simple Unix tools and a little scripting. I'm guessing the documents aren't plain text, so the first step would be to extract the text. For example, if they're PDFs you could use pdftotext.

Then for:

1. diff, diff viewers like xxdiff (the name may have changed somewhat recently), or git
2. wc
3. diff with some scripts to automatically process the documents, count the number of words in the documents, and write to a CSV file
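
The three steps above can be sketched in a short script. This is only a rough illustration, not a finished tool: it assumes the text has already been extracted to plain strings, it uses Python's standard `difflib` in place of the `diff` and `wc` commands, and it defines "% different" as one minus the word-level similarity ratio, which is just one of several reasonable definitions. The function names and the sample clauses are made up for the example.

```python
import csv
import difflib
import io


def word_count(text):
    """Rough equivalent of `wc -w`: count whitespace-separated tokens."""
    return len(text.split())


def text_diff(old, new, old_label="old", new_label="new"):
    """Line-by-line unified diff, similar to `diff -u`."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))


def percent_changed(old, new):
    """Percent difference between two texts, compared word by word.

    Assumption: "% different" means 1 - SequenceMatcher.ratio().
    """
    matcher = difflib.SequenceMatcher(
        None, old.split(), new.split(), autojunk=False)
    return round((1.0 - matcher.ratio()) * 100, 1)


def compare_versions(versions):
    """Compare consecutive (label, text) pairs given in chronological order."""
    rows = []
    for (old_label, old), (new_label, new) in zip(versions, versions[1:]):
        rows.append({
            "from": old_label,
            "to": new_label,
            "words_from": word_count(old),
            "words_to": word_count(new),
            "pct_text_changed": percent_changed(old, new),
        })
    return rows


def to_csv(rows):
    """Serialize the comparison rows as CSV text."""
    fields = ["from", "to", "words_from", "words_to", "pct_text_changed"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical sample clauses standing in for extracted agreement text.
    versions = [
        ("1985", "the tenant shall pay rent monthly"),
        ("1986", "the tenant shall pay rent quarterly in advance"),
    ]
    print("\n".join(text_diff(versions[0][1], versions[1][1], "1985", "1986")))
    print(to_csv(compare_versions(versions)))
```

For PDFs, the text would first need to be extracted (for example with `pdftotext agreement.pdf agreement.txt`) and read from the resulting files. Comparing word by word rather than line by line keeps the percentage from swinging when only line wrapping changes between versions.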
salaroglio, about 9 years ago
You can evaluate the project http://eucases.eu/, and you can also evaluate the Akoma Ntoso standard. Here you can find an editor: https://legixinfo.wordpress.com/2015/07/02/coming-soon-a-new-web-based-editor-for-akoma-ntoso/
BjoernKW, about 9 years ago
Apache UIMA (https://uima.apache.org/) and GATE (https://gate.ac.uk/ie/) come to mind.

Those are not ready-made software products, though, but rather frameworks that allow you to implement IE algorithms. While not exactly trivial, implementing something like what you're suggesting is definitely possible with GATE.
dreamdu5t, about 9 years ago
I can build the software for you if you need.