Ask HN: Information extraction software?

5 points by mjfern about 9 years ago

I'm looking for information extraction software that I can feed in historical legal agreements and it will report:

1. Changes in the text between the documents
2. Changes in other attributes of the documents (e.g., word count)
3. % change over time in the text and attributes (e.g., text in the 1986 version of the doc is 56% different than the text in the 1985 version of the doc)

Can anyone please point me to software that might fit this particular need?

Thanks in advance! Michael

4 comments

tgflynn about 9 years ago

I don't know about software designed specifically for doing this with legal documents, but most of this you could probably do quite easily with some simple Unix tools and a little scripting. I'm guessing the documents aren't plain text, so the first step would be to extract the text. For example, if they're PDFs you could use pdftotext.

Then for:

1. diff, diff viewers like xxdiff (the name may have changed somewhat recently), or git
2. wc
3. diff with some scripts to automatically process the documents, count the number of words in the documents, and write to a CSV file
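As a rough sketch of the scripting step, here is one way the three items could be wired together. It swaps the Unix tools for Python's standard library (`difflib` in place of diff, `str.split` in place of wc), and it assumes the text has already been extracted, for example with pdftotext. The function names are made up for illustration, and "percent changed" is defined here as one minus the word-level similarity ratio, which is only one of several reasonable definitions.

```python
import csv
import difflib
import io


def word_count(text):
    """Count whitespace-separated words, roughly what `wc -w` reports."""
    return len(text.split())


def text_diff(old, new, old_label="old", new_label="new"):
    """Line-level unified diff between two versions, as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))


def percent_changed(old, new):
    """Assumed metric: 100 * (1 - similarity ratio), compared word by word."""
    matcher = difflib.SequenceMatcher(None, old.split(), new.split(),
                                      autojunk=False)
    return round((1.0 - matcher.ratio()) * 100, 1)


def compare_versions(versions):
    """versions: list of (label, text) pairs in chronological order.

    Returns one row per consecutive pair of versions.
    """
    rows = []
    for (old_label, old_text), (new_label, new_text) in zip(versions, versions[1:]):
        rows.append({
            "from": old_label,
            "to": new_label,
            "words_from": word_count(old_text),
            "words_to": word_count(new_text),
            "pct_text_changed": percent_changed(old_text, new_text),
        })
    return rows


FIELDS = ["from", "to", "words_from", "words_to", "pct_text_changed"]


def to_csv(rows):
    """Serialize the comparison rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

To use it, read each extracted text file into a string, pass the `(year, text)` pairs to `compare_versions` in chronological order, and write the result of `to_csv` to a file. Comparing by word rather than by line keeps the percentage from swinging when a document is merely re-wrapped.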

salaroglio about 9 years ago

You can evaluate the project http://eucases.eu/, and you can also evaluate the Akoma Ntoso standard. Here you can find an editor: https://legixinfo.wordpress.com/2015/07/02/coming-soon-a-new-web-based-editor-for-akoma-ntoso/

BjoernKW about 9 years ago

Apache UIMA (https://uima.apache.org/) and GATE (https://gate.ac.uk/ie/) come to mind. Those are not ready-made software products, though, but rather frameworks that allow you to implement IE algorithms. While not exactly trivial, implementing something like what you're suggesting is definitely possible with GATE.

dreamdu5t about 9 years ago

I can build the software for you if you need.