Show HN: how I built the largest open database of Australian law

172 points by ubutler over 1 year ago

22 comments

ubutler over 1 year ago

Hey HN,

Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.

In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!

You can find my database on HuggingFace (https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) and the code used to create it on GitHub (https://github.com/umarbutler/open-australian-legal-corpus-creator).

juliangamble over 1 year ago

Australia has had free, searchable collections of Australian law for 25+ years. Austlii is a prime example. There are Federal and State collections as well. The author was conscientious enough to read the scraping policy (or was blocked by anti-scraping tools), which kept him from feeding one of these sites into his LLM.

cookie_monsta over 1 year ago

This is cool and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?

freefaler over 1 year ago

Great work and congratulations on your tenacity dealing with bureaucrats. Open access and machine-readable formats should be widely available.

anakaine over 1 year ago

Good work reaching out to, and trying to get along with, Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many people in charge of operating these systems should not be.

jfil over 1 year ago

I'm floored by what you accomplished in a year. Here in Canada, case law is in the iron grip of CanLII, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing. It's refreshing to hear that one person can make a difference in a very similar setting (Australia).

emmelaich over 1 year ago

Nice, but I have to ask how does it compare with https://austlii.edu.au/, especially for completeness?

The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public at least).

Trivia: the only reason I found out about it was when I did some work for an Aus govt agency and found out that they shared their web site with austlii! This was back in the early 2000s.

RagnarD over 1 year ago

In my experience, one of the devilish details is keeping such a database continuously updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.

Obscurity4340 over 1 year ago

What do you think of the Canadian legal case law website CanLii? What could it do better, or do you think it's done well? Is it overdue for innovation?

DamonHD over 1 year ago

Would it be worth getting your corpus replicated into other venues as well, such as the Internet Archive or on GitHub itself?

jfil over 1 year ago

Mrirazak1 over 1 year ago

It would be lovely, I think, if you used ML to help people ask questions so they can have more accessible law at their hands and understand what lots of things mean. This can be applied to multiple countries around the world. The world of laws at your hands. It’s an interesting concept.

darcys22 over 1 year ago

So good! It's crazy how legal information is such a spread-out mess. What's worse is that git is such a perfect solution for legislation.

Syeposxr over 1 year ago

Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to re-licence it as CC BY 4.0?

thomasfromcdnjs over 1 year ago

Incredible effort. These types of projects have the potential to influence a nation.

ulrischa over 1 year ago

I think there is huge potential for digitalisation in law-related subjects. In Germany the law texts are online, but the paragraphs are not linked.

mediumsmart over 1 year ago

Great work, thank you so much. Fwiw, for getting formatted text from HTML, did you try

    lynx --dump url >> file.plaintext
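For anyone who would rather stay in Python than shell out to lynx, a rough standard-library sketch of the same idea (drop the markup, keep the visible text) might look like the following. It is not taken from the corpus creator's code, the names `TextExtractor` and `html_to_text` are made up for illustration, and it makes no attempt to reproduce lynx's layout or link numbering.

```python
from html.parser import HTMLParser

# Tags whose contents are never shown to a reader.
SKIPPED_TAGS = {"script", "style"}
# Tags treated as line breaks in the plain-text output (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<h1>Example Act</h1><p>Section 1 &amp; Section 2</p>"
    print(html_to_text(sample))
```

Real legislation pages tend to need more than this (tables, footnotes, nested numbering), so treat it as a starting point only.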

smcleod over 1 year ago

That's really neat. Such a shame VIC couldn't be included though.

danielmarkbruce over 1 year ago

Insanely great. Amazing work.

nextworddev over 1 year ago

Is there a U.S. equivalent?

tamarlikesdata over 1 year ago

Nice. Thanks for sharing.

subhashp over 1 year ago

Well done!
