Show HN: how I built the largest open database of Australian law

172 points by ubutler over 1 year ago

22 comments

ubutler over 1 year ago

Hey HN,

Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.

In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!

You can find my database on HuggingFace (https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) and the code used to create it on GitHub (https://github.com/umarbutler/open-australian-legal-corpus-creator).

juliangamble over 1 year ago

Australia has had free, searchable collections of Australian law for 25+ years. AustLII is a prime example. There are Federal and State collections as well. The author was conscientious enough to read the scraping policy (or was blocked by anti-scraping tools), which kept him from feeding one of these sites into his LLM.

cookie_monsta over 1 year ago

This is cool and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?

freefaler over 1 year ago

Great work and congratulations on your tenacity dealing with bureaucrats. Open access and machine-readable formats should be widely available.

anakaine over 1 year ago

Good work reaching out to, and trying to get along with, Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many of the people in charge of operating these systems should not be.

jfil over 1 year ago

I'm floored by what you accomplished in a year. Here in Canada, case law is in the iron grip of CanLII, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing. It's refreshing to hear that one person can make a difference in a very similar setting (Australia).

emmelaich over 1 year ago

Nice, but I have to ask: how does it compare with https://austlii.edu.au/, especially for completeness?

The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public at least).

Trivia: the only reason I found out about it was that I did some work for an Aus govt agency and found out that they shared their web site with AustLII! This was back in the early 2000s.

RagnarD over 1 year ago

In my experience, one of the devilish details is keeping such a database continuously updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.

Obscurity4340 over 1 year ago

What do you think of the Canadian legal case law website CanLII? What could it do better, or do you think it's done well? Is it overdue for innovation?

DamonHD over 1 year ago

Would it be worth getting your corpus replicated into other venues as well, such as the Internet Archive or on GitHub itself?

jfil over 1 year ago

Mrirazak1 over 1 year ago

It would be lovely, I think, if you used ML to help people ask questions, so they can have more accessible law at their hands and understand what lots of things mean. This can be applied to multiple countries around the world. The world of laws at your hands. It’s an interesting concept.

darcys22 over 1 year ago

So good! It's crazy how legal information is such a spread-out mess. What's worse is that git is such a perfect solution for legislation.

Syeposxr over 1 year ago

Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to re-licence it as CC BY 4.0?

thomasfromcdnjs over 1 year ago

Incredible effort. These types of projects have the potential to influence a nation.

ulrischa over 1 year ago

I think there is huge potential for digitalisation in law-related subjects. In Germany the law texts are online, but the paragraphs are not linked.

mediumsmart over 1 year ago

Great work, thank you so much. FWIW, for getting formatted text from HTML, did you try

    lynx --dump url >> file.plaintext
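For anyone who would rather stay inside Python than shell out to lynx, here is a minimal sketch of the same idea using only the standard library's html.parser. It is not how the corpus was actually built. The class name TextExtractor, the helper html_to_text, and the choice of which tags count as block-level are all assumptions made for illustration, and the output is much plainer than what lynx produces (no link footnotes, no table layout).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; the tag sets below are an
    arbitrary, incomplete choice.
    """

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            # Start block-level elements on a new line.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside <script>/<style>.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<h1>Example Act</h1><p>Section 1: be <b>kind</b>.</p>"
    print(html_to_text(sample))
```

Real legislation pages tend to have nested tables, footnotes and numbering that a sketch like this flattens, so treat it as a starting point rather than a replacement for a proper renderer.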

smcleod over 1 year ago

That's really neat. Such a shame VIC couldn't be included though.

danielmarkbruce over 1 year ago

Insanely great. Amazing work.

nextworddev over 1 year ago

Is there a U.S. equivalent?

tamarlikesdata over 1 year ago

Nice. Thanks for sharing.

subhashp over 1 year ago

Well done!
