Hey HN,
Over the past year, I’ve been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.<p>In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.<p>My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go on a year-long journey to find the right data!<p>You can find my database on HuggingFace (<a href="https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus" rel="nofollow noreferrer">https://huggingface.co/datasets/umarbutler/open-australian-l...</a>) and the code used to create it on GitHub (<a href="https://github.com/umarbutler/open-australian-legal-corpus-creator">https://github.com/umarbutler/open-australian-legal-corpus-c...</a>).
Australia has had free, searchable collections of Australian law for 25+ years. AustLII is a prime example. There are federal and state collections as well.
The author was either conscientious enough to respect the scraping policy or was blocked by anti-scraping tools, which is what kept him from feeding one of these sites into his LLM.
This is cool and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?
Good work reaching out to, and trying to get along with, Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many of the people in charge of operating these systems should not be.
I'm floored by what you accomplished in a year. Here in Canada, case law is held in an iron grip by CanLII, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing - it's refreshing to hear that one person can make a difference in a very similar setting (Australia).
Nice, but I have to ask: how does it compare with <a href="https://austlii.edu.au/" rel="nofollow noreferrer">https://austlii.edu.au/</a>, especially for completeness?<p>The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public at least).<p>Trivia: the only reason I found out about it was that I did some work for an Aus govt agency and learned that they shared their web site with AustLII! This was back in the early 2000s.
In my experience, one of the devilish details is continuously keeping such a database updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.
What do you think of the Canadian case law website CanLII? What could it do better, and what do you think it's done well?<p>Is it overdue for innovation?
It would be lovely, I think, if you used ML to help people ask questions, so that they have more accessible law at their fingertips and can understand what many legal terms mean.<p>This can be applied to multiple countries around the world. The world of laws at your fingertips.<p>It’s an interesting concept.
Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to relicense it as CC BY 4.0?