Hey HN,
Over the past year, I’ve been building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law that I could use to train an LLM.<p>In this article, I run through the entire process of how I built my database, from months-long negotiations with governments, to reverse engineering ancient web technologies, to hacking together a multitude of different solutions for extracting text from documents.<p>My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won’t have to spend a year trying to find the right data!<p>You can find my database on HuggingFace (<a href="https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus" rel="nofollow noreferrer">https://huggingface.co/datasets/umarbutler/open-australian-l...</a>) and the code used to create it on GitHub (<a href="https://github.com/umarbutler/open-australian-legal-corpus-creator">https://github.com/umarbutler/open-australian-legal-corpus-c...</a>).