Show HN: how I built the largest open database of Australian law

172 points by ubutler, over 1 year ago

22 comments

ubutler, over 1 year ago
Hey HN, over the past year, I've been working on building the Open Australian Legal Corpus, the largest open database of Australian law. I started this project when I realised there were no open databases of Australian law I could use to train an LLM on.

In this article, I run through the entire process of how I built my database, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.

My hope is that the next time someone like me is interested in training an LLM to solve legal problems, they won't have to go down a year-long journey of trying to find the right data!

You can find my database on HuggingFace (https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) and the code used to create it on GitHub (https://github.com/umarbutler/open-australian-legal-corpus-creator).
juliangamble, over 1 year ago
Australia has had free, searchable collections of Australian law for 25+ years. AustLII is a prime example. There are Federal and State collections as well. The author was conscientious enough to read the scraping policy (or was blocked by anti-scraping tools), which kept him from feeding one of these sites into his LLM.
cookie_monsta, over 1 year ago
This is cool, and I'm a little surprised to see that Victoria is the one dragging the chain here. Is DataVic just talk, or does that not apply to law for some reason?
freefaler, over 1 year ago
Great work, and congratulations on your tenacity in dealing with bureaucrats. Open access and machine-readable formats should be widely available.
anakaine, over 1 year ago
Good work reaching out to, and trying to get along with, Australian government departments. As a fellow Australian, and one employed in government, I can very much say that many people in charge of operating these systems should not be.
jfil, over 1 year ago
I'm floored by what you accomplished in a year. Here in Canada, case law is under an iron grip by CanLII, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing. It's refreshing to hear that one person can make a difference in a very similar setting (Australia).
emmelaich, over 1 year ago
Nice, but I have to ask: how does it compare with https://austlii.edu.au/, especially for completeness?

The Australasian Legal Information Institute is a great resource and yet seems strangely unknown (to the wider public at least).

Trivia: the only reason I found out about it was that I did some work for an Aus govt agency and found out that they shared their web site with AustLII! This was back in the early 2000s.
RagnarD, over 1 year ago
In my experience, one of the devilish details is continuously keeping such a database updated. Without a set of common standards among the various governments, they can capriciously change URLs, formatting, and other details, which may make it somebody's full-time job to keep it accurate and always up to date. Of course, not all use cases will require that, but many will.
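To make the maintenance problem concrete, here is a minimal sketch of one way changes between two crawls could be detected. It is not taken from the article or the thread: the function names `fingerprint` and `diff_snapshots` are hypothetical, and it assumes each document has some stable identifier that survives URL changes, which real sources may not provide.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a document's text after collapsing runs of whitespace,
    so that pure whitespace reformatting does not register as a change."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawls, each mapping a document ID (assumed stable)
    to the fingerprint of that document's text."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(doc_id for doc_id in shared if old[doc_id] != new[doc_id]),
    }
```

Only documents reported as added or changed would need to be downloaded and processed again. Keying on an identifier rather than a URL is the design choice that matters here, since URLs are exactly what the comment warns can change without notice.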
Obscurity4340, over 1 year ago
What do you think of the Canadian case law website CanLII? What could it do better, or do you think it's done well?

Is it overdue for innovation?
DamonHD, over 1 year ago
Would it be worth getting your corpus replicated into other venues as well, such as the Internet Archive or GitHub itself?
jfil, over 1 year ago
I'm floored by what you accomplished in a year. Here in Canada, case law is under an iron grip by CanLII, LexisNexis and Thomson Reuters (Westlaw). As far as I'm aware, there is no truly open digital dataset that the public can use. This situation suits all the players involved and it isn't changing. It's refreshing to hear that one person can make a difference in a very similar environment (Australia).
Mrirazak1, over 1 year ago
It would be lovely, I think, if you used ML to help people ask questions, so they can have more accessible law at their hands and understand what lots of things mean.

This can be applied to multiple countries around the world. The world of laws at your hands.

It's an interesting concept.
darcys22, over 1 year ago
So good! It's crazy how legal information is such a spread-out mess.

What's worse is that git is such a perfect solution for legislation.
Syeposxr, over 1 year ago
Could you explain how the majority of your corpus is under CC BY 4.0? I realise that's the licence you have picked on HuggingFace, but if the source data was not already CC BY 4.0, how are you able to re-licence it as CC BY 4.0?
thomasfromcdnjs, over 1 year ago
Incredible effort.

These types of projects have the potential to influence a nation.
ulrischa, over 1 year ago
I think there is huge potential for digitalisation in law-related subjects. In Germany the legal texts are online, but the paragraphs are not linked.
mediumsmart, over 1 year ago
Great work, thank you so much.

FWIW, for getting formatted text from HTML, did you try:

lynx -dump url >> file.plaintext
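
For anyone who would rather stay in Python than shell out to lynx, a rough equivalent can be sketched with the standard library's `html.parser`. This is only an illustration, not the approach the corpus's author used; the names `TextExtractor` and `html_to_text` are made up for this example, and the set of tags treated as line breaks is an assumption. It will not reproduce lynx's layout of tables or links.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Tags assumed to start or end a line of text (an illustrative choice).
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse internal whitespace and drop lines that end up empty.
    cleaned = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in cleaned if line)
```

Calling `html_to_text(page_source)` on a downloaded page gives plain text that can be appended to a file, much like the lynx one-liner.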
smcleod, over 1 year ago
That's really neat. Such a shame VIC couldn't be included though.
danielmarkbruce, over 1 year ago
Insanely great. Amazing work.
nextworddev, over 1 year ago
Is there a U.S. equivalent?
tamarlikesdata, over 1 year ago
Nice. Thanks for sharing.
subhashp, over 1 year ago
Well done!