Hey HN,
I'm excited to share one of my most ambitious projects yet, EmuBert.<p>EmuBert is the largest and <i>most accurate</i> open-source masked language model for Australian law.<p>EmuBert was trained on 180,000 laws, regulations and decisions (1.4 billion tokens) spanning six Australian jurisdictions, drawn from the Open Australian Legal Corpus, the largest open-source database of Australian law. That makes it well suited for tasks like:
⦁ Text classification;
⦁ Name extraction;
⦁ Question answering;
⦁ Text similarity;
⦁ Semantic search; and
⦁ Text embedding.<p>What's more, despite being trained only to guess missing words, EmuBert seems to have picked up facts such as that Norfolk Island is an Australian territory (try the prompt, 'Norfolk Island is an Australian <mask>.'), that it is Section 51 of the Constitution that grants Parliament the power to make laws for the peace, order, and good government of the Commonwealth ('Section <mask> of the Constitution grants the Australian Parliament the power to make laws for the peace, order, and good government of the Commonwealth.'), and that the representative of the monarch of Australia is the Governor-General ('The representative of the monarch of Australia is the <mask>-General.').<p>Finally, EmuBert achieves a perplexity of 2.05 on Open Australian Legal QA, the first open dataset of Australian legal questions and answers, outperforming all known state-of-the-art masked language models, including RoBERTa, BERT and LEGAL-BERT.<p>You can check out EmuBert on Hugging Face here: <a href="https://huggingface.co/umarbutler/emubert" rel="nofollow">https://huggingface.co/umarbutler/emubert</a><p>The code I used to create EmuBert is also openly available on GitHub: <a href="https://github.com/umarbutler/emubert-creator">https://github.com/umarbutler/emubert-creator</a>
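If you'd like to try those prompts yourself, here's a minimal sketch using Hugging Face's fill-mask pipeline (this assumes you have `transformers` and a backend like PyTorch installed; the model name is the one linked above):

```python
from transformers import pipeline

# Load EmuBert as a fill-mask pipeline (downloads the model on first run).
fill_mask = pipeline('fill-mask', model='umarbutler/emubert')

# Use the tokenizer's own mask token rather than hardcoding '<mask>'.
mask = fill_mask.tokenizer.mask_token

for prompt in (
    f'Norfolk Island is an Australian {mask}.',
    f'The representative of the monarch of Australia is the {mask}-General.',
):
    # The pipeline returns candidates sorted by probability; take the top one.
    top = fill_mask(prompt)[0]
    print(prompt, '->', top['token_str'].strip(), f"({top['score']:.2f})")
```

Each result is a dict with the predicted token (`token_str`), its probability (`score`) and the fully filled-in sentence (`sequence`), so it's easy to inspect more than just the top guess.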