Show HN: Mandarin Word Segmenter with Translation

48 points by routerl 3 months ago

I've built mandoBot, a web app that segments and translates Mandarin Chinese text. This is a Django API (using Django-Ninja and PostgreSQL) and a Next.js front-end (with TypeScript and Chakra). For a sample of what this app does, head to https://mandobot.netlify.app/?share_id=e8PZ8KFE5Y. This is my presentation of the first chapter of a classic story from the Republican era of Chinese fiction, Diary of a Madman by Lu Xun. Other chapters are located in the "Reading Room" section of the app.

This app exists because reading Mandarin is very hard for learners (like me), since Mandarin text does not separate words using spaces in the same way Western languages do. But extensive reading is the most effective way to learn vocabulary and grammar. Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

I'm solving this problem by allowing users to input Mandarin text, which is then computationally segmented and machine translated by my server, which also adds dictionary definitions for each word and character. The hard part is the segmentation: it turns out that "Chinese Word Segmentation"[0] is the central problem in Chinese Natural Language Processing; no current solutions reach 100% accuracy, whether they're from Stanford[1], Academia Sinica[2], or Tsinghua University[3]. This includes every LLM currently available.

I could talk about this for hours, but the bottom line is that this app is a way to develop my full-stack skills; the backend should be fast, accurate, secure, well-tested, and well-documented, and the front-end should be pretty, secure, well-tested, responsive, and accessible.

I am the sole developer, and I'm open to any comments and suggestions: roberto.loja+hn@gmail.com

Thanks HN!

[0] https://en.wikipedia.org/wiki/Chinese_word-segmented_writing

[1] https://nlp.stanford.edu/software/segmenter.shtml

[2] https://ckip.iis.sinica.edu.tw/project/ws

[3] http://thulac.thunlp.org/

14 comments

gwd 3 months ago

> Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

That's not been my experience at all: As long as the content I'm reading is at the right level, I've been able to learn to segment as my vocabulary has grown, and there's always only a few new words that I haven't learned how to recognize in-context yet. Having a good built-in dictionary wherever you're reading (e.g., a Chrome plugin, or Pleco, or whatever) has been helpful here.

My fear would be that the longer you put off learning to segment in your head, the harder it will be.

My advice for this would be that you present the text as you'd normally see it (i.e., no segmentation), but add aids to help learners see or understand the segmentation. At the very least you could have the dictionary pop-up be on the level of the full segment, rather than individual characters; and you could consider having it so that as you mouse over a character, it draws a little line under / border around the characters in the same segment. That could allow you to give your brain the little hint it needs to "see" the words segmented "in situ".
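The hover aid suggested above needs one small lookup: given the segmenter's output and the index of the character under the cursor, find the span of characters to underline. A minimal sketch of that lookup follows, written in Python for illustration only (the app's front end is TypeScript). The function name `segment_span` and the sample segments are assumptions made for this example, not part of mandoBot.

```python
from bisect import bisect_right
from itertools import accumulate

def segment_span(segments, char_index):
    """Return (start, end) offsets of the segment that contains char_index.

    `segments` is the segmenter's output in reading order; `end` is exclusive,
    so text[start:end] is the word to underline.
    """
    # Exclusive end offset of each segment, e.g. ["我", "喜欢"] -> [1, 3].
    ends = list(accumulate(len(s) for s in segments))
    if not ends or not 0 <= char_index < ends[-1]:
        raise IndexError("char_index is outside the text")
    # The first segment whose end offset is greater than char_index.
    k = bisect_right(ends, char_index)
    start = ends[k - 1] if k else 0
    return start, ends[k]

# Hypothetical segmenter output for the unsegmented text "我喜欢学习中文".
SEGMENTS = ["我", "喜欢", "学习", "中文"]
TEXT = "".join(SEGMENTS)

start, end = segment_span(SEGMENTS, 2)  # cursor over "欢"
print(TEXT[start:end])                  # prints 喜欢
```

The text itself stays unsegmented on screen; only the returned offsets drive the underline or border.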

imron 3 months ago

Nice work OP.

I've done a fair amount of Chinese language segmentation programming - and yeah it's not easy, especially as you reach for higher levels of accuracy.

You need to put in significant amounts of effort just for increases in accuracy of less than a few percentage points.

For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a first longest match algorithm.

It has a relatively high error rate, but it's acceptable if you're only looking for the first few hundred frequently used words.

What segmenter are you using, or have you developed your own?
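For readers unfamiliar with the approach, here is a minimal Python sketch of forward longest-match segmentation feeding a frequency count. It is not the commenter's actual tool: the function names, the `max_len` cap, and both toy lexicons are assumptions made for illustration.

```python
from collections import Counter

def segment_longest_match(text, lexicon, max_len=4):
    """Greedy forward segmentation: at each position take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop cannot stall.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

def top_words(text, lexicon, n=3):
    """Most frequent segments in `text` under longest-match segmentation."""
    return Counter(segment_longest_match(text, lexicon)).most_common(n)

# Hypothetical toy lexicons; a real one would come from a dictionary such as CC-CEDICT.
LEXICON = {"我", "喜欢", "学习", "中文"}
print(segment_longest_match("我喜欢学习中文", LEXICON))
# ['我', '喜欢', '学习', '中文']

# A well-known failure case: the greedy match grabs 研究生 ("graduate student")
# where the intended reading is 研究 / 生命 / 起源.
AMBIGUOUS = {"研究", "研究生", "生命", "起源"}
print(segment_longest_match("研究生命起源", AMBIGUOUS))
# ['研究生', '命', '起源']
```

The second call shows where the error rate comes from: the algorithm never backtracks, so an early long match can force a wrong split later. For frequency counting over a large corpus, such errors tend to matter less than they would for a reading aid.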

greyman 3 months ago

OP, thank you for your work, I will continue to watch it.

I tried to build something similar, but what I didn't figure out, and what I think is crucial, is the proper FE: yes, word segmenting is useful, but if I have to click on each word to see its meaning, then for how I learn Chinese by reading texts, I still find the Zhongwen Chrome extension more useful, since I see the English meaning quicker, just by hovering the cursor over the word.

In my project, I was trying to display the English translation under each Chinese word, which I think would require AI to determine the correct translation, since one cannot just put the CC-CEDICT entry there.

P.S.: I don't know how you built your dictionary; it translated 气功师 as "Aerosolist", which I am not sure what that is exactly, but this should actually be two words, not one - correct segmentation and translation is 气功师, "qigong master".

rahimnathwani 3 months ago

This is cool. If you haven't already, you might like to take a look at Du Chinese and The Chairman's Bao. They might provide ideas or inspiration.

Also the 'clip reader' feature in Pleco is decent.

Also, supporting simplified as well as traditional might increase your potential audience.

cannam 3 months ago

This was my attempt at doing something a little bit like it, 27 years ago. It's mostly interesting as a historical artifact - certainly yours is a lot more sophisticated and much much prettier! This one just does greedy matching against CEDICT.

https://all-day-breakfast.com/chinese/

What is kind of interesting is that the script itself (a single Perl CGI script) has survived the passage of time better than the text documenting it.

Besides all the broken links, the text refers throughout to Big-5 encoding, and the form at https://all-day-breakfast.com/chinese/big5-simple.html has a warning that the popups only work in Netscape or MSIE 4. You can now ignore all of that because browsers are more encoding aware (it still uses Big-5 internally but you can paste in Unicode) and the popups work anywhere.

thaumasiotes 3 months ago

> Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

That's not true at all; you can go a long way just by clicking on characters in Pleco, and Pleco's segmentation algorithm is awful. (Specifically, it's greedy "find the longest substring starting at the selected character for which a dictionary entry exists".)

Sometimes I go back through very old conversations in Chinese and notice that I completely misunderstood something. That's an unfortunate but normal part of the language-learning process. You don't need full comprehension to learn. What would babies do?
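The lookup described in that parenthesis can be sketched in a few lines of Python. This follows the commenter's one-sentence description only and is not Pleco's actual code; `longest_entry_at`, the `max_len` cap, and the toy dictionary with its English glosses are assumptions made for illustration.

```python
def longest_entry_at(text, start, dictionary, max_len=8):
    """Return (headword, gloss) for the longest entry beginning at text[start], or None."""
    limit = min(max_len, len(text) - start)
    # Longest candidate first, so the first hit is the greedy answer.
    for size in range(limit, 0, -1):
        candidate = text[start:start + size]
        if candidate in dictionary:
            return candidate, dictionary[candidate]
    return None

# Hypothetical toy dictionary; the glosses are illustrative.
TOY_DICT = {
    "气": "gas; air",
    "气功": "qigong",
    "气功师": "qigong master",
    "师": "teacher; master",
}

sentence = "他是气功师"
print(longest_entry_at(sentence, 2, TOY_DICT))  # ('气功师', 'qigong master')
print(longest_entry_at(sentence, 4, TOY_DICT))  # ('师', 'teacher; master')
print(longest_entry_at(sentence, 0, TOY_DICT))  # None, since 他 is not in the toy dictionary
```

Because the match always starts at the selected character, clicking in the middle of a word returns an entry for its tail rather than for the whole word. That is one reason the approach works as a click-to-look-up aid while being a weak segmenter.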

rasulkireev 3 months ago

You should add it to Built with Django - https://builtwithdjango.com/projects/new/

carom 3 months ago

Did you find the library jieba? That is what I am using for segmentation. It seems to work fine on simplified despite not advertising it.

mindvirus 3 months ago

This is great. I'd love it for flashcard creation - paste in a block of text I'm reading and extract vocabulary from it.

sarabande 3 months ago

纔 in this case should use the definition of 才 (cai2), not (shan1), which is extremely uncommon. Otherwise, cool app!

georgeplusplus 3 months ago

Have you used the app Pleco?

That app has been invaluable to me as someone learning Chinese.

That app breaks down Mandarin sentences into individual characters. I believe it's made by a Taiwanese developer too.

I tried your app with a few sentences and it works really well!

bnly 3 months ago

Nicely done, this looks quite useful!

maxglute 3 months ago

Very well executed.

hassleblad23 3 months ago

Great work OP.
