I saw a few posts discuss using the Wiktionary dump directly instead of the freeDictionary API. That is difficult to do because the raw wiki text isn't immediately usable. Several years ago I created and open sourced a project (which I never publicized) that lexes and parses the Wiktionary dump:<p><a href="https://github.com/vthommeret/glossterm" rel="nofollow">https://github.com/vthommeret/glossterm</a><p>Specifically, it can understand and execute 21 different wiki text templates (e.g. "cog", "borrow", "gloss", "prefix", "qualifier"), such as {{inh|es|la|gelātus}}:<p><a href="https://github.com/vthommeret/glossterm/tree/master/lib/tpl" rel="nofollow">https://github.com/vthommeret/glossterm/tree/master/lib/tpl</a><p>It eventually parses each entry into this structure, which has a list of all definitions (grouped into nouns, adjectives, verbs, adverbs, etc.), etymology, links, and descendants for a given word:<p><a href="https://github.com/vthommeret/glossterm/blob/master/lib/gt/parse.go#L19" rel="nofollow">https://github.com/vthommeret/glossterm/blob/master/lib/gt/p...</a><p>Later stages of the pipeline turned the different relationships into edges that I could load into a graph database and query. This let me, for example, find French, Spanish, and English words that share a Latin root.<p>I ended up parallelizing this specific query using Apache Beam and then dumping the results into Firestore so they could be queried via a web app. Here's an example for the Spanish word helado:<p><a href="https://cognate.app/words/es/helado" rel="nofollow">https://cognate.app/words/es/helado</a><p>Under the "Cognates" section, it knows that helado comes from the Latin root "gelatus", from which English has borrowed the word "gelato".<p>I originally started this project when I was learning Spanish. If you just look up the definition of helado (ice cream), it doesn't necessarily help you learn it. 
But I found that if I could relate it to languages I already knew (e.g. English and French), it was easier to remember. In this case helado is related to gelato, but you won't find that in, say, Google Translate or SpanishDict.<p>Ultimately, I found that while the Wiktionary data is amazing, it's also a bit of a quagmire for finding cognates. I would miss certain etymologies where you had to follow a descendant tree 2 or 3 levels deep, or where a definition just mentioned a word it was related to. But if I expanded the query to include these cases, the number of non-cognates that showed up in the results increased significantly.<p>So I ended up with a useful set of tools (which I never wrote about until now), but I realized that the end goal, a web UI showing the relationships between words, would require a significant investment in data quality that likely wasn't possible without changes to Wiktionary itself or community investment.
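<p>To make the template-to-edge idea above a bit more concrete, here is a minimal Python sketch. It is not glossterm's code (that project is written in Go and does far more). It only handles flat, non-nested templates, the function names are made up for illustration, and the sample "gelato" template is a hypothetical input rather than a copy of the real Wiktionary entry. Stripping the macron from "gelātus" is a crude stand-in for matching the plain form "gelatus".

```python
import unicodedata
from collections import defaultdict

# Assumption: a tiny subset of etymology templates whose first three
# positional arguments are (language, source language, source word).
ETYMOLOGY_TEMPLATES = {"inh", "bor", "der"}


def parse_template(text):
    """Split a flat template such as {{inh|es|la|gelātus}} into (name, args)."""
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError(f"not a template: {text!r}")
    name, *args = text[2:-2].split("|")
    return name, args


def strip_diacritics(word):
    """Drop combining marks, so 'gelātus' becomes 'gelatus'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def etymology_edge(word, template):
    """Return ((lang, word), (source_lang, source_word)), or None if not applicable."""
    name, args = parse_template(template)
    if name not in ETYMOLOGY_TEMPLATES or len(args) < 3:
        return None
    lang, source_lang, source_word = args[:3]
    return (lang, word), (source_lang, strip_diacritics(source_word))


def build_index(entries):
    """Group words by the root their etymology template points at."""
    by_root = defaultdict(set)
    for word, template in entries:
        edge = etymology_edge(word, template)
        if edge is not None:
            descendant, root = edge
            by_root[root].add(descendant)
    return by_root


def cognates(by_root, lang, word):
    """Find other words that share at least one direct root with (lang, word)."""
    found = set()
    for descendants in by_root.values():
        if (lang, word) in descendants:
            found |= descendants - {(lang, word)}
    return found


# Illustrative input only; the second template is hypothetical.
SAMPLE = [
    ("helado", "{{inh|es|la|gelātus}}"),
    ("gelato", "{{der|en|la|gelātus}}"),
]

if __name__ == "__main__":
    index = build_index(SAMPLE)
    print(cognates(index, "es", "helado"))  # {('en', 'gelato')}
```

This only follows a single hop from a word to its root. The harder cases mentioned above (descendant trees 2 or 3 levels deep, or relations mentioned only in a definition) would need a multi-hop traversal, which is also where the non-cognate noise starts to creep in.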