Another approach is to explore the format through Apple's tools for building dictionaries: Apple provides a "Dictionary Development Kit" in Xcode's downloadable "Additional Tools" package (which has documentation for the XML format and a bunch of scripts/binaries for building the bundle).<p>I wound up doing this a while ago for a similar toy project. After some poking around, it turned out that dictionary bundles are entirely supported by system APIs in CoreServices! The APIs are private, but Apple accidentally shipped a header file with documentation for them in the 10.7 SDK [1]. You can load a dictionary with `IDXCreateIndexObject()`, read through its indices with the search methods (and the convenient `kIDXSearchAllMatch`), and get pointers to its entry data with `IDXGetFieldDataPtrs()`.<p>It takes a bit of fiddling to figure out the structure (there are multiple indices for headwords, search keywords, cross-references, etc., and the API is a general-purpose trie library) and to request the right fields, but those property lists in the bundle are there to help! (As the author of this article discovered, the entries are compressed and are preceded by a 4-byte length marker.)<p>[1] <a href="https://github.com/phracker/MacOSX-SDKs/blob/master/MacOSX10.7.sdk/usr/include/IndexedSearch.h" rel="nofollow">https://github.com/phracker/MacOSX-SDKs/blob/master/MacOSX10...</a>
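For anyone who wants to poke at the raw bytes rather than the private API, the "compressed entry preceded by a 4-byte length marker" layout mentioned above can be sketched roughly like this. This is only an illustration under assumptions: that the length is an unsigned little-endian integer, that each entry is a plain zlib stream, and that records sit back to back. None of that is confirmed here, and `iter_entries` is a made-up helper name, not part of any Apple API.

```python
import struct
import zlib


def iter_entries(blob: bytes, byteorder: str = "<"):
    """Yield decompressed entries from a run of length-prefixed records.

    Assumed (hypothetical) layout: [4-byte length][zlib data] repeated.
    Pass byteorder=">" to try big-endian lengths instead.
    """
    offset = 0
    while offset + 4 <= len(blob):
        # Read the 4-byte length marker that precedes each entry.
        (length,) = struct.unpack_from(byteorder + "I", blob, offset)
        offset += 4
        chunk = blob[offset:offset + length]
        if len(chunk) < length:
            raise ValueError("truncated entry at offset %d" % offset)
        # Assumption: the payload is an ordinary zlib stream.
        yield zlib.decompress(chunk)
        offset += length
```

If the lengths come out absurdly large, the byte order or the starting offset is probably wrong, so it is worth trying the other endianness before giving up.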
For anyone else needing to tackle something like this, it's definitely worth checking out Binwalk [1]. It is meant for extracting firmware, but it works decently well on most files-in-files type data formats.<p>[1] <a href="https://github.com/ReFirmLabs/binwalk" rel="nofollow">https://github.com/ReFirmLabs/binwalk</a>
It seems highly likely (given that this is a dictionary and requires fast lookups) that you're reverse engineering something like CEVFS (i.e. a virtual file system for compressing a database). That would explain why the dictionary is broken into chunks: these are the compressed pages of the database.
Another way to extract all of those compressed zip files would have been to use binwalk.<p><pre><code> binwalk --dd='.*' *.asset
</code></pre>
Edit: should have read the comments before I posted this, because enragedcacti already mentioned this tool an hour ago.
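If binwalk isn't at hand, the same basic idea can be approximated in a few lines. This is a rough sketch, not how binwalk is actually implemented (it uses a much larger signature database). It assumes the embedded chunks are ordinary zlib streams starting with one of the common two-byte headers, and `carve_zlib_streams` is a hypothetical helper name.

```python
import zlib

# Common second bytes of a zlib header when the first byte is 0x78.
ZLIB_SECOND_BYTES = {0x01, 0x5E, 0x9C, 0xDA}


def carve_zlib_streams(data: bytes):
    """Return (offset, payload) for each zlib stream found in data."""
    found = []
    pos = 0
    while pos < len(data) - 1:
        if data[pos] == 0x78 and data[pos + 1] in ZLIB_SECOND_BYTES:
            decomp = zlib.decompressobj()
            try:
                payload = decomp.decompress(data[pos:])
            except zlib.error:
                # False positive: the bytes only looked like a header.
                pos += 1
                continue
            if decomp.eof:
                found.append((pos, payload))
                # Skip past the bytes this stream actually consumed.
                pos += len(data) - pos - len(decomp.unused_data)
                continue
        pos += 1
    return found
```

Reading a file with `open(path, "rb").read()` and passing the bytes in gives the offsets of each stream, which is handy for working out what sits between the chunks.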
Thank you for posting this code on GitHub! There has been some reverse-engineering done on the language dictionaries bundled with macOS, and it's nice to know that the same model is being used on the Apple Watch! I look forward to seeing your dictionary app.<p><a href="https://josephg.com/blog/reverse-engineering-apple-dictionaries/" rel="nofollow">https://josephg.com/blog/reverse-engineering-apple-dictionar...</a><p>There's also a command-line tool that can query the dictionary:<p><a href="https://github.com/takumakei/osx-dictionary" rel="nofollow">https://github.com/takumakei/osx-dictionary</a><p>Something I haven't yet reverse-engineered is Apple's word segmentation. I can get the word breaks in Chinese by pressing option + right arrow + space, repeatedly. But I have no idea how the backend for that works.
I have thought about building a vocabulary learning tool for learning Japanese on top of Apple Dictionary. My idea is simple: the user collects dictionary items and the tool offers lookup / spaced repetition.<p>However, I'm concerned that the dictionary is copyrighted. Is there any precedent that says whether such a tool would be legal/illegal?
This was an amazingly delightful read. I was hoping it would shed some light on a long-time project of mine: reversing the Oxford language dictionaries that were included on CD-ROM with the big printed texts. (I already did it, but with a debugger instead of by reversing the binary format.) Alas, it did not, but it was super encouraging to see the enthusiasm and interest in language dictionaries.
It would be really cool if someone could make a small script to convert these into a format understandable by dict://<p><a href="https://en.wikipedia.org/wiki/DICT" rel="nofollow">https://en.wikipedia.org/wiki/DICT</a>
That's awesome!<p>But be aware (i.e. "beware") that Apple can pull the rug out from under unpublished APIs without warning.<p>I have been caught out by this myself.
This is fun. I have another idea: I'd be interested in calling Siri from the command line, even if that means using private APIs (but without hacky fake drivers or accessibility tools).
I assume one reason Apple has made it more challenging to extract the dictionary resources is to satisfy licensing constraints with the dictionary authors. I wonder if they'd block an app like this through the App Store submission process, if submitted.
At the moment, I'm using a slight modification of the gist plus `sed 's/<[^>]*>//g'` to look up words from the shell on Linux. It would be nice to have some XML parsing into plain text, but it kind of works.
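For the "XML parsing into plain text" part, a slightly more robust alternative to the `sed` one-liner might look like the sketch below. It uses Python's lenient `html.parser` rather than a strict XML parser, on the assumption that the extracted entries are HTML-like markup that may not have their namespaces declared. `entry_to_text` is just an illustrative name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes, ignoring all tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def entry_to_text(markup: str) -> str:
    """Strip tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Unlike the regex, this also decodes entities such as `&amp;` and does not get confused by a `>` inside an attribute value.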
I also found this Dictionary API, which imports the dictionaries into Node.js using a utility called "dedict".<p><a href="https://github.com/nikvdp/dictionary-api/blob/master/convertDicts.js" rel="nofollow">https://github.com/nikvdp/dictionary-api/blob/master/convert...</a>
Unfortunately, it is not possible to implement a dynamic add-on (an added source) for Dictionary.app, like the built-in Wikipedia one, for example to query urbandictionary.com; only static offline dictionaries are possible. I tried to investigate this years ago, and I don't think anything has changed since then.
Has anyone reverse-engineered the Apple emoji dictionary that maps some keywords to emojis? Last time I checked, they only shipped binaries on the newest macOS. I would love to use that mapping to improve search in my custom emoji picker.